------------------------------
1. The maximum number of capturing subpatterns is 65535 (documented), but no
check on this was ever implemented. This omission has been rectified; it fixes
ClusterFuzz 14376.
2. Improved the invalid utf32 support of the JIT compiler. Now it correctly
detects invalid characters in the 0xd800-0xdfff range.
3. Fix minor typo bug in JIT compile when \X is used in a non-UTF string.
4. Add support for matching in invalid UTF strings to the pcre2_match()
interpreter, and integrate with the existing JIT support via the new
PCRE2_MATCH_INVALID_UTF compile-time option.
5. Give more error detail for invalid UTF-8 when detected in pcre2grep.
6. Add support for invalid UTF-8 to pcre2grep.
7. Adjust the limit for "must have" code unit searching, in particular,
increase it substantially for non-anchored patterns.
8. Allow (*ACCEPT) to be quantified, because an ungreedy quantifier with a zero
minimum is potentially useful.
9. Some changes to the way the minimum subject length is handled:
* When PCRE2_NO_START_OPTIMIZE is set, no minimum length is computed;
pcre2test now omits this item instead of showing a value of zero.
* An incorrect minimum length could be calculated for a pattern that
contained (*ACCEPT) inside a qualified group whose minimum repetition was
zero, for example /A(?:(*ACCEPT))?B/, which incorrectly computed a minimum
of 2. The minimum length scan no longer happens for a pattern that
contains (*ACCEPT).
* When no minimum length is set by the normal scan, but a first and/or last
code unit is recorded, set the minimum to 1 or 2 as appropriate.
* When a pattern contains multiple groups with the same number, a back
reference cannot know which one to scan for a minimum length. This used to
cause the minimum length finder to give up with no result. Now it treats
such references as not adding to the minimum length (which it should have
done all along).
* Furthermore, the above action now happens only if the back reference is to
a group that exists more than once in a pattern instead of any back
reference in a pattern with duplicate numbers.
10. A (*MARK) value inside a successful condition was not being returned by the
interpretive matcher (it was returned by JIT). This bug has been mended.
11. A bug in pcre2grep meant that -o without an argument (or -o0) didn't work
if the pattern had more than 32 capturing parentheses. This is fixed. In
addition (a) the default limit for groups requested by -o<n> has been raised to
50, (b) the new --om-capture option changes the limit, (c) an error is raised
if -o asks for a group that is above the limit.
12. The quantifier {1} was always being ignored, but this is incorrect when it
is made possessive and applied to an item in parentheses, because a
parenthesized item may contain multiple branches or other backtracking points,
for example /(a|ab){1}+c/ or /(a+){1}+a/.
13. For partial matches, pcre2test was always showing the maximum lookbehind
characters, flagged with "<", which is misleading when the lookbehind didn't
actually look behind the start (because it was later in the pattern). Showing
all consulted preceding characters for partial matches is now controlled by the
existing "allusedtext" modifier and, as for complete matches, this facility is
available only for non-JIT matching, because JIT does not maintain the first
and last consulted characters.
14. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
if the end of the subject was encountered in a lookahead (conditional or
otherwise), an atomic group, or a recursion.
15. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
16. Check for integer overflow when computing lookbehind lengths. Fixes
Clusterfuzz issue 15636.
17. Implemented non-atomic positive lookaround assertions.
18. If a lookbehind contained a lookahead that contained another lookbehind
within it, the nested lookbehind was not correctly processed. For example, if
/(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching
"b".
19. Implemented pcre2_get_match_data_size().
20. Two alterations to partial matching:
(a) The definition of a partial match is slightly changed: if a pattern
contains any lookbehinds, an empty partial match may be given, because this
is another situation where adding characters to the current subject can
lead to a full match. Example: /c*+(?<=[bc])/ with subject "ab".
(b) Similarly, if a pattern could match an empty string, an empty partial
match may be given. Example: /(?![ab]).*/ with subject "ab". This case
applies only to PCRE2_PARTIAL_HARD.
(c) An empty string partial hard match can be returned for \z and \Z as it
is documented that they shouldn't match.
21. A branch that started with (*ACCEPT) was not being recognized as one that
could match an empty string.
22. Corrected pcre2_set_character_tables() tables data type: was const unsigned
char * instead of const uint8_t *, as generated by pcre2_maketables().
23. Upgraded to Unicode 12.1.0.
24. Add -jitfast command line option to pcre2test (to make all the jit options
available directly).
25. Make pcre2test -C show if libreadline or libedit is supported.
26. If the length of one branch of a group exceeded 65535 (the maximum value
that is remembered as a minimum length), the whole group's length was
incorrectly recorded as 65535, leading to incorrect "no match" when start-up
optimizations were in force.
27. The "rightmost consulted character" value was not always correct; in
particular, if a pattern ended with a negative lookahead, characters that were
inspected in that lookahead were not included.
28. Add the pcre2_maketables_free() function.
29. The start-up optimization that looks for a unique initial matching
code unit in the interpretive engines uses memchr() in 8-bit mode. When the
search is caseless, it was doing so inefficiently, which ended up slowing down
the match drastically when the subject was very long. The revised code (a)
remembers if one case is not found, so it never repeats the search for that
case after a bumpalong and (b) when one case has been found, it searches only
up to that position for an earlier occurrence of the other case. This fix
applies to both interpretive pcre2_match() and to pcre2_dfa_match().
30. While scanning to find the minimum length of a group, if any branch has
minimum length zero, there is no need to scan any subsequent branches (a small
compile-time performance improvement).
31. Installed a .gitignore file on a user's suggestion. When using the svn
repository with git (through git svn) this helps keep it tidy.
32. Add underflow check in JIT which may occur when the value of subject
string pointer is close to 0.
33. Arrange for classes such as [Aa] which contain just the two cases of the
same character, to be treated as a single caseless character. This causes the
first and required code unit optimizations to kick in where relevant.
34. Improve the bitmap of starting bytes for positive classes that include wide
characters, but no property types, in UTF-8 mode. Previously, on encountering
such a class, the bits for all bytes greater than \xc4 were set, thus
specifying any character with codepoint >= 0x100. Now the only bits that are
set are for the relevant bytes that start the wide characters. This can give a
noticeable performance improvement.
35. If the bitmap of starting code units contains only 1 or 2 bits, replace it
with a single starting code unit (1 bit) or a caseless single starting code
unit if the two relevant characters are case-partners. This is particularly
relevant to the 8-bit library, though it applies to all. It can give a
performance boost for patterns such as [Ww]ord and (word|WORD). However, this
optimization doesn't happen if there is a "required" code unit of the same
value (because the search for a "required" code unit starts at the match start
for non-unique first code unit patterns, but after a unique first code unit,
and patterns such as a*a need the former action).
36. Small patch to pcre2posix.c to set the erroroffset field to -1 immediately
after a successful compile, instead of at the start of matching to avoid a
sanitizer complaint (regexec is supposed to be thread safe).
37. Add NEON vectorization to JIT to speed up matching of first character and
pairs of characters on ARM64 CPUs.
38. If a non-ASCII character was the first in a starting assertion in a
caseless match, the "first code unit" optimization did not get the casing
right, and the assertion failed to match a character in the other case if it
did not start with the same code unit.
39. Fixed the incorrect computation of jump sizes on x86 CPUs in JIT. A masking
operation was incorrectly removed in r1136. Reported by Ralf Junker.