pcrematching

TriggerTek Logo
abcdefghijklmnopqrstuvwxyz_
PCREMATCHING(3)						      PCREMATCHING(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE MATCHING ALGORITHMS

       This  document  describes the two different algorithms that are avail-
       able in PCRE for matching a  compiled  regular  expression  against  a
       given  subject string. The "standard" algorithm is the one provided by
       the pcre_exec() function.  This works in the same was as Perl’s match-
       ing function, and provides a Perl-compatible matching operation.

       An  alternative algorithm is provided by the pcre_dfa_exec() function;
       this operates in a different way, and is not Perl-compatible.  It  has
       advantages and disadvantages compared with the standard algorithm, and
       these are described below.

       When there is only one possible way in which a  given  subject  string
       can  match  a pattern, the two algorithms give the same answer. A dif-
       ference arises, however, when there are	multiple  possibilities.  For
       example, if the pattern

	 ^<.*>

       is matched against the string

	 <something> <something else> <something further>

       there  are  three  possible answers. The standard algorithm finds only
       one of them, whereas the alternative algorithm finds all three.

REGULAR EXPRESSIONS AS TREES

       The set of strings that are matched by a	 regular  expression  can  be
       represented  as	a tree structure. An unlimited repetition in the pat-
       tern makes the tree of infinite size, but it is still a tree. Matching
       the  pattern  to	 a given subject string (from a given starting point)
       can be thought of as a search of the tree.   There  are	two  ways  to
       search  a tree: depth-first and breadth-first, and these correspond to
       the two matching algorithms provided by PCRE.

THE STANDARD MATCHING ALGORITHM

       In the terminology of Jeffrey Friedl’s book "Mastering Regular Expres-
       sions",	the  standard  algorithm is an "NFA algorithm". It conducts a
       depth-first search of the pattern tree. That is, it proceeds  along  a
       single  path  through the tree, checking that the subject matches what
       is required. When there is a mismatch, the algorithm tries any  alter-
       natives at the current point, and if they all fail, it backs up to the
       previous branch point in the tree,  and	tries  the  next  alternative
       branch  at  that	 level. This often involves backing up (moving to the
       left) in the subject string as well. The	 order	in  which  repetition
       branches	 are  tried is controlled by the greedy or ungreedy nature of
       the quantifier.

       If a leaf node is reached, a matching string has been  found,  and  at
       that point the algorithm stops. Thus, if there is more than one possi-
       ble match, this algorithm returns the first one that it finds. Whether
       this is the shortest, the longest, or some intermediate length depends
       on the way the greedy and ungreedy repetition quantifiers  are  speci-
       fied in the pattern.

       Because	it  ends  up  with  a  single  path  through  the tree, it is
       relatively straightforward for this algorithm to	 keep  track  of  the
       substrings that are matched by portions of the pattern in parentheses.
       This provides support for capturing parentheses and back references.

THE ALTERNATIVE MATCHING ALGORITHM

       This algorithm conducts a breadth-first search of the  tree.  Starting
       from  the  first	 matching  point in the subject, it scans the subject
       string from left to right, once, character by  character,  and  as  it
       does  this, it remembers all the paths through the tree that represent
       valid matches. In Friedl’s terminology, this is a kind of  "DFA	algo-
       rithm",	though	it  is	not implemented as a traditional finite state
       machine (it keeps multiple states active simultaneously).

       The scan continues until either the end of the subject is reached,  or
       there  are no more unterminated paths. At this point, terminated paths
       represent the different matching possibilities (if there are none, the
       match  has  failed).   Thus, if there is more than one possible match,
       this algorithm finds all of them, and  in  particular,  it  finds  the
       longest.	 In  PCRE, there is an option to stop the algorithm after the
       first match (which is necessarily the shortest) has been found.

       Note that all the matches that are found start at the  same  point  in
       the subject. If the pattern

	 cat(er(pillar)?)

       is  matched against the string "the caterpillar catchment", the result
       will be the three strings "cat", "cater", and "caterpillar" that start
       at  the	fourth character of the subject. The algorithm does not auto-
       matically move on to find matches that start at later positions.

       There are a number of features of PCRE regular  expressions  that  are
       not  supported by the alternative matching algorithm. They are as fol-
       lows:

       1. Because the algorithm finds all possible  matches,  the  greedy  or
       ungreedy	 nature of repetition quantifiers is not relevant. Greedy and
       ungreedy quantifiers are treated in exactly  the	 same  way.  However,
       possessive  quantifiers	can make a difference when what follows could
       also match what is quantified, for example in a pattern like this:

	 ^a++\w!

       This pattern matches "aaab!" but not "aaa!", which would be matched by
       a non-possessive quantifier. Similarly, if an atomic group is present,
       it is matched as if it were a standalone pattern at the current point,
       and  the longest match is then "locked in" for the rest of the overall
       pattern.

       2. When dealing with multiple paths through the	tree  simultaneously,
       it is not straightforward to keep track of captured substrings for the
       different matching possibilities, and PCRE’s  implementation  of	 this
       algorithm  does	not  attempt  to do this. This means that no captured
       substrings are available.

       3. Because no substrings are captured, back references within the pat-
       tern are not supported, and cause errors if encountered.

       4.  For the same reason, conditional expressions that use a backrefer-
       ence as the condition or test for a specific group recursion  are  not
       supported.

       5.  Because  many  paths through the tree may be active, the \K escape
       sequence, which resets the start of the match  when  encountered	 (but
       may  be	on some paths and not on others), is not supported. It causes
       an error if encountered.

       6. Callouts are supported, but the value of the capture_top  field  is
       always 1, and the value of the capture_last field is always -1.

       7. The \C escape sequence, which (in the standard algorithm) matches a
       single byte, even in UTF-8 mode, is not supported because the alterna-
       tive  algorithm	moves  through	the subject string one character at a
       time, for all active paths through the tree.

       8. None of the backtracking control verbs such as  (*PRUNE)  are	 sup-
       ported.

ADVANTAGES OF THE ALTERNATIVE ALGORITHM

       Using the alternative matching algorithm provides the following advan-
       tages:

       1. All possible matches (at a single point in the subject)  are	auto-
       matically  found,  and  in  particular, the longest match is found. To
       find more than one match using the standard algorithm, you have to  do
       kludgy things with callouts.

       2. There is much better support for partial matching. The restrictions
       on the content of the pattern that apply when using the standard algo-
       rithm  for partial matching do not apply to the alternative algorithm.
       For non-anchored patterns, the starting position of a partial match is
       available.

       3.  Because  the	 alternative  algorithm scans the subject string just
       once, and never needs to backtrack, it is possible to pass  very	 long
       subject	strings	 to the matching function in several pieces, checking
       for partial matching each time.

DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The alternative algorithm suffers from a number of disadvantages:

       1. It is substantially slower than the  standard	 algorithm.  This  is
       partly  because it has to search for all possible matches, but is also
       because it is less susceptible to optimization.

       2. Capturing parentheses and back references are not supported.

       3. Although atomic groups are supported, their use  does	 not  provide
       the performance advantage that it does for the standard algorithm.

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION

       Last updated: 08 August 2007
       Copyright (c) 1997-2007 University of Cambridge.



							      PCREMATCHING(3)