pcre

TriggerTek Logo
abcdefghijklmnopqrstuvwxyz_
PCRE(3)								      PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

INTRODUCTION

       The  PCRE library is a set of functions that implement regular expres-
       sion pattern matching using the same syntax  and	 semantics  as	Perl,
       with just a few differences. (Certain features that appeared in Python
       and PCRE before they appeared in Perl are  also	available  using  the
       Python syntax.)

       The  current implementation of PCRE (release 7.x) corresponds approxi-
       mately with Perl 5.10, including support for UTF-8 encoded strings and
       Unicode	general	 category properties. However, UTF-8 and Unicode sup-
       port has to be explicitly enabled; it is not the default. The  Unicode
       tables correspond to Unicode release 5.0.0.

       In addition to the Perl-compatible matching function, PCRE contains an
       alternative matching function that matches the same compiled  patterns
       in a different way. In certain circumstances, the alternative function
       has some advantages. For a discussion of the two matching  algorithms,
       see the pcrematching page.

       PCRE  is	 written in C and released as a C library. A number of people
       have written wrappers and interfaces of various kinds. In  particular,
       Google  Inc.   have  provided a comprehensive C++ wrapper. This is now
       included as part of  the	 PCRE  distribution.  The  pcrecpp  page  has
       details	of  this interface. Other people’s contributions can be found
       in the Contrib directory at the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl regular expression features are and  are
       not  supported  by  PCRE	 are  given  in	 separate  documents. See the
       pcrepattern and pcrecompat pages. There is a  syntax  summary  in  the
       pcresyntax page.

       Some  features  of PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it possible  for  a
       client  to  discover  which features are available. The features them-
       selves are described in the pcrebuild page. Documentation about build-
       ing PCRE for various operating systems can be found in the README file
       in the source distribution.

       The library contains a number of undocumented internal  functions  and
       data  tables  that  are used by more than one of the exported external
       functions, but which are not intended for  use  by  external  callers.
       Their  names all begin with "_pcre_", which hopefully will not provoke
       any name clashes. In some environments,	it  is	possible  to  control
       which  external	symbols	 are exported when a shared library is built,
       and in these cases the undocumented symbols are not exported.

USER DOCUMENTATION

       The user documentation for PCRE comprises a number of  different	 sec-
       tions. In the "man" format, each of these is a separate "man page". In
       the HTML format, each is a separate page, linked from the index	page.
       In  the plain text format, all the sections are concatenated, for ease
       of searching. The sections are as follows:

	 pcre		   this document
	 pcre-config	   show PCRE installation configuration information
	 pcreapi	   details of PCRE’s native C API
	 pcrebuild	   options for building PCRE
	 pcrecallout	   details of the callout feature
	 pcrecompat	   discussion of Perl compatibility
	 pcrecpp	   details of the C++ wrapper
	 pcregrep	   description of the pcregrep command
	 pcrematching	   discussion of the two matching algorithms
	 pcrepartial	   details of the partial matching facility
	 pcrepattern	   syntax and semantics of supported
			     regular expressions
	 pcresyntax	   quick syntax reference
	 pcreperform	   discussion of performance issues
	 pcreposix	   the POSIX-compatible C API
	 pcreprecompile	   details of saving and  re-using  precompiled	 pat-
       terns
	 pcresample	   discussion of the sample program
	 pcrestack	   discussion of stack usage
	 pcretest	   description of the pcretest testing command

       In  addition, in the "man" and HTML formats, there is a short page for
       each C library function, listing its arguments and results.

LIMITATIONS

       There are some size limitations in PCRE but it is hoped that they will
       never in practice be relevant.

       The  maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
       is compiled with the default internal linkage size of 2. If  you	 want
       to  process  regular expressions that are truly enormous, you can com-
       pile PCRE with an internal linkage size of 3 or 4 (see the README file
       in  the	source	distribution  and  the	pcrebuild  documentation  for
       details). In these cases the limit is substantially larger.   However,
       the speed of execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There  is  no  limit  to	 the number of parenthesized subpatterns, but
       there can be no more than 65535 capturing subpatterns.

       The maximum length of name for a named subpattern  is  32  characters,
       and the maximum number of named subpatterns is 10000.

       The  maximum length of a subject string is the largest positive number
       that an integer variable can hold. However, when using the traditional
       matching	 function,  PCRE  uses	recursion  to  handle subpatterns and
       indefinite repetition.  This means that the available stack space  may
       limit  the  size	 of a subject string that can be processed by certain
       patterns. For a discussion of stack issues, see the pcrestack documen-
       tation.

UTF-8 AND UNICODE PROPERTY SUPPORT

       From  release  3.3,  PCRE  has  had some support for character strings
       encoded in the UTF-8 format. For release 4.0 this was greatly extended
       to  cover most common requirements, and in release 5.0 additional sup-
       port for Unicode general category properties was added.

       In order process UTF-8 strings, you must build PCRE to  include	UTF-8
       support	in  the	 code, and, in addition, you must call pcre_compile()
       with the PCRE_UTF8 option flag. When you do this, both the pattern and
       any  subject  strings that are matched against it are treated as UTF-8
       strings instead of just strings of bytes.

       If you compile PCRE with UTF-8 support, but do not use it at run time,
       the library will be a bit bigger, but the additional run time overhead
       is limited to testing the PCRE_UTF8 flag occasionally, so  should  not
       be very big.

       If  PCRE	 is  built  with  Unicode  character  property support (which
       implies UTF-8 support), the escape sequences \p{..},  \P{..},  and  \X
       are  supported.	 The available properties that can be tested are lim-
       ited to the general category properties such as Lu for an  upper	 case
       letter  or  Nd  for a decimal number, the Unicode script names such as
       Arabic or Han, and the derived properties Any and L&. A full  list  is
       given in the pcrepattern documentation. Only the short names for prop-
       erties are supported. For example, \p{L} matches a  letter.  Its	 Perl
       synonym,	 \p{Letter},  is  not  supported.  Furthermore, in Perl, many
       properties may optionally be prefixed by "Is", for compatibility	 with
       Perl 5.6. PCRE does not support this.

   Validity of UTF-8 strings

       When  you  set  the PCRE_UTF8 flag, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to  the	rele-
       vant  functions.	 From release 7.3 of PCRE, the check is according the
       rules of RFC 3629, which are themselves derived from the Unicode spec-
       ification.  Earlier  releases  of PCRE followed the rules of RFC 2279,
       which allows the full range of 31-bit values (0	to  0x7FFFFFFF).  The
       current check allows only values in the range U+0 to U+10FFFF, exclud-
       ing U+D800 to U+DFFF.

       The excluded code points are the "Low Surrogate Area" of	 Unicode,  of
       which the Unicode Standard says this: "The Low Surrogate Area does not
       contain any character  assignments,  consequently  no  character	 code
       charts  or  namelists  are  provided  for  this	area.  Surrogates are
       reserved for use with UTF-16 and then must be used in pairs." The code
       points  that  are encoded by UTF-16 pairs are available as independent
       code points in the UTF-8 encoding. (In other words, the	whole  surro-
       gate thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)

       If an invalid  UTF-8  string  is	 passed	 to  PCRE,  an	error  return
       (PCRE_ERROR_BADUTF8)  is	 given.	 In  some situations, you may already
       know that your strings are valid, and therefore	want  to  skip	these
       checks	in   order   to	  improve   performance.   If	you  set  the
       PCRE_NO_UTF8_CHECK flag at compile time or at run time,	PCRE  assumes
       that  the  pattern or subject it is given (respectively) contains only
       valid UTF-8 codes. In this case, it does not diagnose an invalid UTF-8
       string.

       If  you	pass  an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
       what happens depends on why the string is invalid. If the string	 con-
       forms  to the "old" definition of UTF-8 (RFC 2279), it is processed as
       a string of characters in the range 0 to 0x7FFFFFFF. In	other  words,
       apart  from  the initial validity test, PCRE (when in UTF-8 mode) han-
       dles strings according to the more liberal rules of RFC 2279. However,
       if  the	string does not even conform to RFC 2279, the result is unde-
       fined. Your program may crash.

       If you want to process strings of  values  in  the  full	 range	0  to
       0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can
       set PCRE_NO_UTF8_CHECK to bypass the more restrictive  test.  However,
       in this situation, you will have to apply your own validity check.

   General comments about UTF-8 mode

       1.  An  unbraced	 hexadecimal escape sequence (such as \xb3) matches a
       two-byte UTF-8 character if the value is greater than 127.

       2. Octal numbers up to \777 are recognized, and match  two-byte	UTF-8
       characters for values greater than \177.

       3. Repeat quantifiers apply to complete UTF-8 characters, not to indi-
       vidual bytes, for example: \x{100}{3}.

       4. The dot metacharacter matches one UTF-8 character instead of a sin-
       gle byte.

       5.  The escape sequence \C can be used to match a single byte in UTF-8
       mode, but its use can lead to some strange effects. This	 facility  is
       not available in the alternative matching function, pcre_dfa_exec().

       6.  The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
       test characters of any code value, but the characters that PCRE recog-
       nizes  as  digits,  spaces,  or word characters remain the same set as
       before, all with values less than 256. This  remains  true  even	 when
       PCRE  includes Unicode property support, because to do otherwise would
       slow down PCRE in many common cases. If you really want to test for  a
       wider sense of, say, "digit", you must use Unicode property tests such
       as \p{Nd}.

       7. Similarly, characters that match the POSIX named character  classes
       are all low-valued characters.

       8.  However, the Perl 5.10 horizontal and vertical whitespace matching
       escapes (\h, \H, \v, and \V) do	match  all  the	 appropriate  Unicode
       characters.

       9.  Case-insensitive  matching applies only to characters whose values
       are less than 128, unless PCRE is built with Unicode property support.
       Even  when  Unicode property support is available, PCRE still uses its
       own character tables when checking the case of low-valued  characters,
       so as not to degrade performance.  The Unicode property information is
       used only for characters with higher values. Even when  Unicode	prop-
       erty  support  is  available,  PCRE supports case-insensitive matching
       only when there is a one-to-one	mapping	 between  a  letter’s  cases.
       There are a small number of many-to-one mappings in Unicode; these are
       not supported by PCRE.

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam magnet,
       so  I’ve	 taken it away. If you want to email me, use my two initials,
       followed by the two digits 10, at the domain cam.ac.uk.

REVISION

       Last updated: 09 August 2007
       Copyright (c) 1997-2007 University of Cambridge.



								      PCRE(3)