pcre_exec() explained

Author (passion) Updated (2017-07-28) Category (apache)

The prototype of pcre_exec() is:

     int pcre_exec(const pcre *code, 
                 const pcre_extra *extra,  
                 const char *subject, 
                 int length, 
                 int startoffset,            
                 int options, 
                 int *ovector, 
                 int ovecsize);
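
A typical call looks like this:

     int rc;
     int ovector[30];
     rc = pcre_exec(
           re,             /* result of pcre_compile() */
           NULL,           /* result of pcre_study(), which can speed up matching; NULL if the pattern was not studied */
           "some string",  /* the subject string; it may contain \0 bytes */
           11,             /* length of the subject, passed explicitly because the subject may contain \0 */
           0,              /* offset in the subject at which to start matching; PCRE has no equivalent of the /g flag, so finding every match means calling pcre_exec() in a loop and advancing this offset */
           0,              /* default options */
           ovector,        /* output array for the match offsets */
           30);            /* number of elements in ovector */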

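The option bits listed below are the ones documented for pcre_compile(). Of these, pcre_exec() accepts only PCRE_ANCHORED, the PCRE_BSR_xxx and PCRE_NEWLINE_xxx options, PCRE_NO_START_OPTIMIZE, and the PCRE_NO_UTFxx_CHECK options; it additionally accepts match-time options such as PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_PARTIAL. The pcre_compile() option bits are: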

  PCRE_ANCHORED           Force pattern anchoring
  PCRE_AUTO_CALLOUT       Compile automatic callouts
  PCRE_BSR_ANYCRLF        \R matches only CR, LF, or CRLF
  PCRE_BSR_UNICODE        \R matches all Unicode line endings
  PCRE_CASELESS           Do caseless matching
  PCRE_DOLLAR_ENDONLY     $ not to match newline at end
  PCRE_DOTALL             . matches anything including NL
  PCRE_DUPNAMES           Allow duplicate names for subpatterns
  PCRE_EXTENDED           Ignore white space and # comments
  PCRE_EXTRA              PCRE extra features
                            (not much use currently)
  PCRE_FIRSTLINE          Force matching to be before newline
  PCRE_JAVASCRIPT_COMPAT  JavaScript compatibility
  PCRE_MULTILINE          ^ and $ match newlines within data
  PCRE_NEVER_UTF          Lock out UTF, e.g. via (*UTF)
  PCRE_NEWLINE_ANY        Recognize any Unicode newline sequence
  PCRE_NEWLINE_ANYCRLF    Recognize CR, LF, and CRLF as newline
                            sequences
  PCRE_NEWLINE_CR         Set CR as the newline sequence
  PCRE_NEWLINE_CRLF       Set CRLF as the newline sequence
  PCRE_NEWLINE_LF         Set LF as the newline sequence
  PCRE_NO_AUTO_CAPTURE    Disable numbered capturing parentheses
                            (named ones available)
  PCRE_NO_AUTO_POSSESS    Disable auto-possessification
  PCRE_NO_START_OPTIMIZE  Disable match-time start optimizations
  PCRE_NO_UTF16_CHECK     Do not check the pattern for UTF-16
                            validity (only relevant if
                            PCRE_UTF16 is set)
  PCRE_NO_UTF32_CHECK     Do not check the pattern for UTF-32
                            validity (only relevant if
                            PCRE_UTF32 is set)
  PCRE_NO_UTF8_CHECK      Do not check the pattern for UTF-8
                            validity (only relevant if
                            PCRE_UTF8 is set)
  PCRE_UCP                Use Unicode properties for \d, \w, etc.
  PCRE_UNGREEDY           Invert greediness of quantifiers
  PCRE_UTF16              Run in pcre16_compile() UTF-16 mode
  PCRE_UTF32              Run in pcre32_compile() UTF-32 mode
  PCRE_UTF8               Run in pcre_compile() UTF-8 mode

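Return value rc:

rc < 0 indicates an error; the most common value, PCRE_ERROR_NOMATCH (-1), means the pattern did not match. rc == 0 means the match succeeded but ovector was too small to hold every captured substring. rc > 0 means the match succeeded, and rc is one more than the highest-numbered pair that was set, that is, the number of elements returned, counting the whole match.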

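ovector is an array of int whose length should be a multiple of 3 (if it is not, it is rounded down). With a length of 3n, at most n elements are returned, so rc <= n.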

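ovector[0] and ovector[1] hold the offset of the first character of the whole match and the offset just past its last character. For i >= 1, ovector[2*i] and ovector[2*i+1] hold the same pair of offsets for the i-th captured substring, that is, the text captured by the i-th pair of parentheses in the pattern. Groups are numbered by the order in which their opening parentheses appear.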

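For example, matching the pattern /abc((.*)cf(exec))test/ against the subject 11111abcword1cfexectest11111 returns 4 elements, whose offsets occupy ovector[0] through ovector[7]:

Element 0 = abcword1cfexectest

Element 1 = word1cfexec

Element 2 = word1

Element 3 = exec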
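
To make the layout concrete, here is a minimal sketch that turns one pair of offsets back into a string. copy_group is a hypothetical helper name, and the offsets in the usage comment were worked out by hand for the example above rather than produced by an actual pcre_exec() call.

```c
#include <string.h>

/* Hypothetical helper: copies capture group i out of subject using the
   ovector layout described above. Returns the substring length, or -1 if
   the group is unset (both offsets are -1) or buf is too small. */
int copy_group(const char *subject, const int *ovector, int i,
               char *buf, int bufsize)
{
    int start = ovector[2 * i];
    int end   = ovector[2 * i + 1];   /* offset just past the last character */
    if (start < 0 || end < start)
        return -1;
    int len = end - start;
    if (len >= bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}

/* Usage with the example above:
     const char *s = "11111abcword1cfexectest11111";
     int ov[30] = {5, 23, 8, 19, 8, 13, 15, 19};
     char buf[32];
     copy_group(s, ov, 2, buf, (int)sizeof buf);    buf now holds "word1"
*/
```

Because the subject may contain \0 bytes, copying by offset and length as done here is safer than relying on string functions that stop at the first \0.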

The last third of ovector, that is [2n .. 3n-1], is used by pcre_exec() as workspace while matching and does not return any results.

Reference: http://swoolley.org/man.cgi/3/pcreapi
