Regular Expression Syntax Explained (Part 2)

Selected translation

thejoyofcoding 2007-09-13 11:52:00 Blog category: Development-Testing-SCM

\A and \Z are just like "^" and "$", except that they won't match multiple times when the modifier /m is used, while "^" and "$" will match at every internal line separator.
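
The difference between the two kinds of anchors can be sketched with Python's `re` module. This is only an analogy, not TRegExpr itself: `re.MULTILINE` is assumed to play the role of /m, and Python recognises only "\n" as a line separator here.

```python
import re

text = "first\nsecond\nthird"

# With re.MULTILINE (standing in for /m), "^" matches at the start of every line.
caret_hits = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]

# \A still matches only once, at the very start of the input.
a_hits = [m.start() for m in re.finditer(r"\A", text, re.MULTILINE)]

# "$" matches before each "\n" and at the end of the input; \Z only at the end.
dollar_hits = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]
z_hits = [m.start() for m in re.finditer(r"\Z", text, re.MULTILINE)]

print(caret_hits, a_hits, dollar_hits, z_hits)
```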

The "." metacharacter by default matches any character, but if you switch off the modifier /s, "." won't match embedded line separators.

TRegExpr works with line separators as recommended at www.unicode.org (http://www.unicode.org/unicode/reports/tr18/):

"^" matches at the beginning of the input string and, if modifier /m is on, also immediately following any occurrence of \x0D\x0A, \x0A or \x0D (if you are using the Unicode version of TRegExpr, then also \x2028, \x2029, \x0B, \x0C or \x85). Note that there is no empty line within the sequence \x0D\x0A.

"$" matches at the end of the input string and, if modifier /m is on, also immediately preceding any occurrence of \x0D\x0A, \x0A or \x0D (if you are using the Unicode version of TRegExpr, then also \x2028, \x2029, \x0B, \x0C or \x85). Note that there is no empty line within the sequence \x0D\x0A.

"." matches any character, but if you switch off modifier /s, "." doesn't match \x0D\x0A, \x0A or \x0D (if you are using the Unicode version of TRegExpr, then also \x2028, \x2029, \x0B, \x0C or \x85).

Note that "^.*$" (an empty line pattern) does not match the empty string within the sequence \x0D\x0A, but does match the empty string within the sequence \x0A\x0D.
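
The effect of /s on "." can be illustrated with a small Python sketch. It assumes `re.DOTALL` as the counterpart of /s. Unlike the TRegExpr behaviour described above, Python's "." without `re.DOTALL` excludes only "\n", so the example sticks to that separator.

```python
import re

text = "one\ntwo"

# Without DOTALL, "." stops at the line separator.
without_s = re.search(r"one.*", text).group()

# With DOTALL (standing in for /s), "." also consumes the line separator.
with_s = re.search(r"one.*", text, re.DOTALL).group()

print(repr(without_s), repr(with_s))
```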

Multiline processing can easily be tuned for your own purposes with the TRegExpr properties LineSeparators and LinePairedSeparator. You can use only Unix-style separators (\n), only DOS/Windows-style separators (\r\n), mix them together (as described above and used by default), or define your own line separators.

Metacharacters - predefined classes

        \w     an alphanumeric character (including "_")
        \W     a non-alphanumeric character
        \d     a numeric character
        \D     a non-numeric character
        \s     any whitespace character (same as [ \t\n\r\f])
        \S     a non-whitespace character

You may use \w, \d and \s within custom character classes.

      Examples:
      foob\dr     matches strings like 'foob1r', 'foob6r' and so on, but not 'foobar', 'foobbr' and so on
      foob[\w\s]r matches strings like 'foobar', 'foob r', 'foobbr', 'foob1r' and so on, but not 'foob=r' and so on
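
The two example patterns can be tried out with Python's `re` module as a rough check. The `re.ASCII` flag is an assumption made here to keep \w and \d close to the plain alphanumeric meaning given in the table; TRegExpr's own classes are configurable and may differ.

```python
import re

digit = re.compile(r"foob\dr", re.ASCII)
word_or_space = re.compile(r"foob[\w\s]r", re.ASCII)

# Map each candidate string to whether the whole string matches the pattern.
digit_results = {s: bool(digit.fullmatch(s))
                 for s in ["foob1r", "foob6r", "foobar", "foobbr"]}
class_results = {s: bool(word_or_space.fullmatch(s))
                 for s in ["foobar", "foob r", "foobbr", "foob=r"]}

print(digit_results)
print(class_results)
```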

TRegExpr uses the properties SpaceChars and WordChars to define the character classes \w, \W, \s and \S, so you can easily redefine them.

Metacharacters - word boundaries

\b     matches a word boundary
\B     matches a non-(word boundary)

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.
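
A short Python sketch of the word-boundary rule above. Python's `re` module uses the same \b and \B notation, though its definition of \w is assumed here to be close enough to TRegExpr's for plain ASCII text.

```python
import re

text = "cat concat cats"

# \bcat\b: "cat" with a boundary on both sides, so only the standalone word.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat: "cat" that does NOT start at a word boundary, i.e. inside "concat".
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

print(whole_word, inside_word)
```

The match at offset 0 shows the "imaginary \W" before the start of the string acting as one side of the boundary.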

Metacharacters - iterators

      Any item of a regular expression may be followed by another type of metacharacter: an iterator. Using these metacharacters, you can specify the number of occurrences of the preceding character, metacharacter or subexpression.

        *      zero or more ("greedy"), similar to {0,}
        +      one or more ("greedy"), similar to {1,}
        ?      zero or one ("greedy"), similar to {0,1}


        {n}    exactly n times ("greedy")
        {n,}   at least n times ("greedy")
        {n,m}  at least n but not more than m times ("greedy")
        *?     zero or more ("non-greedy"), similar to {0,}?
        +?     one or more ("non-greedy"), similar to {1,}?
        ??     zero or one ("non-greedy"), similar to {0,1}?
        {n}?   exactly n times ("non-greedy")
        {n,}?  at least n times ("non-greedy")
        {n,m}? at least n but not more than m times ("non-greedy")

      So, digits in curly brackets of the form {n,m} specify the minimum number of times to match the item (n) and the maximum (m). The form {n} is equivalent to {n,n} and matches exactly n times. The form {n,} matches n or more times. There is no limit to the size of n or m, but large numbers will use more memory and slow down regular expression execution.

If a curly bracket occurs in any other context, it is treated as a regular character.

      Examples:
      foob.*r     matches strings like 'foobar', 'foobalkjdflkj9r' and 'foobr'
      foob.+r     matches strings like 'foobar', 'foobalkjdflkj9r' but not 'foobr'
      foob.?r     matches strings like 'foobar', 'foobbr' and 'foobr' but not 'foobalkj9r'
      fooba{2}r   matches the string 'foobaar'
      fooba{2,}r  matches strings like 'foobaar', 'foobaaar', 'foobaaaar' etc.
      fooba{2,3}r matches strings like 'foobaar' or 'foobaaar' but not 'foobaaaar'
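
The iterator examples above can be checked with a small Python sketch. The `matches` helper is a hypothetical name introduced only for this illustration, and `re.fullmatch` is used so that each pattern must cover the whole candidate string.

```python
import re

def matches(pattern: str, candidate: str) -> bool:
    """Return True if the pattern matches the entire candidate string."""
    return re.fullmatch(pattern, candidate) is not None

# "*" allows zero occurrences, "+" requires at least one, "?" allows at most one.
star_ok = matches(r"foob.*r", "foobr")
plus_ok = matches(r"foob.+r", "foobr")
optional_ok = matches(r"foob.?r", "foobalkj9r")

# Counted forms: exactly 2, at least 2, and between 2 and 3 occurrences of "a".
exact_two = [matches(r"fooba{2}r", s) for s in ("foobar", "foobaar", "foobaaar")]
two_or_more = [matches(r"fooba{2,}r", s) for s in ("foobar", "foobaar", "foobaaaar")]
two_to_three = [matches(r"fooba{2,3}r", s) for s in ("foobaar", "foobaaar", "foobaaaar")]

print(star_ok, plus_ok, optional_ok, exact_two, two_or_more, two_to_three)
```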

A little explanation about "greediness". "Greedy" takes as many as possible; "non-greedy" takes as few as possible. For example, starting at the first 'b' of the string 'abbbbc', 'b+' and 'b*' return 'bbbb', 'b+?' returns 'b', 'b*?' returns an empty string, 'b{2,3}?' returns 'bb', and 'b{2,3}' returns 'bbb'.
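
The greedy and non-greedy results quoted above can be reproduced with the following Python sketch. The helper `at_first_b` is a hypothetical name for this illustration. It starts each match at index 1 (the first 'b'), because a plain left-to-right search with 'b*' would already succeed with a zero-length match at the leading 'a'.

```python
import re

text = "abbbbc"

def at_first_b(pattern: str) -> str:
    """Match the pattern starting exactly at the first 'b' (index 1)."""
    return re.compile(pattern).match(text, 1).group()

results = {p: at_first_b(p)
           for p in ("b+", "b*", "b+?", "b*?", "b{2,3}?", "b{2,3}")}

print(results)
```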

You can switch all iterators into "non-greedy" mode (see the modifier /g).

Metacharacters – alternatives

You can specify a series of alternatives for a pattern using "|" to separate them, so that fee|fie|foe will match any of "fee", "fie", or "foe" in the target string (as would f(e|i|o)e). The first alternative includes everything from the last pattern delimiter ("(", "[", or the beginning of the pattern) up to the first "|", and the last alternative contains everything from the last "|" to the next pattern delimiter. For this reason, it's common practice to include alternatives in parentheses, to minimize confusion about where they start and end.

Alternatives are tried from left to right, so the first alternative found for which the entire expression matches is the one that is chosen. This means that alternatives are not necessarily greedy. For example, when matching foo|foot against "barefoot", only the "foo" part will match, as that is the first alternative tried, and it successfully matches the target string. (This might not seem important, but it is important when you are capturing matched text using parentheses.)

Also remember that "|" is interpreted as a literal within square brackets, so if you write [fee|fie|foe] you're really only matching [feio|].

Examples:
foo(bar|foo) matches the strings 'foobar' or 'foofoo'.
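
The left-to-right rule and the literal "|" inside square brackets can be observed in a short Python sketch. Python's `re` is used as a stand-in here, on the assumption that its alternation follows the same ordering rule.

```python
import re

# The first alternative that lets the match succeed wins, even if a longer one exists.
first_alt = re.search(r"foo|foot", "barefoot").group()

# Reordering the alternatives changes the result.
longer_first = re.search(r"foot|foo", "barefoot").group()

# Inside square brackets "|" is just another literal character of the class.
pipe_is_literal = re.fullmatch(r"[fee|fie|foe]", "|") is not None

# Parentheses make the scope of the alternatives explicit.
grouped = [bool(re.fullmatch(r"foo(bar|foo)", s))
           for s in ("foobar", "foofoo", "foobaz")]

print(first_alt, longer_first, pipe_is_literal, grouped)
```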

Metacharacters – subexpressions

The bracketing construct ( ... ) may also be used to define regular expression subexpressions. After parsing, you can find subexpression positions, lengths and actual values in the MatchPos, MatchLen and Match properties of TRegExpr, and substitute them into template strings with TRegExpr.Substitute.

      Subexpressions are numbered based on the left-to-right order of their opening parentheses.
      The first subexpression has number '1' (the whole regular expression match has number '0'; you can substitute it in TRegExpr.Substitute as '$0' or '$&').

      Examples:
      (foobar){8,10}  matches strings which contain 8, 9 or 10 instances of 'foobar'
      foob([0-9]|a+)r matches 'foob0r', 'foob1r', 'foobar', 'foobaar', 'foobaaar' etc.
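
As a rough analogue of the numbering scheme, here is a Python sketch. Python's match-object methods `group`, `start` and `end` are assumed to correspond loosely to TRegExpr's Match, MatchPos and MatchLen, and `\g<0>` and `\1` in `re.sub` play the role that '$0' and '$1' play in a substitution template. It is not TRegExpr code.

```python
import re

text = "xx foobaar yy"
m = re.search(r"foob([0-9]|a+)r", text)

whole = m.group(0)                    # number 0: the whole match
first = m.group(1)                    # number 1: the first subexpression
first_pos = m.start(1)                # where subexpression 1 begins
first_len = m.end(1) - m.start(1)     # how long it is

# Substitute using the whole match (\g<0>) and subexpression 1 (\1).
replaced = re.sub(r"foob([0-9]|a+)r", r"<\g<0>:\1>", text)

# Nested groups are numbered by the position of their opening parenthesis.
nested = re.fullmatch(r"((a)(b))", "ab").groups()

print(whole, first, first_pos, first_len, replaced, nested)
```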

Metacharacters - backreferences

Metacharacters \1 through \9 are interpreted as backreferences. \<n> matches the previously matched subexpression #<n>.

      Examples:
        (.)\1+         matches 'aaaa' and 'cc'
        (.+)\1+        also matches 'abab' and '123123'
        (['"]?)(\d+)\1 matches "13" (in double quotes), '4' (in single quotes), 77 (without quotes) etc.
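
The backreference examples can be exercised with Python's `re` module, which uses the same \1 notation. This is a minimal sketch, with `re.fullmatch` chosen so that a mismatched closing quote is rejected rather than partially matched.

```python
import re

repeat_char = re.compile(r"(.)\1+")        # one character repeated at least once more
repeat_chunk = re.compile(r"(.+)\1+")      # a chunk repeated at least once more
quoted_number = re.compile(r"""(['"]?)(\d+)\1""")  # closing quote must equal opening quote

char_results = [bool(repeat_char.fullmatch(s)) for s in ("aaaa", "cc", "ab")]
chunk_results = [bool(repeat_chunk.fullmatch(s)) for s in ("abab", "123123", "abc")]
quote_results = [bool(quoted_number.fullmatch(s))
                 for s in ('"13"', "'4'", "77", "\"13'")]

# The digits themselves are captured as subexpression 2.
digits = quoted_number.fullmatch('"13"').group(2)

print(char_results, chunk_results, quote_results, digits)
```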