Python Basic Syntax (Part 3)

Featured repost

bj学无止境 2012-10-12 13:36:54

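7. Regular Expressions http://www.csvt.net/
7.1 String replacement
7.1.1. Replace all matching substrings
Replace every substring of subject that matches the regular expression regex with newstring.
Example:
import re
regex = 'dba'
newstring = 'oracle'
subject = 'tianlesoftware is dba!'
# re.subn returns the new string and the number of substitutions made
result, number = re.subn(regex, newstring, subject)
print('the regex result is %s, total change num: %d' % (result, number))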

Result:
>>>
the regex result is tianlesoftware is oracle!, total change num: 1

7.1.2. Replace all matching substrings (using a regular expression object)
reobj = re.compile(regex)
result, number = reobj.subn(newstring, subject)
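As a runnable illustration of the compiled form, here is a minimal sketch. The sample subject string and the variable name first_only are made up for this example; count is the standard optional argument of sub() and subn().

```python
import re

# Compile once, then reuse the same pattern object.
reobj = re.compile("dba")
subject = "tianlesoftware is dba, and dba again!"  # illustrative input

# subn() returns a tuple: (new_string, number_of_substitutions).
result, number = reobj.subn("oracle", subject)

# sub() returns only the new string; count=1 stops after the first match.
first_only = reobj.sub("oracle", subject, count=1)

print(result, number)
print(first_only)
```

If the number of replacements is not needed, sub() is usually the simpler choice.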

7.2 String splitting
7.2.1. Split a string
result = re.split(regex, subject)
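A small sketch of what re.split does with a real pattern may help here. The input string and the variable names are invented for illustration; maxsplit and the "capturing group keeps the delimiter" behaviour are standard features of re.split.

```python
import re

subject = "alpha, beta;gamma   delta"  # illustrative input

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", subject)

# maxsplit limits how many splits are performed.
head_and_rest = re.split(r"[,;\s]+", subject, maxsplit=1)

# A capturing group in the pattern keeps the delimiters in the result.
with_delims = re.split(r"(,)", "a,b")

print(parts)
print(head_and_rest)
print(with_delims)
```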

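7.2.2. Split a string (using a regular expression object)
import re
regex = 'dba'
subject = 'tianlesoftware is dba!'
reobj = re.compile(regex)
# split the subject at every match of the compiled pattern
result = reobj.split(subject)
print('the regex result is %s' % result)
Result:
>>>
the regex result is ['tianlesoftware is ', '!']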

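7.3 Matching
The common ways to match with Python regular expressions are listed below:
7.3.1. Test whether the regular expression matches all or part of a string
regex = r"..."  # the regular expression
if re.search(regex, subject):
    do_something()
else:
    do_anotherthing()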

7.3.2. Test whether the regular expression matches the entire string
regex = r"...\Z"  # the regular expression ends with \Z
if re.match(regex, subject):
    do_something()
else:
    do_anotherthing()
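The difference between the two tests above can be sketched as follows. The sample strings and variable names are assumptions for illustration; re.fullmatch (available since Python 3.4) is shown as an alternative to ending the pattern with \Z.

```python
import re

subject = "order 12345 shipped"  # illustrative input

# search() scans the whole string for a match.
found_anywhere = re.search(r"\d+", subject) is not None

# match() only tries at the beginning of the string.
found_at_start = re.match(r"\d+", subject) is not None

# match() plus a trailing \Z requires the whole string to match.
whole_ok = re.match(r"\d+\Z", "12345") is not None
whole_fail = re.match(r"\d+\Z", "12345 shipped") is not None

# fullmatch() expresses the same intent without \Z.
full_ok = re.fullmatch(r"\d+", "12345") is not None

print(found_anywhere, found_at_start, whole_ok, whole_fail, full_ok)
```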

7.3.3. Create a match object, then get the match details from it
regex = r"..."  # the regular expression
match = re.search(regex, subject)
if match:
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    do_something()
else:
    do_anotherthing()
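To make the match details concrete, here is a minimal sketch that reuses the subject string from section 7.1; the pattern and the variable names are chosen only for illustration.

```python
import re

subject = "tianlesoftware is dba!"
match = re.search(r"dba", subject)

if match:
    start = match.start()   # index of the first matched character
    end = match.end()       # index just past the last matched character
    text = match.group()    # the matched text itself
    print(start, end, text)
else:
    start = end = text = None
```

Because end is exclusive, subject[start:end] is always equal to match.group().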

7.3.4. Get the part of a string matched by the regex
regex = r"..."  # the regular expression
match = re.search(regex, subject)
if match:
    result = match.group()
else:
    result = ""

7.3.5. Get the part of a string matched by a capturing group
regex = r"..."  # the regular expression
match = re.search(regex, subject)
if match:
    result = match.group(1)
else:
    result = ""

7.3.6. Get the part of a string matched by a named group
regex = r"..."  # the regular expression
match = re.search(regex, subject)
if match:
    result = match.group("groupname")
else:
    result = ""
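The three variants above (whole match, numbered group, named group) can be seen side by side in a short sketch. The date-like input, the group names year and month, and the variable names are all assumptions made up for this example.

```python
import re

subject = "posted 2012-10-12 by bj"  # illustrative input
# Two named groups followed by one unnamed (numbered) group.
regex = r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})"

match = re.search(regex, subject)
if match:
    whole = match.group()          # entire match
    year = match.group("year")     # access by name
    month = match.group(2)         # named groups are also numbered
    day = match.group(3)           # unnamed group, by number only
    all_groups = match.groups()    # tuple of every group
    named = match.groupdict()      # dict of the named groups only
    print(whole, year, month, day)
```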

7.3.7. Get a list of all regex matches in a string
result = re.findall(regex, subject)
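One detail worth illustrating is that the shape of the findall result depends on how many capturing groups the pattern has. The following is a sketch with a made-up input string and variable names.

```python
import re

subject = "a=1, b=22, c=333"  # illustrative input

# No groups: a list of the full matches.
no_groups = re.findall(r"\d+", subject)

# One group: a list of that group's text only.
one_group = re.findall(r"\w=(\d+)", subject)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w)=(\d+)", subject)

print(no_groups)
print(one_group)
print(two_groups)
```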

7.3.8. Iterate over all matches in a string
for match in re.finditer(r"<(.*?)\s*.*?/\1>", subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    do_something()
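Here is a runnable sketch of the same loop, using the tag pattern from above on a small made-up input. The input string and the variable name found are assumptions for illustration; this is not a general-purpose HTML parser.

```python
import re

subject = "<b>bold</b> and <i>italic</i>"  # illustrative input

found = []
for match in re.finditer(r"<(.*?)\s*.*?/\1>", subject):
    # Record the span, the full matched text, and the tag name (group 1).
    found.append((match.start(), match.end(), match.group(), match.group(1)))

for item in found:
    print(item)
```

Unlike findall, finditer yields match objects, so positions and groups are available for every match.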

7.3.9. Create a regex object to use the same regex for many operations
reobj = re.compile(regex)

7.3.10. Regex-object version of 7.3.1
(Use a regex object for an if/else branch on whether (part of) a string can be matched)
reobj = re.compile(regex)
if reobj.search(subject):
    do_something()
else:
    do_anotherthing()

7.3.11. Regex-object version of 7.3.2
(Use a regex object for an if/else branch on whether a string can be matched entirely)
reobj = re.compile(r"...\Z")  # the regular expression ends with \Z
if reobj.match(subject):
    do_something()
else:
    do_anotherthing()

7.3.12. Create a regex object, then get the match details from it
(Create an object with details about how the regex object matches (part of) a string)
reobj = re.compile(regex)
match = reobj.search(subject)
if match:
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    do_something()
else:
    do_anotherthing()

7.3.13. Use a regex object to get the part of a string matched by the regex
reobj = re.compile(regex)
match = reobj.search(subject)
if match:
    result = match.group()
else:
    result = ""

7.3.14. Use a regex object to get the part of a string matched by a capturing group
reobj = re.compile(regex)
match = reobj.search(subject)
if match:
    result = match.group(1)
else:
    result = ""

7.3.15. Use a regex object to get the part of a string matched by a named group
reobj = re.compile(regex)
match = reobj.search(subject)
if match:
    result = match.group("groupname")
else:
    result = ""

7.3.16. Use a regex object to get a list of all regex matches in a string
reobj = re.compile(regex)
result = reobj.findall(subject)

7.3.17. Use a regex object to iterate over all matches in a string
reobj = re.compile(regex)
for match in reobj.finditer(subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    do_something()
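The regex-object variants above all share one compiled pattern. A minimal sketch of that reuse follows; the key=value input, the group names key and value, and the variable names are made up for illustration.

```python
import re

# One compiled pattern, reused for several operations.
reobj = re.compile(r"(?P<key>\w+)=(?P<value>\d+)")
subject = "width=640 height=480"  # illustrative input

has_pair = reobj.search(subject) is not None

# With two groups, findall returns a list of (key, value) tuples.
pairs = reobj.findall(subject)

# finditer gives match objects, so named groups can be used directly.
as_dict = {m.group("key"): int(m.group("value")) for m in reobj.finditer(subject)}

# \g<name> refers to a named group in the replacement string.
swapped = reobj.sub(r"\g<value>=\g<key>", subject)

print(has_pair, pairs, as_dict, swapped)
```

Compiling is mainly a readability and reuse convenience; the module-level functions accept the same pattern string.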

For the full rules of regular expression syntax, see the Python documentation; only part of it is listed here:
'.'
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
'^'
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
'$'
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
'*'
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
'+'
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
'?'
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
{m}
Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.
{m,n}
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand 'a' characters followed by a b, but not aaab. The comma may not be omitted or the modifier would be confused with the previously described form.
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
'\'
Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.
If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python’s parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it’s highly recommended that you use raw strings for all but the simplest expressions.
[]
Used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. Special characters are not active inside sets. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; [a-z] will match any lowercase letter, and [a-zA-Z0-9] matches any letter or digit. Character classes such as \w or \S (defined below) are also acceptable inside a range, although the characters they match depend on whether ASCII or LOCALE mode is in force. If you want to include a ']' or a '-' inside a set, precede it with a backslash, or place it as the first character. The pattern []] will match ']', for example.
You can match the characters not within a range by complementing the set. This is indicated by including a '^' as the first character of the set; '^' elsewhere will simply match the '^' character. For example, [^5] will match any character except '5', and [^^] will match any character except '^'.
Note that inside [] the special forms and special characters lose their meanings and only the syntaxes described here are valid. For example, +, *, (, ), and so on are treated as literals inside [], and backreferences cannot be used inside [].
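Several of the rules quoted above can be checked directly. The sketch below reuses the examples from the quoted text where they exist ('foo1\nfoo2\n', '<H1>title</H1>', 'aaaaaa', [akm$], [^5], []]); the remaining inputs and all variable names are made up for illustration.

```python
import re

# '.' does not match a newline unless DOTALL is set.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL).group()

# '$' matches before the final newline, or before any newline in MULTILINE mode.
text = "foo1\nfoo2\n"
dollar_default = re.search(r"foo.$", text).group()
dollar_multiline = re.search(r"foo.$", text, re.MULTILINE).group()

# Greedy versus non-greedy quantifiers.
html = "<H1>title</H1>"
greedy = re.match(r"<.*>", html).group()
lazy = re.match(r"<.*?>", html).group()

# {m,n} takes as many repetitions as possible, {m,n}? as few as possible.
most = re.match(r"a{3,5}", "aaaaaa").group()
fewest = re.match(r"a{3,5}?", "aaaaaa").group()

# Character sets: '$' is literal inside [], '^' first negates the set,
# and ']' placed first is treated as a literal.
in_set = re.findall(r"[akm$]", "a$b^k")
not_five = re.findall(r"[^5]", "1525")
bracket = re.findall(r"[]]", "a]b")

print(dot_default, repr(dot_all), dollar_default, dollar_multiline)
print(greedy, lazy, most, fewest, in_set, not_five, bracket)
```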

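\d matches any decimal digit; it is equivalent to the class [0-9].
\D matches any non-digit character; it is equivalent to the class [^0-9].
\s matches any whitespace character; it is equivalent to the class [ \t\n\r\f\v].
\S matches any non-whitespace character; it is equivalent to the class [^ \t\n\r\f\v].
\w matches any alphanumeric character; it is equivalent to the class [a-zA-Z0-9_].
\W matches any non-alphanumeric character; it is equivalent to the class [^a-zA-Z0-9_].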
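A short sketch of these shorthand classes in use follows. The input strings and variable names are made up for illustration. One caveat, stated as a known property of Python 3 rather than something from the list above: for str patterns the classes are Unicode-aware by default, and the ASCII-only equivalences hold when the re.ASCII flag is given.

```python
import re

subject = "user_42 logged in at 13:36\t(ok)"  # illustrative input

digits = re.findall(r"\d+", subject)   # runs of digits
words = re.findall(r"\w+", subject)    # runs of letters, digits, underscore
fields = re.split(r"\s+", subject)     # split on any whitespace run
symbols = re.findall(r"[^\w\s]", subject)  # neither word nor whitespace

# "\u0663" is ARABIC-INDIC DIGIT THREE: matched by \d unless re.ASCII is set.
unicode_digits = re.findall(r"\d", "3\u0663")
ascii_digits = re.findall(r"\d", "3\u0663", re.ASCII)

print(digits, words, fields, symbols)
print(len(unicode_digits), ascii_digits)
```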