Using Python's Regular Expression Module


Jeffrey_Shi, 2013-08-10 20:24:30

© Copyright belongs to the author. This is an original work by 51CTO blog author Jeffrey_Shi; contact the author for permission before reposting.

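Python is very strong at text processing, and regular expressions are at the core of that strength. A regular expression lets you search or split a string according to a specific rule. This article gives a basic summary of how regular expressions are used in Python.

Python's re module is dedicated to regular expressions. It has two main parts: the pattern-matching rules, and the functions that re provides. The module must be imported (import re) before it is used.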

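1. Special characters in regular expressions

Writing a pattern requires knowing what the common special characters mean. They are summarized below: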

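Character	Function
.	Matches any character except the newline '\n'
*	Matches the preceding rule zero or more times
+	Matches the preceding rule one or more times
|	Alternation: matches if either side matches; for example, dog|cat matches 'dog' or 'cat'
?	Matches the preceding rule zero or one time
^	Matches the start of the string
$	Matches the end of the string
[]	Character set; for example, [abc] matches any one of 'a', 'b', or 'c'
()	Group; related rules are usually written inside a group, and the text it matches is captured
\d	Matches one digit; \D is the negation and matches one non-digit
\w	Matches one letter, digit, or underscore; \W is the negation of \w
\s	Matches whitespace such as spaces, tabs, and carriage returns
(?:)	Non-capturing group: groups a rule like () but does not capture the matched text; for example, (?:ab)+ matches one or more repetitions of 'ab'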
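
A few of the characters in the table can be tried out directly. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of the original article.

```python
import re

# Illustrative sample text (an assumption, not from the article).
text = "dog 42\ncat_7"

digits = re.findall(r"\d+", text)       # runs of one or more digits
words = re.findall(r"\w+", text)        # letters, digits and underscore
pets = re.findall(r"dog|cat", text)     # alternation: either 'dog' or 'cat'
lines = re.findall(r".+", text)         # '.' does not cross the newline

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r"(ab)+c", "ababc")
# With a non-capturing group, findall returns the whole match.
whole = re.findall(r"(?:ab)+c", "ababc")

print(digits, words, pets, lines, captured, whole)
```

Writing the patterns as raw strings (r"...") keeps Python from interpreting the backslashes before the re module sees them.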

2. Common re functions

findall(pattern, string, flags=0)

Searches string for pattern and returns a list of the substrings that match. If nothing matches, it returns an empty list.
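
As a small illustration of findall, the sketch below uses a made-up sentence and variable names; re.IGNORECASE is a standard flag of the re module.

```python
import re

# Illustrative sentence (an assumption, not from the article).
sentence = "he is cool, clever... Cool!"

exact = re.findall("cool", sentence)                      # case-sensitive
any_case = re.findall("cool", sentence, flags=re.IGNORECASE)
nothing = re.findall("python", sentence)                  # no match at all

print(exact)     # ['cool']
print(any_case)  # ['cool', 'Cool']
print(nothing)   # []
```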

compile(pattern, flags=0)

Compiles a regular expression into a pattern object for later use, and returns that pattern object.
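
A compiled pattern is convenient when the same rule is applied to several strings. A minimal sketch, with an illustrative pattern and inputs:

```python
import re

# Compile once, reuse many times. The pattern here is only an example.
word_re = re.compile(r"[A-Za-z]+")

first = word_re.findall("NSN engineer")
second = word_re.findall("a1b2")

print(first)            # ['NSN', 'engineer']
print(second)           # ['a', 'b']
print(word_re.pattern)  # the original pattern text
```

The pattern object offers the same operations as the module-level functions (findall, search, match, split), without the pattern argument.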

match(pattern, string, flags=0)

Tries to match pattern at the start of string. On success it returns a match object; otherwise it returns None.
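
The following sketch shows that match succeeds only when the pattern is found at the very beginning of the string. The sample text is illustrative.

```python
import re

text = "Jeffrey is cool"  # illustrative sample

at_start = re.match("Jeffrey", text)   # pattern is at the beginning
not_at_start = re.match("cool", text)  # pattern occurs later, so no match

print(at_start.group())  # 'Jeffrey'
print(at_start.span())   # (0, 7)
print(not_at_start)      # None
```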

search(pattern, string, flags=0)

Scans through string looking for the first place where pattern matches. On success it returns a match object; otherwise it returns None.
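
Unlike match, search finds the pattern anywhere in the string. A minimal sketch with an illustrative sample text:

```python
import re

text = "Jeffrey is cool"  # illustrative sample

found = re.search("cool", text)      # found in the middle of the string
missing = re.search("python", text)  # not present anywhere

print(found.group())  # 'cool'
print(found.start())  # 11
print(missing)        # None
```

Because the result is either a match object or None, it can be used directly in an if statement.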

split(pattern, string, maxsplit=0, flags=0)

Finds the substrings of string that match pattern, uses them as separators to split string, and returns the pieces as a list.
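
A short sketch of re.split follows. The input strings are made up; maxsplit is an optional argument of re.split that limits the number of splits.

```python
import re

# Split on a comma followed by optional whitespace.
parts = re.split(r",\s*", "cool, clever,smart")

# maxsplit=1 stops after the first separator.
limited = re.split(r",\s*", "cool, clever,smart", maxsplit=1)

# If the separator pattern contains a capturing group,
# the captured text is kept in the result list.
with_sep = re.split(r"(,)\s*", "a, b")

print(parts)     # ['cool', 'clever', 'smart']
print(limited)   # ['cool', 'clever,smart']
print(with_sep)  # ['a', ',', 'b']
```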

3. Code

The example below briefly demonstrates these regular expression features; the comments explain each step.

import re  # import the re module

txt = "Jeffrey is an engineer and work for NSN, he is cool, clever, and very smart....Cool! This is so wonderful word!"
txtlist = txt.split()  # str.split: splits on whitespace by default; you can also pass '\n', '\r', '\t', or even letters
print(txt)
print(txtlist)

regexfortxt = re.compile('cool')  # create a pattern object that can be reused for many operations
print(regexfortxt.findall(txt))  # find the pattern in txt and return a list of non-overlapping matches
print(regexfortxt.search(txt))  # scan through the string for a match; returns a match object, or None if no match was found

if regexfortxt.search(txt):  # use the result of search to decide which branch to take
    print("there is a regexfortxt in txt!")
else:
    print("there is no regexfortxt in txt!")

print(regexfortxt.match(txt))  # match only checks the start of the string, so it returns None here: txt does not start with 'cool'

regexsplit = regexfortxt.split(txt)  # split the string at each occurrence of the pattern; returns a list
print(regexsplit)
print(len(regexsplit))
print('\n')

regexforlist = re.compile(r'(^.*er.*$|^.*re.*$)')  # matches a string that contains 'er' or 're'
for word in txtlist:
    if regexforlist.search(word):  # test each list item against the pattern
        print(word)

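4. Output

5. Notes

The difference between re.search and re.match: as described in the section on common re functions, search tries the pattern against the whole string, while match tries it only at the start of the string.
Regular expressions can only be matched against string data, that is, data whose type() is 'str'. To match against a list (data whose type() is 'list'), match each item of the list separately, which requires a loop.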
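
The second note can be sketched as follows. The word list and variable names are illustrative; the pattern r"er|re" is a simplified stand-in for a rule that accepts any word containing 'er' or 're'.

```python
import re

words = ["Jeffrey", "is", "an", "engineer", "cool"]  # illustrative list
pattern = re.compile(r"er|re")

# Apply the pattern to each item, keeping the items that match.
matches = [w for w in words if pattern.search(w)]
print(matches)  # ['Jeffrey', 'engineer']

# Passing the list itself is rejected with a TypeError.
try:
    pattern.search(words)
except TypeError:
    list_rejected = True
else:
    list_rejected = False
print(list_rejected)  # True
```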