Using regexp in Python (python region)

Reposted

mob64ca13f34c58 2024-08-20 23:22:23


1. For compatibility with the re module, this module has 2 behaviours

2. Case-insensitive matches in Unicode

3. Flags

4. Groups

5. Additional features (see the table below)

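The regex module is a regular expression implementation that is backwards-compatible with the standard "re" module but offers additional functionality.

The re module's handling of zero-width matches changed in Python 3.7, and this module follows that behaviour when compiled for Python 3.7.

1. For compatibility with the re module, this module has 2 behaviours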

Version 0 (old behaviour, compatible with the re module):
Please note that the re module’s behaviour may change over time, and I’ll endeavour to match that behaviour in version 0.

Indicated by the VERSION0 or V0 flag, or (?V0) in the pattern.
Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:

.split won’t split a string at a zero-width match.
.sub will advance by one character after a zero-width match.

Inline flags apply to the entire pattern, and they can’t be turned off.
Only simple sets are supported.
Case-insensitive matches in Unicode use simple case-folding by default.

Version 1 (new behaviour, possibly different from the re module):

Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.
Zero-width matches are handled correctly.
Inline flags apply to the end of the group or pattern, and they can be turned off.
Nested sets and set operations are supported.
Case-insensitive matches in Unicode use full case-folding by default.

If no version is specified, the regex module defaults to regex.DEFAULT_VERSION.
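
The two ways of selecting a version can be sketched as follows. This is a minimal illustration that assumes the third-party regex package is installed; the sample strings are chosen for the example only.

```python
import regex

# Select version 1 behaviour inline with (?V1) ...
inline = regex.search(r"(?V1)[[a-z]--[aeiou]]+", "abcde")

# ... or with the V1 (VERSION1) flag.
flagged = regex.search(r"[[a-z]--[aeiou]]+", "abcde", flags=regex.V1)

print(inline.group(), inline.span())    # bcd (1, 4)
print(flagged.group(), flagged.span())  # bcd (1, 4)

# In version 1 an inline flag applies from where it appears to the end of
# the group or pattern, so (?i) below affects "b" but not "a".
scoped_hit = regex.match(r"(?V1)a(?i)b", "aB")
scoped_miss = regex.match(r"(?V1)a(?i)b", "AB")
print(scoped_hit is not None)  # True
print(scoped_miss)             # None
```

Passing the version explicitly, as here, keeps the result independent of whatever regex.DEFAULT_VERSION happens to be.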

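2. Case-insensitive matches in Unicode

The regex module supports both simple and full case-folding for case-insensitive matches in Unicode. Full case-folding can be turned on with the FULLCASE or F flag, or with (?f) in the pattern. Note that this flag affects how the IGNORECASE flag works; the FULLCASE flag does not itself turn on case-insensitive matching.

In the version 0 behaviour, the flag is off by default.
In the version 1 behaviour, the flag is on by default.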
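
The difference between simple and full case-folding can be seen with the German sharp s. The snippet below is a small sketch; the word "strasse" is just a convenient sample, and the versions are passed explicitly so the result does not depend on the default version.

```python
import regex

text = "stra\u00dfe"  # "straße", containing LATIN SMALL LETTER SHARP S

# Version 0 + IGNORECASE: simple case-folding, so "ss" cannot match "ß".
simple = regex.match(r"strasse", text, flags=regex.I | regex.V0)

# Version 1 + IGNORECASE: full case-folding is on by default.
full = regex.match(r"strasse", text, flags=regex.I | regex.V1)

print(simple)       # None
print(full.span())  # (0, 6)

# FULLCASE on its own does not make the match case-insensitive.
fullcase_only = regex.match(r"strasse", "STRASSE", flags=regex.F | regex.V1)
print(fullcase_only)  # None
```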

3. Flags

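There are 2 kinds of flag: scoped and global. Scoped flags can apply to only part of a pattern and can be turned on or off; global flags apply to the entire pattern and can only be turned on.

Scoped flags: FULLCASE, IGNORECASE, MULTILINE, DOTALL, VERBOSE, WORD.

Global flags: ASCII, BESTMATCH, ENHANCEMATCH, LOCALE, POSIX, REVERSE, UNICODE, VERSION0, VERSION1.

If none of the ASCII, LOCALE or UNICODE flags is specified, the default is UNICODE when the regex pattern is a Unicode string and ASCII when it is a bytestring.

The ENHANCEMATCH flag makes fuzzy matching attempt to improve the fit of the next match that it finds.
The BESTMATCH flag makes fuzzy matching search for the best match instead of the next match.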
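
The default choice between UNICODE and ASCII can be checked with \w, which matches accented letters only under UNICODE. A minimal sketch, with a sample string picked for the illustration:

```python
import regex

text = "caf\u00e9 au lait"  # "café au lait"

# str pattern with no ASCII/LOCALE/UNICODE flag: UNICODE is the default,
# so \w also matches "é".
unicode_words = regex.findall(r"\w+", text)

# The global ASCII flag restricts \w to ASCII word characters.
ascii_words = regex.findall(r"\w+", text, flags=regex.ASCII)

# bytes pattern with no such flag: ASCII is the default.
bytes_words = regex.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)  # ['café', 'au', 'lait']
print(ascii_words)    # ['caf', 'au', 'lait']
print(bytes_words)    # [b'caf', b'au', b'lait']
```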

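4. Groups

All capture groups have a group number, starting from 1. Groups with the same group name have the same group number, and groups with different group names have different group numbers.

The same name can be used by more than one group, with later captures "overwriting" earlier captures. All the captures of the group are available from the captures method of the match object.

Group numbers are reused across the different branches of a branch reset, e.g. (?|(first)|(second)) has only group 1. If capture groups have different group names then they will, of course, have different group numbers, e.g. (?|(?P<foo>first)|(?P<bar>second)) has group 1 ("foo") and group 2 ("bar").

The regex (\s+)(?|(?P<foo>[A-Z]+)|(\w+)) (?P<foo>[0-9]+) has 2 groups: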

(\s+) is group 1.
(?P<foo>[A-Z]+) is group 2, also called “foo”.
(\w+) is group 2 because of the branch reset.
(?P<foo>[0-9]+) is group 2 because it’s called “foo”.
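
The numbering described above can be observed on a match object. The following is a small sketch; the input string " ABC 123" is an assumption chosen so that the first branch of the branch reset matches.

```python
import regex

pattern = regex.compile(r"(\s+)(?|(?P<foo>[A-Z]+)|(\w+)) (?P<foo>[0-9]+)")

m = pattern.match(" ABC 123")

# Group 1 is the leading whitespace.
print(repr(m.group(1)))   # ' '

# Both "foo" groups share group number 2; the later capture wins.
print(m.group("foo"))     # 123
print(m.group(2))         # 123

# captures() still exposes every capture made by the shared group.
print(m.captures("foo"))  # ['ABC', '123']
```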

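5. Additional features (see the table below)

Pattern	Description
\m \M \b	Start of word, end of word, word boundary. regex uses \m for the start of a word and \M for the end of a word. \b is a word boundary, but it cannot tell the start of a word from the end.
(?flags-flags:...) scoped, (?flags-flags) global	Scoped control: (?i:...) turns case-insensitivity on for the subpattern and (?-i:...) turns it off. Several flags can be written side by side, as in (?is-f:...): flags to the left of the minus sign are turned on, flags to the right are turned off.
>>> regex.search(r"<B>(?i:good)</B>", "<B>GOOD</B>")
<regex.Match object; span=(0, 11), match='<B>GOOD</B>'>
Global control: (?si-f)<B>good</B>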
lookaround	Lookarounds are supported in conditional patterns:
>>> regex.match(r'(?(?=\d)\d+|\w+)', '123abc')
<regex.Match object; span=(0, 3), match='123'>
>>> regex.match(r'(?(?=\d)\d+|\w+)', 'abc123')
<regex.Match object; span=(0, 6), match='abc123'>
This is not quite the same as putting a lookaround in the first branch of a pair of alternatives:
>>> print(regex.match(r'(?:(?=\d)\d+\b|\w+)', '123abc'))  # if branch 1 fails, branch 2 is tried
<regex.Match object; span=(0, 6), match='123abc'>
>>> print(regex.match(r'(?(?=\d)\d+\b|\w+)', '123abc'))  # if branch 1 fails, branch 2 is not tried
None
(?p) POSIX matching (leftmost longest)	Normal matching:
>>> regex.search(r'Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 2), match='Mr'>
>>> regex.search(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 7), match='oneself'>
POSIX matching:
>>> regex.search(r'(?p)Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 3), match='Mrs'>
>>> regex.search(r'(?p)one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 17), match='oneselfsufficient'>
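[[a-z]--[aeiou]]	V0: simple sets, compatible with the re module. V1: nested sets with extended functionality; this set contains 'a' .. 'z' but excludes "a", "e", "i", "o", "u". For example, select V1 inline or with flags=regex.V1:
>>> regex.search(r'(?V1)[[a-z]--[aeiou]]+', 'abcde')
<regex.Match object; span=(1, 4), match='bcd'>
>>> regex.search(r'[[a-z]--[aeiou]]+', 'abcde', flags=regex.V1)
<regex.Match object; span=(1, 4), match='bcd'>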
(?(DEFINE)...)	Defines groups by name and content. If there is no group called "DEFINE", then ... is ignored, except that any groups defined within it can still be called. For example:
>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant) (?&item)', '5 elephants')
<regex.Match object; span=(0, 11), match='5 elephants'>
>>> # fixed patterns at both ends, with other content in between
>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant)[\u4E00-\u9FA5](?&item)', '123哈哈dog')
<regex.Match object; span=(0, 8), match='123哈哈dog'>
\K	Keeps the part of the match that follows the position where \K occurs and discards the part before it.
>>> m = regex.search(r'(\w\w\K\w\w\w)', 'abcdef')
>>> m  # keeps cde, discards ab
<regex.Match object; span=(2, 5), match='cde'>
>>> m[0]
'cde'
>>> m[1]
'abcde'
>>> m = regex.search(r'(?r)(\w\w\K\w\w\w)', 'abcdef')
>>> m  # reverse search: keeps bc, discards def
<regex.Match object; span=(1, 3), match='bc'>
>>> m[0]
'bc'
>>> m[1]
'bcdef'
(?r) reverse search	
>>> regex.findall(r".", "abc")
['a', 'b', 'c']
>>> regex.findall(r"(?r).", "abc")
['c', 'b', 'a']
Note: the result of a reverse search is not necessarily the reverse of a forward search.
>>> regex.findall(r"..", "abcde")
['ab', 'cd']
>>> regex.findall(r"(?r)..", "abcde")
['de', 'bc']
expandf	Subscripts can be used to get all the captures of a repeated capture group.
>>> m = regex.match(r"(\w)+", "abc")
>>> m.expandf("{1}")
'c'
m.expandf("{1}") == m.expandf("{1[-1]}"): later captures overwrite earlier ones, so {1} is 'c'.
>>> m.expandf("{1[0]} {1[1]} {1[2]}")
'a b c'
>>> m.expandf("{1[-1]} {1[-2]} {1[-3]}")
'c b a'
With named groups:
>>> m = regex.match(r"(?P<letter>\w)+", "abc")
>>> m.expandf("{letter}")
'c'
>>> m.expandf("{letter[0]} {letter[1]} {letter[2]}")
'a b c'
>>> m.expandf("{letter[-1]} {letter[-2]} {letter[-3]}")
'c b a'
>>> m = regex.match(r"(\w+) (\w+)", "foo bar")
>>> m.expandf("{0} => {2} {1}")
'foo bar => bar foo'
>>> m = regex.match(r"(?P<word1>\w+) (?P<word2>\w+)", "foo bar")
>>> m.expandf("{word2} {word1}")
'bar foo'
The same works with match objects returned by search().
capturesdict() groupdict() captures()	capturesdict() is a combination of groupdict() and captures(). groupdict() returns a dict whose keys are the group names and whose values are the last capture of each group. captures() returns a list of all the captures. capturesdict() returns a dict whose keys are the group names and whose values are lists of all the captures.
>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}
>>> m.captures("word")
['one', 'two', 'three']
>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()
{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']}
Ways to access groups	(1) By subscripting and slicing:
>>> m = regex.search(r"(?P<before>.*?)(?P<num>\d+)(?P<after>.*)", "pqr123stu")
>>> m["before"]
'pqr'
>>> len(m)
4
>>> m[:]
('pqr123stu', 'pqr', '123', 'stu')
(2) By name with group("name"):
>>> m.group('num')
'123'
(3) By group number:
>>> m.group(0)
'pqr123stu'
>>> m.group(1)
'pqr'
subf subfn	subf and subfn are alternatives to sub and subn respectively. When passed a replacement string, they treat it as a format string.
>>> regex.subf(r"(\w+) (\w+)", "{0} => {2} {1}", "foo bar")
'foo bar => bar foo'
>>> regex.subf(r"(?P<word1>\w+) (?P<word2>\w+)", "{word2} {word1}", "foo bar")
'bar foo'
partial	Partial matches: match, search, fullmatch and finditer all support partial matching, enabled with the partial keyword argument. The match object has a partial attribute, which is True for a partial match and False for a complete match.
>>> regex.search(r'\d{4}', '12', partial=True)
<regex.Match object; span=(0, 2), match='12', partial=True>
>>> regex.search(r'\d{4}', '123', partial=True)
<regex.Match object; span=(0, 3), match='123', partial=True>
>>> regex.search(r'\d{4}', '1234', partial=True)  # complete match: no partial in the repr
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True)
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True).partial  # complete match
False
>>> regex.search(r'\d{4}', '145', partial=True).partial  # partial match
True
>>> regex.search(r'\d{4}', '1245', partial=True).partial  # complete match
False
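
One typical use of the partial matching described in the last row above is validating input while it is still being typed. The sketch below is illustrative only: the 4-digit PIN pattern and the classify helper are hypothetical names introduced for this example.

```python
import regex

PIN = regex.compile(r"\d{4}")

def classify(text):
    """Hypothetical helper: judge the text typed so far against a 4-digit PIN."""
    m = PIN.fullmatch(text, partial=True)
    if m is None:
        return "invalid"  # can never become a valid PIN
    # m.partial is True when the text ran out before the pattern was satisfied.
    return "incomplete" if m.partial else "complete"

results = {typed: classify(typed) for typed in ["", "12", "1234", "12345", "12x"]}
for typed, verdict in results.items():
    print(repr(typed), verdict)
# '' incomplete
# '12' incomplete
# '1234' complete
# '12345' invalid
# '12x' invalid
```

Using fullmatch rather than search here means that trailing extra characters make the input invalid instead of being ignored.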

(?P<name>) duplicate group names	Group names can be duplicated; later captures overwrite earlier ones. Optional groups:
>>> # Both groups capture, the second capture 'overwriting' the first.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']
>>> # Only the second group captures.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['second']
>>> # Only the first group captures.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or ")
>>> m.group("item")
'first'
>>> m.captures("item")
['first']
Mandatory groups:
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)?", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['', 'second']
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or ")
>>> m.group("item")
''
>>> m.captures("item")
['first', '']
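detach_string	A match object holds a reference to the string that was searched, via its string attribute. The detach_string method "detaches" that string, making it available for garbage collection, which might save valuable memory if the string is very large.
>>> m = regex.search(r"\w+", "Hello world")
>>> print(m.group())
Hello
>>> print(m.string)
Hello world
>>> m.detach_string()
>>> print(m.group())
Hello
>>> print(m.string)
None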
(?0), (?1), (?2)	(?R) or (?0) tries to match the entire regex recursively. (?1), (?2), etc. try to match the relevant capture group (group 1, group 2, and so on). (Tarzan|Jane) loves (?1) is equivalent to (Tarzan|Jane) loves (?:Tarzan|Jane). (?&name) tries to match the named capture group.
>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Tarzan loves Jane").groups()
('Tarzan',)
>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Jane loves Tarzan").groups()
('Jane',)
>>> m = regex.search(r"(\w)(?:(?R)|(\w?))\1", "kayak")
>>> m.group(0, 1, 2)
('kayak', 'k', None)
Fuzzy matching	There are three types of error: insertion, indicated by "i"; deletion, indicated by "d"; substitution, indicated by "s". In addition, "e" indicates any type of error.
Examples:
foo  match "foo" exactly
(?:foo){i}  match "foo", permitting insertions
(?:foo){d}  match "foo", permitting deletions
(?:foo){s}  match "foo", permitting substitutions
(?:foo){i,s}  match "foo", permitting insertions and substitutions
(?:foo){e}  match "foo", permitting errors
If a certain type of error is specified, then any type not specified is not permitted. In the following examples I'll omit the item and write only the fuzziness:
{d<=3}  permit at most 3 deletions, but no other types
{i<=1,s<=2}  permit at most 1 insertion and at most 2 substitutions, but no deletions
{1<=e<=3}  permit at least 1 and at most 3 errors
{i<=2,d<=2,e<=3}  permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions
It's also possible to state the costs of each type of error and the maximum permitted total cost. Examples:
{2i+2d+1s<=4}  each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
{i<=1,d<=1,s<=1,2i+2d+1s<=4}  at most 1 insertion, at most 1 deletion, at most 1 substitution; each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
A test can also be added for the characters that are substituted or inserted. Examples:
{s<=2:[a-z]}  at most 2 substitutions, which must be in the character set [a-z]
{s<=2,i<=3:\d}  at most 2 substitutions, at most 3 insertions, which must be digits
By default, fuzzy matching searches for the first match that meets the given constraints. The ENHANCEMATCH flag, (?e), makes it attempt to improve the fit (i.e. reduce the number of errors) of the match it has found. The BESTMATCH flag makes it search for the best match instead.
regex.search("(dog){e}", "cat and dog")[1] returns "cat" because that matches "dog" with 3 errors (an unlimited number of errors is permitted).
regex.search("(dog){e<=1}", "cat and dog")[1] returns " dog" (with a leading space) because that matches "dog" with 1 error, which is within the limit.
regex.search("(?e)(dog){e<=1}", "cat and dog")[1] returns "dog" (without a leading space) because the fuzzy search matches " dog" with 1 error, which is within the limit, and the (?e) then makes it attempt a better fit.
The match object has an attribute fuzzy_counts, which gives the total number of substitutions, insertions and deletions:
>>> # A 'raw' fuzzy match:
>>> regex.fullmatch(r"(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 1)
>>> # 0 substitutions, 0 insertions, 1 deletion.
>>> # A better match might be possible if the ENHANCEMATCH flag is used:
>>> regex.fullmatch(r"(?e)(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 0)
>>> # 0 substitutions, 0 insertions, 0 deletions.
The match object also has an attribute fuzzy_changes, which gives a tuple of the positions of the substitutions, insertions and deletions:
>>> m = regex.search('(fuu){i<=2,d<=2,e<=5}', 'anaconda foo bar')
>>> m
<regex.Match object; span=(7, 10), match='a f', fuzzy_counts=(0, 2, 2)>
>>> m.fuzzy_changes
([], [7, 8], [10, 11])
\L<name>	Named lists. The old way: p = regex.compile(r"first|second|third|fourth|fifth"). If the list is large, parsing the resulting regex can take considerable time, and care must also be taken that the strings are properly escaped and properly ordered, for example "cats" before "cat". The new way: the order does not matter, and the items are treated as a set.
>>> option_set = ["first", "second", "third", "fourth", "fifth"]
>>> p = regex.compile(r"\L<options>", options=option_set)
The named_lists attribute:
>>> print(p.named_lists)
# Python 3
{'options': frozenset({'fifth', 'first', 'fourth', 'second', 'third'})}
# Python 2
{'options': frozenset(['fifth', 'fourth', 'second', 'third', 'first'])}
Set operators: sets and nested sets	Set operators are available in the version 1 behaviour only, and a set can contain nested sets. The operators, in order of increasing precedence, are:
|| for union ("x||y" means "x or y")
~~ (double tilde) for symmetric difference ("x~~y" means "x or y, but not both")
&& for intersection ("x&&y" means "x and y")
-- (double dash) for difference ("x--y" means "x but not y")
Implicit union, i.e. simple juxtaposition as in [ab], has the highest precedence. Thus [ab&&cd] is the same as [[a||b]&&[c||d]].
Examples:
[ab]  # Set containing 'a' and 'b'
[a-z]  # Set containing 'a' .. 'z'
[[a-z]--[qw]]  # Set containing 'a' .. 'z', but not 'q' or 'w'
[a-z--qw]  # Same as above
[\p{L}--QW]  # Set containing all letters except 'Q' and 'W'
[\p{N}--[0-9]]  # Set containing all numbers except '0' .. '9'
[\p{ASCII}&&\p{Letter}]  # Set containing all characters which are ASCII and letter
Start and end indexes	Match objects have additional methods that return information about all the successful matches of a repeated capture group. These methods are: matchobject.captures([group1, ...]), matchobject.starts([group]), matchobject.ends([group]) and matchobject.spans([group]).
>>> m = regex.search(r"(\w{3})+", "123456789")
>>> m.group(1)
'789'
>>> m.captures(1)
['123', '456', '789']
>>> m.start(1)
6
>>> m.starts(1)
[0, 3, 6]
>>> m.end(1)
9
>>> m.ends(1)
[3, 6, 9]
>>> m.span(1)
(6, 9)
>>> m.spans(1)
[(0, 3), (3, 6), (6, 9)]
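
The set operators listed in the table above can be tried out directly. The following sketch passes the V1 flag explicitly, since the operators exist only in the version 1 behaviour; the sample strings are arbitrary.

```python
import regex

V1 = regex.V1  # set operators require the version 1 behaviour

# Difference: 'a'..'z' without 'q' and 'w' (nested and flat spellings).
diff_nested = regex.findall(r"[[a-z]--[qw]]", "qwerty", flags=V1)
diff_flat = regex.findall(r"[a-z--qw]", "qwerty", flags=V1)
print(diff_nested)  # ['e', 'r', 't', 'y']
print(diff_flat)    # ['e', 'r', 't', 'y']

# Intersection: characters that are both ASCII and letters.
inter = regex.findall(r"[\p{ASCII}&&\p{Letter}]", "a\u00e91b", flags=V1)
print(inter)        # ['a', 'b']

# Symmetric difference: in exactly one of 'a'..'f' and 'd'..'h'.
sym = regex.findall(r"[[a-f]~~[d-h]]", "abcdefgh", flags=V1)
print(sym)          # ['a', 'b', 'c', 'g', 'h']

# Union: in either 'a'..'c' or 'x'..'z'.
union = regex.findall(r"[[a-c]||[x-z]]", "abcmxyz", flags=V1)
print(union)        # ['a', 'b', 'c', 'x', 'y', 'z']
```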

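\G	A search anchor. It matches at the position where each search started or continued, and it can be used for contiguous matches or in negative variable-length lookbehinds to limit how far back the lookbehind goes:
>>> regex.findall(r"\w{2}", "abcd ef")
['ab', 'cd', 'ef']
>>> regex.findall(r"\G\w{2}", "abcd ef")
['ab', 'cd']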
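
A common use of \G is to read a run of well-formed items and stop at the first malformed one. The snippet below is a sketch under that assumption; the key=value; format is invented for the illustration.

```python
import regex

text = "a=1;b=2;oops c=3;"

# Without \G, findall skips over the malformed part and keeps going.
loose = regex.findall(r"(\w+)=(\w+);", text)

# With \G, each match must start exactly where the previous one ended,
# so scanning stops at "oops".
strict = regex.findall(r"\G(\w+)=(\w+);", text)

print(loose)   # [('a', '1'), ('b', '2'), ('c', '3')]
print(strict)  # [('a', '1'), ('b', '2')]
```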

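(?|...|...) branch reset	Capture group numbers are reused across all the alternatives, but groups with different names have different group numbers.
>>> regex.match(r"(?|(first)|(second))", "first").groups()
('first',)
>>> regex.match(r"(?|(first)|(second))", "second").groups()
('second',)
Note: there is only one group.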
Timeout	The matching methods and functions support timeouts. The timeout (in seconds) applies to the entire operation:
>>> from time import sleep
>>>
>>> def fast_replace(m):
...     return 'X'
...
>>> def slow_replace(m):
...     sleep(0.5)
...     return 'X'
...
>>> regex.sub(r'[a-z]', fast_replace, 'abcde', timeout=2)
'XXXXX'
>>> regex.sub(r'[a-z]', slow_replace, 'abcde', timeout=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python37\lib\site-packages\regex\regex.py", line 276, in sub
    endpos, concurrent, timeout)
TimeoutError: regex timed out
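
A few features from the table above, namely the \m and \M word anchors, named lists and fuzzy matching, can be exercised together in one short script. This is only a sketch: the sample words and the list name tokens are assumptions made for the example.

```python
import regex

# \m and \M: start and end of a word.
word_starts = regex.findall(r"\m\w", "hello big world")
word_ends = regex.findall(r"\w\M", "hello big world")
print(word_starts)  # ['h', 'b', 'w']
print(word_ends)    # ['o', 'g', 'd']

# Named list: the items are literal strings, so "." and "+" need no escaping.
tokens = regex.compile(r"\L<tokens>", tokens=["a.b", "c+d"])
literal_hit = tokens.fullmatch("a.b")
literal_miss = tokens.fullmatch("axb")
print(literal_hit is not None)  # True
print(literal_miss)             # None

# Fuzzy matching: "helo" is "hello" with one character deleted.
any_error = regex.fullmatch(r"(?:hello){e<=1}", "helo")
print(any_error.fuzzy_counts)   # (0, 0, 1)

# Only substitutions permitted: the deletion is rejected ...
subs_only_miss = regex.fullmatch(r"(?:hello){s<=1}", "helo")
print(subs_only_miss)           # None

# ... but a single substituted character is accepted.
subs_only_hit = regex.fullmatch(r"(?:hello){s<=1}", "hallo")
print(subs_only_hit.fuzzy_counts)  # (1, 0, 0)
```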