Splitting Text in Python

Reposted

网络小墨 2023-08-09 14:41:24


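1. Splitting a string on multiple delimiters

The str.split() method of string objects only suits very simple cases: it does not allow multiple delimiters, nor a variable amount of whitespace around the delimiters.

When you need more flexible splitting, use re.split() instead:

import re

line = 'asdf  fjdk; afed, fjek,asdf, foo'
# Split on ';', ',' or a whitespace character, followed by any amount of whitespace
result = re.split(r'[;,\s]\s*', line)
print(result)

'''
# Result: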
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
'''

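When using re.split(), pay particular attention to whether the regular expression contains a capture group in parentheses. If it does, the matched delimiter text is also included in the result list.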

line = 'asdf  fjdk; afed, fjek,asdf, foo'
import re
result=re.split(r'(;|,|\s)\s*', line)
print(result)

'''
['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
'''
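
One use for the captured delimiters is rebuilding a string later on. The following is a minimal sketch of that idea; the names values, delimiters and rebuilt are chosen here purely for illustration.

```python
import re

line = 'asdf  fjdk; afed, fjek,asdf, foo'
fields = re.split(r'(;|,|\s)\s*', line)

# Even positions hold the values, odd positions hold the captured delimiters.
values = fields[::2]
delimiters = fields[1::2] + ['']  # pad so both lists have the same length

# Re-join each value with the delimiter that originally followed it.
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))
print(rebuilt)  # asdf fjdk;afed,fjek,asdf,foo
```

Note that the whitespace consumed by the trailing \s* is not captured, so the rebuilt string is a normalized version of the input rather than an exact copy.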

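If you don't want the delimiters in the result list but still need parentheses to group parts of the pattern, make sure the group is non-capturing, written as (?:...).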

line = 'asdf  fjdk; afed, fjek,asdf, foo'
import re
result=re.split(r'(?:,|;|\s)\s*', line)
print(result)

'''
# Result:
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
'''

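2. Matching text at the start or end of a string

Sometimes you need to check the start or end of a string against specific text patterns, such as filename extensions or URL schemes.

If the layout of the string is known, you can slice out the relevant part and compare it.

A simpler way, though, is to use the str.startswith() or str.endswith() methods: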

>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False
>>> url = 'http://www.python.org'
>>> url.startswith('http:')
True
>>>

To check against several possibilities, put them all in a tuple and pass it to startswith() or endswith():

>>> import os
>>> filenames = os.listdir('.')  # list every entry in the current directory
>>> filenames
[ 'Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h' ]
>>> [name for name in filenames if name.endswith(('.c', '.h')) ]  # multiple suffixes go in a tuple
['foo.c', 'spam.c', 'spam.h']
>>> any(name.endswith('.py') for name in filenames)
True
>>>
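
One detail that is easy to trip over: startswith() and endswith() accept a str or a tuple of str, but not a list. A small sketch, where the choices list is a made-up set of URL schemes:

```python
choices = ['http:', 'ftp:']  # hypothetical list of accepted URL schemes
url = 'http://www.python.org'

rejected = False
try:
    url.startswith(choices)  # passing a list raises TypeError
except TypeError as exc:
    rejected = True
    print('TypeError:', exc)

# Converting the list to a tuple first works as expected.
matched = url.startswith(tuple(choices))
print(matched)  # True
```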

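3. Matching strings with shell wildcard patterns

The fnmatch module provides two functions, fnmatch() and fnmatchcase(), that match text against the wildcard patterns commonly used in Unix shells (such as *.py or Dat[0-9]*.csv).

(1) fnmatch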

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
>>> names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
>>> [name for name in names if fnmatch(name, 'Dat*.csv')]
['Dat1.csv', 'Dat2.csv']
>>>

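However, fnmatch() follows the case-sensitivity rules of the underlying operating system, which vary from system to system. If that difference matters, use fnmatchcase() instead; it matches exactly according to the case used in your pattern.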

>>> # On OS X (Mac)
>>> fnmatch('foo.txt', '*.TXT')
False
>>> # On Windows
>>> fnmatch('foo.txt', '*.TXT')
True
>>>


>>> fnmatchcase('foo.txt', '*.TXT')
False
>>>
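
These functions are not limited to filenames. As a rough sketch, fnmatchcase() can also filter ordinary text records; the street addresses below are made up for illustration.

```python
from fnmatch import fnmatchcase

# Made-up street addresses, used only to illustrate the matching.
addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]

# Every address that ends in " ST".
st_only = [addr for addr in addresses if fnmatchcase(addr, '* ST')]

# Addresses numbered 54xx that mention CLARK.
clark_54xx = [addr for addr in addresses
              if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]

print(st_only)
print(clark_54xx)
```

Wildcard matching sits somewhere between plain string methods and full regular expressions, so it tends to fit cases where a simple pattern is enough.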

4. Matching and searching for text

(1) The common cases: the str.find(), str.endswith(), and str.startswith() methods

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> # Exact match
>>> text == 'yeah'
False
>>> # Match at start or end
>>> text.startswith('yeah')
True
>>> text.endswith('no')
False
>>> # Search for the location of the first occurrence
>>> text.find('no')
10
>>>

For more complicated matching, use regular expressions and the re module:

>>> text1 = '11/27/2012'
>>> text2 = 'Nov 27, 2012'
>>>
>>> import re
>>> # Simple matching: \d+ means match one or more digits
>>> if re.match(r'\d+/\d+/\d+', text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if re.match(r'\d+/\d+/\d+', text2):
...     print('yes')
... else:
...     print('no')
...
no

If you are going to match the same pattern many times, precompile the pattern string into a pattern object first:

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> if datepat.match(text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if datepat.match(text2):
...     print('yes')
... else:
...     print('no')
...
no
>>>

match() always tries to match at the start of the string. To find every occurrence of the pattern anywhere in the string, use findall() instead.

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
['11/27/2012', '3/13/2013']
>>>
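
When the pieces of a match matter, capture groups are the usual tool. The sketch below reuses the date pattern with three groups added; the variable names are illustrative only.

```python
import re

# Same date pattern as above, with a capture group around each field.
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

m = datepat.match('11/27/2012')
month, day, year = m.groups()  # each group as a separate string

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
# With groups in the pattern, findall() returns a list of tuples.
all_dates = datepat.findall(text)
iso = ['{}-{}-{}'.format(y, mo, d) for mo, d, y in all_dates]

# match() only anchors the start; fullmatch() requires the whole string to match.
loose = datepat.match('11/27/2012abc')
exact = datepat.fullmatch('11/27/2012abc')

print(month, day, year)
print(all_dates)
print(iso)
```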

5. Searching and replacing text

For simple literal replacements, the str.replace() method is all you need:

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> text.replace('yeah', 'yep')
'yep, but no, but yep, but no, but yep'
>>>

For more complicated replacements, use the sub() function from the re module:

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>>

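The first argument to sub() is the pattern to match and the second is the replacement pattern. A backslash followed by a digit, such as \3, refers to the capture group with that number in the pattern.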
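
The replacement can also be a function, which helps when the new text has to be computed. Here is a minimal sketch, assuming zero-padded months and days are wanted; to_iso is a hypothetical helper name, and subn() is used to report how many substitutions were made.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

def to_iso(m):
    """Callback for sub()/subn(): receives a match object, returns the replacement."""
    month, day, year = m.groups()
    return '{}-{:02d}-{:02d}'.format(year, int(month), int(day))

# subn() works like sub() but also returns the number of replacements.
newtext, count = datepat.subn(to_iso, text)
print(newtext)  # Today is 2012-11-27. PyCon starts 2013-03-13.
print(count)    # 2
```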

6. Case-insensitive search and replace

To ignore case in text operations, pass the re.IGNORECASE flag to the re functions you use.

>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']
>>> re.sub('python', 'snake', text, flags=re.IGNORECASE)
'UPPER snake, lower snake, Mixed snake'
>>>

To make the replacement follow the case of the matched text, use a helper function:

import re
text = 'UPPER PYTHON, lower python, Mixed Python'

def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()       # matched text is all uppercase
        elif text.islower():
            return word.lower()       # matched text is all lowercase
        elif text[0].isupper():
            return word.capitalize()  # matched text is capitalized
        else:
            return word
    return replace

# matchcase('snake') returns a callback; re.sub() calls it with each match object
res = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(res)

'''
# Result:
UPPER SNAKE, lower snake, Mixed Snake
'''

7. Shortest-match (non-greedy) patterns

>>> str_pat = re.compile(r'"(.*)"')
>>> text1 = 'Computer says "no."'
>>> str_pat.findall(text1)
['no.']
>>> text2 = 'Computer says "no." Phone says "yes."'
>>> str_pat.findall(text2)
['no." Phone says "yes.'] # greedy: matches between the outermost pair of double quotes
>>>

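In this example, the pattern r'"(.*)"' is meant to match text enclosed in double quotes. However, the * operator in a regular expression is greedy, so matching looks for the longest possible match. That is why the result for text2 in the second example is not what we want. To fix this, add the ? modifier after the * operator; this makes the match non-greedy and yields the shortest match instead: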

>>> str_pat = re.compile(r'"(.*?)"')
>>> str_pat.findall(text2)
['no.', 'yes.']
>>>
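
A non-greedy quantifier is not the only option. As an alternative sketch (not from the original text), a negated character class that refuses to cross a closing quote gives the same result for this input:

```python
import re

text2 = 'Computer says "no." Phone says "yes."'

# [^"]* matches any run of characters that are not double quotes,
# so a match can never extend past the next closing quote.
quoted = re.compile(r'"([^"]*)"')
result = quoted.findall(text2)
print(result)  # ['no.', 'yes.']
```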

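8. Multiline matching

Suppose you are matching a large block of text with a regular expression and need the match to span multiple lines.

>>> comment = re.compile(r'/\*(.*?)\*/')  # easy to forget: the dot (.) does not match the newline character \n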
>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a
... multiline comment */
... '''
>>>
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]
>>>

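To fix this, change the pattern so that it also accepts newlines.

'''
(?:...) is a non-capturing group. It still consumes the characters it matches, so after a match, matching continues from the position after those characters.
(?=...) is a lookahead assertion. It does not consume characters, so after a match, matching continues from the position where the lookahead began.
'''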
>>> comment = re.compile(r'/\*((?:.|\n)*?)\*/')
>>> comment.findall(text2)
[' this is a\nmultiline comment ']
>>>
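
Another way to get the same effect is the re.DOTALL flag, which makes the dot match every character including newlines. A short sketch using the same comment text:

```python
import re

text2 = '/* this is a\nmultiline comment */\n'

# With re.DOTALL, '.' also matches '\n', so the simple pattern works across lines.
comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
found = comment.findall(text2)
print(found)  # [' this is a\nmultiline comment ']
```

The flag is convenient for simple cases; for very complicated patterns, spelling out the newline handling in the pattern itself may be easier to reason about.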

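9. Normalizing Unicode text

In Unicode, certain characters can be represented by more than one valid sequence of code points. To make sure equivalent strings share the same underlying representation, normalize the text with the unicodedata module.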

>>> s1 = 'Spicy Jalape\u00f1o'
>>> s2 = 'Spicy Jalapen\u0303o'
>>> s1
'Spicy Jalapeño'
>>> s2
'Spicy Jalapeño'
>>> s1 == s2
False
>>> len(s1)
14
>>> len(s2)
15
>>>

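Here the text "Spicy Jalapeño" is presented in two forms. The first uses the fully composed character "ñ" (U+00F1). The second uses the Latin letter "n" followed by the combining tilde character "~" (U+0303).

Normalizing the text: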

>>> import unicodedata
>>> t1 = unicodedata.normalize('NFC', s1)  # the first argument selects the normalization form; NFC means fully composed characters (a single code point where possible)
>>> t2 = unicodedata.normalize('NFC', s2)
>>> t1 == t2
True
>>> print(ascii(t1))
'Spicy Jalape\xf1o'
>>> t3 = unicodedata.normalize('NFD', s1)  # NFD means fully decomposed, using combining characters
>>> t4 = unicodedata.normalize('NFD', s2)
>>> t3 == t4
True
>>> print(ascii(t3))
'Spicy Jalapen\u0303o'
>>>
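
Besides NFC and NFD, Python also supports the compatibility forms NFKC and NFKD, which additionally replace certain presentation variants. The sketch below uses the "fi" ligature as an example; same_text is a hypothetical helper name.

```python
import unicodedata

s = '\ufb01'  # LATIN SMALL LIGATURE FI, a single code point

nfc = unicodedata.normalize('NFC', s)    # canonical form: ligature is kept
nfkc = unicodedata.normalize('NFKC', s)  # compatibility form: split into 'f' + 'i'

def same_text(a, b):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

print(ascii(nfc))   # '\ufb01'
print(ascii(nfkc))  # 'fi'
print(same_text('Spicy Jalape\u00f1o', 'Spicy Jalapen\u0303o'))  # True
```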

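10. Using Unicode in regular expressions

By default, the re module already understands some Unicode character classes. For example, \d matches any Unicode digit character:

>>> import re
>>> num = re.compile(r'\d+')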
>>> # ASCII digits
>>> num.match('123')
<_sre.SRE_Match object at 0x1007d9ed0>
>>> # Arabic digits
>>> num.match('\u0661\u0662\u0663')
<_sre.SRE_Match object at 0x101234030>
>>>
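
If only the ASCII digits 0-9 should match, the re.ASCII flag restricts \d (and \w, \s) accordingly. A small sketch of the difference:

```python
import re

arabic = '\u0661\u0662\u0663'  # ARABIC-INDIC DIGITS ONE, TWO, THREE

unicode_digits = re.compile(r'\d+')          # default for str patterns: any Unicode digit
ascii_digits = re.compile(r'\d+', re.ASCII)  # only 0-9

print(unicode_digits.match(arabic) is not None)  # True
print(ascii_digits.match(arabic) is None)        # True
print(int(arabic))                               # 123, int() accepts Unicode decimal digits
```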

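11. Stripping unwanted characters from strings

(1) Removing unwanted characters, such as whitespace, from the start and end of a string.

The strip() method removes characters from the start and end of a string; lstrip() and rstrip() strip from the left and right side only. By default these methods remove whitespace, but you can specify other characters.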

>>> # Whitespace stripping
>>> s = ' hello world \n'
>>> s.strip()
'hello world'
>>> s.lstrip()
'hello world \n'
>>> s.rstrip()
' hello world'
>>>
>>> # Character stripping
>>> t = '-----hello====='
>>> t.lstrip('-')
'hello====='
>>> t.strip('-=')  # strips every leading and trailing character that appears in the given set
'hello'
>>>

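(2) Removing unwanted characters, such as whitespace, from the middle of a string.

>>> s = ' hello     world \n'
>>> s = s.strip()  # strip the ends first; strip() never touches the middle
>>> s.replace(' ', '')  # replace every space with an empty string
'helloworld'
>>> import re
>>> re.sub(r'\s+', ' ', s)  # first argument: pattern matching one or more whitespace characters; second: the replacement text
'hello world'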
>>>

(3) Using strip() while reading lines from a file

with open(filename) as f:
    lines = (line.strip() for line in f)
    for line in lines:
        print(line)

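Note: the expression lines = (line.strip() for line in f) acts as a data transform. It is efficient because it does not first read all the data into a temporary list; it only creates a generator, and strip() is applied to each line just before it is returned. Because strip() also removes the trailing newline (and can be told which other characters to remove), the cleaned lines can be joined into continuous text, which makes later matching and searching easier.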
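
To try this without a real file, io.StringIO can stand in for the file object. The following sketch assumes some made-up sample text and additionally drops blank lines before joining:

```python
import io

# In-memory stand-in for an open text file (sample content is made up).
f = io.StringIO('  first line  \n\tsecond line\n\n  third line\n')

# Lazily strip each line as it is read.
lines = (line.strip() for line in f)

# Keep only non-empty lines, then join them into one continuous string.
non_empty = [line for line in lines if line]
joined = ' '.join(non_empty)

print(non_empty)  # ['first line', 'second line', 'third line']
print(joined)     # first line second line third line
```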

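12. Sanitizing and cleaning up text

Suppose someone has entered text such as "pýtĥöñ" into a form on your web page, and you want to clean it up.

>>> s = 'pýtĥöñ\fis\tawesome\r\n'  # example input containing \f, \t and \r
>>> remap = {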
...     ord('\t') : ' ',
...     ord('\f') : ' ',
...     ord('\r') : None # Deleted
... }
>>> a = s.translate(remap)
>>> a
'pýtĥöñ is awesome\n'
>>>

>>> import unicodedata
>>> import sys
>>> cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode)
...                         if unicodedata.combining(chr(c)))
... # dict.fromkeys() builds a dictionary with every Unicode combining character as a key and None as each value
>>> b = unicodedata.normalize('NFD', a)  # normalize the input into its decomposed form
>>> b
'pýtĥöñ is awesome\n'
>>> b.translate(cmb_chrs)  # then call translate() to delete all the combining marks
'python is awesome\n'
>>>
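
If the goal is simply to end up with plain ASCII, an alternative sketch is to decompose the text and then let the ASCII codec discard whatever does not fit. This approach drops every non-ASCII character, not only accents, so it is only suitable when that loss is acceptable.

```python
import unicodedata

a = 'p\u00fdt\u0125\u00f6\u00f1 is awesome\n'  # 'pýtĥöñ is awesome\n'

# Decompose accented letters into base letter + combining mark ...
b = unicodedata.normalize('NFD', a)

# ... then encode to ASCII, silently ignoring the combining marks.
cleaned = b.encode('ascii', 'ignore').decode('ascii')
print(repr(cleaned))  # 'python is awesome\n'
```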

13. Aligning text strings

You want to format a string with some kind of alignment.

For basic alignment, use the ljust(), rjust() and center() methods of strings:

>>> text = 'Hello World'
>>> text.ljust(20)
'Hello World         '
>>> text.rjust(20)
'         Hello World'
>>> text.center(20)
'    Hello World     '
>>>

>>> text.rjust(20,'=')
'=========Hello World'
>>> text.center(20,'*')
'****Hello World*****'
>>>

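The format() function can also align strings easily. Use the <, > or ^ character followed by the desired width; a fill character can be placed in front of the alignment character: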

>>> format(text, '>20')
'         Hello World'
>>> format(text, '<20')
'Hello World         '
>>> format(text, '^20')
'    Hello World     '
>>>

>>> format(text, '=>20s')
'=========Hello World'
>>> format(text, '*^20s')
'****Hello World*****'
>>>

Notes:

(1) The same format codes also work in the format() method when formatting multiple values:

>>> '{:>10s} {:>10s}'.format('Hello', 'World')
'     Hello      World'
>>>

(2) format() is not limited to strings. It works with any value, including numbers:

>>> x = 1.2345
>>> format(x, '>10')
'    1.2345'
>>> format(x, '^10.2f')
'   1.23   '
>>>
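
The same format codes can be used inside f-strings (Python 3.6+), which is handy for printing simple tables. A rough sketch with made-up rows:

```python
# Made-up (name, shares, price) rows, used only for illustration.
rows = [('ACME', 50, 91.1), ('IBM', 100, 123.45)]

# Left-align the name in 8 columns, right-align the numbers.
lines = [f'{name:<8}{shares:>6d}{price:>10.2f}' for name, shares, price in rows]

for line in lines:
    print(line)
```

Every formatted line has the same total width (8 + 6 + 10 = 24 characters), so the columns line up.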

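14. Combining and concatenating strings

You want to combine several small strings into one larger string.

(1) Adjacent string literals are merged automatically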

>>> a = 'Hello' 'World'
>>> a
'HelloWorld'
>>>

(2) For a handful of strings, the plus (+) operator is usually good enough

>>> a = 'Is Chicago'
>>> b = 'Not Chicago?'
>>> a + ' ' + b
'Is Chicago Not Chicago?'
>>>

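Using the + operator to join a large number of strings is very inefficient, because each + creates a new string, which causes memory copies and garbage collection.

(3) The join() method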

>>> parts = ['Is', 'Chicago', 'Not', 'Chicago?']
>>> ' '.join(parts)
'Is Chicago Not Chicago?'
>>> ','.join(parts)
'Is,Chicago,Not,Chicago?'
>>> ''.join(parts)
'IsChicagoNotChicago?'
>>>

Keep efficiency in mind and choose the approach that fits the situation:

# Version 1 (string concatenation)
>>> f.write(chunk1 + chunk2)

# Version 2 (separate I/O operations)
>>> f.write(chunk1)
>>> f.write(chunk2)

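If the two strings are small, the first version performs better, because I/O system calls are inherently slow. If the two strings are large, the second version may be more efficient, because it avoids building a large temporary result and copying large blocks of memory.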
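
join() only accepts strings, so mixed data has to be converted first; a generator expression does this without building an intermediate list. A small sketch with made-up data:

```python
data = ['ACME', 50, 91.1]  # made-up mixed-type record

# Convert each item to str on the fly and join in a single pass.
csv_line = ','.join(str(d) for d in data)
print(csv_line)  # ACME,50,91.1

# Building a string piece by piece with += gives the same result,
# but creates a new string object on every iteration.
parts = ['Is', 'Chicago', 'Not', 'Chicago?']
slow = ''
for p in parts:
    slow += p
fast = ''.join(parts)
print(slow == fast)  # True
```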

15. Interpolating variables in strings

(1) The format() method

>>> s = '{name} has {n} messages.'
>>> s.format(name='Guido', n=37)
'Guido has 37 messages.'
>>>

(2) The format_map() method and vars()

If the values to substitute are held in variables in the current scope, you can combine format_map() and vars():

>>> s = '{name} has {n} messages.'
>>> name = 'Guido'
>>> n = 37
>>> s.format_map(vars())
'Guido has 37 messages.'
>>>

vars() also works when the values to substitute are attributes of an instance:

>>> class Info:
...     def __init__(self, name, n):
...         self.name = name
...         self.n = n
...
>>> a = Info('Guido',37)
>>> s.format_map(vars(a))
'Guido has 37 messages.'
>>>
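
Both format() and format_map() raise KeyError when a value is missing. One possible workaround, sketched below, is a dict subclass with a __missing__() method that leaves unknown placeholders untouched; SafeSub is a hypothetical name.

```python
class SafeSub(dict):
    """dict that returns the placeholder itself for missing keys."""
    def __missing__(self, key):
        return '{' + key + '}'

s = '{name} has {n} messages.'

# 'n' is not supplied, so its placeholder is left in the output.
partial = s.format_map(SafeSub(name='Guido'))
print(partial)  # Guido has {n} messages.

# With every key present it behaves like a normal mapping.
full = s.format_map(SafeSub(name='Guido', n=37))
print(full)     # Guido has 37 messages.
```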

16. Reformatting text to a fixed column width

Use the textwrap module to reformat text for output:

s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."

>>> import textwrap
>>> print(textwrap.fill(s, 70))
'''
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.
'''
>>> print(textwrap.fill(s, 40))
'''
Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.
'''
>>> print(textwrap.fill(s, 40, initial_indent='    '))  # initial_indent (default '') is prepended to the first line of the wrapped output
'''
    Look into my eyes, look into my
eyes, the eyes, the eyes, the eyes, not
around the eyes, don't look around the
eyes, look into my eyes, you're under.
'''
>>> print(textwrap.fill(s, 40, subsequent_indent='    '))  # subsequent_indent (default '') is prepended to every line of the wrapped output except the first
'''
Look into my eyes, look into my eyes,
    the eyes, the eyes, the eyes, not
    around the eyes, don't look around
    the eyes, look into my eyes, you're
    under.
'''
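
Two related helpers may be useful here: textwrap.wrap() returns the wrapped lines as a list instead of one string, and shutil.get_terminal_size() can supply a width that fits the current terminal. A sketch, with the fallback size chosen arbitrarily:

```python
import shutil
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# wrap() gives a list of lines, each at most `width` characters long.
lines = textwrap.wrap(s, 40)
print(lines[0])  # Look into my eyes, look into my eyes,

# Use the terminal width when available; fall back to 80 columns otherwise.
width = shutil.get_terminal_size(fallback=(80, 24)).columns
print(textwrap.fill(s, width))
```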

17. Handling HTML and XML entities in text

(1) To replace special characters such as '<' or '>' in HTML or XML with their entities, use the html.escape() function:

>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(s)
Elements are written as "<tag>text</tag>".
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.

>>> # Disable escaping of quotes
>>> print(html.escape(s, quote=False))
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".
>>>

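(2) If you receive raw text that contains entities or character references and need to convert them back, use the utility functions that come with the HTML and XML tools. (html.unescape() is used here because HTMLParser.unescape() was removed in Python 3.9.)

>>> s = 'Spicy &quot;Jalape&#241;o&quot;.'
>>> import html
>>> html.unescape(s)
'Spicy "Jalapeño".'
>>>
>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'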
>>>
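
When text has to be emitted as ASCII, non-ASCII characters can be turned into numeric character references during encoding with the errors='xmlcharrefreplace' handler. A short round-trip sketch:

```python
import html

s = 'Spicy Jalape\u00f1o'  # 'Spicy Jalapeño'

# Characters that do not fit in ASCII become &#NNN; references.
encoded = s.encode('ascii', errors='xmlcharrefreplace')
print(encoded)  # b'Spicy Jalape&#241;o'

# html.unescape() turns the references back into characters.
decoded = html.unescape(encoded.decode('ascii'))
print(decoded == s)  # True
```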

18. Tokenizing text

You have a string that you want to parse from left to right into a stream of tokens.

from collections import namedtuple

tokens = [('NAME', 'foo'), ('EQ','='), ('NUM', '23'), ('PLUS','+'),
          ('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]
# Define every possible token, including whitespace, as a regular expression with a named capture group:
import re
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

def generate_tokens(pat, text):
    Token = namedtuple('Token', ['type', 'value'])  # namedtuple creates a tuple-like class whose fields are accessible as attributes
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

# Example use
for tok in generate_tokens(master_pat, 'foo = 42* 23'):
    print(tok)

'''
Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='42')
Token(type='TIMES', value='*')
Token(type='WS', value=' ')
Token(type='NUM', value='23')
'''
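
In practice the whitespace tokens are usually filtered out. The sketch below is a variation on the code above, not the original implementation: it uses the documented finditer() method instead of scanner(), and TOKEN_SPEC is a name introduced here. One difference to keep in mind is that finditer() silently skips text that matches no token, whereas scanner.match stops at the first unmatched position.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

# (token name, pattern) pairs; order matters when patterns overlap.
TOKEN_SPEC = [
    ('NAME', r'[a-zA-Z_][a-zA-Z_0-9]*'),
    ('NUM', r'\d+'),
    ('PLUS', r'\+'),
    ('TIMES', r'\*'),
    ('EQ', r'='),
    ('WS', r'\s+'),
]
master_pat = re.compile(
    '|'.join('(?P<{}>{})'.format(name, pat) for name, pat in TOKEN_SPEC))

def generate_tokens(pat, text):
    """Yield a Token for every match of pat in text."""
    for m in pat.finditer(text):
        yield Token(m.lastgroup, m.group())

# Drop whitespace tokens with a generator expression.
tokens = [tok for tok in generate_tokens(master_pat, 'foo = 23 + 42 * 10')
          if tok.type != 'WS']
for tok in tokens:
    print(tok)
```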

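19. Performing text operations on byte strings

You want to perform common text operations (such as stripping, searching, and replacing) on byte strings.

(1) Byte strings support most of the built-in string operations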

>>> data = b'Hello World'
>>> data[0:5]
b'Hello'
>>> data.startswith(b'Hello')
True
>>> data.split()
[b'Hello', b'World']
>>> data.replace(b'Hello', b'Hello Cruel')
b'Hello Cruel World'
>>>

(2) Byte arrays support them as well

>>> data = bytearray(b'Hello World')
>>> data[0:5]
bytearray(b'Hello')
>>> data.startswith(b'Hello')
True
>>> data.split()
[bytearray(b'Hello'), bytearray(b'World')]
>>> data.replace(b'Hello', b'Hello Cruel')
bytearray(b'Hello Cruel World')
>>>

Notes:

(1) Indexing a byte string returns an integer, not a single character

>>> a = 'Hello World' # Text string
>>> a[0]
'H'
>>> a[1]
'e'
>>> b = b'Hello World' # Byte string
>>> b[0]
72
>>> b[1]
101
>>>

(2) Byte strings do not have a nice string representation and do not print cleanly unless they are first decoded into a text string.

>>> s = b'Hello World'
>>> print(s)
b'Hello World' # Observe b'...'
>>> print(s.decode('ascii'))
Hello World
>>>

(3) Byte strings do not support str.format(). The % operator works on byte strings only in Python 3.5 and later; on earlier Python 3 versions it raises a TypeError. The alternative is to format a text string and encode it.

>>> b'%10s %10d %10.2f' % (b'ACME', 100, 490.1)  # Python 3.5+
b'      ACME        100     490.10'
>>> b'{} {} {}'.format(b'ACME', 100, 490.1)
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'format'
>>>


If you want to apply format() to a byte string, format an ordinary text string first and then encode it as bytes.
>>> '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')
b'ACME              100     490.10'
>>>
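
Regular expressions work on byte strings too, as long as the pattern itself is also given as bytes. A brief sketch with made-up data:

```python
import re

data = b'FOO:BAR,SPAM'  # made-up sample bytes

# A str pattern cannot be applied to bytes.
mixed_rejected = False
try:
    re.split('[:,]', data)
except TypeError as exc:
    mixed_rejected = True
    print('TypeError:', exc)

# A bytes pattern works and returns bytes.
fields = re.split(b'[:,]', data)
print(fields)  # [b'FOO', b'BAR', b'SPAM']
```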
