Python Text Processing Guide (Chinese Edition): Python Text Processing Modules

Reposted

mob64ca14101b2f 2024-01-07 16:52:48


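Preface

String processing is one of the most common tasks in programming. This series walks through methods for handling string-processing tasks of increasing complexity,
so that you can quickly find a suitable approach for a given requirement and keep your code quick to write, concise, and robust.
This is the first article in the series and covers simple string processing. Python's built-in str type provides many commonly used string methods, which this article introduces by category. The other articles in the series are:
Text Processing in Python (2): Common Methods of the re Module
Text Processing in Python (3): Two More Pythonic Ways to Process Strings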

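1. The str type

Typing dir(str) at the interactive prompt lists the methods of str (a built-in type rather than a module). They cover the basic needs of string processing, and relying on these built-in methods instead of hand-written loops keeps code concise and robust.

By function, the str methods fall into the following four categories:

Methods that inspect the string without changing it (predicates such as isalpha, searching with find, counting with count)
Methods that return a modified copy (case conversion, padding with ljust, stripping with strip, formatting with format); strings are immutable, so the original is never changed in place
Conversion to and from other data types (splitting with split, joining with join)
Encoding conversion with encode

The sections below go through what each method does.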

1.1.1 Predicates: methods that check whether a string matches a particular rule

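Function	Description	Syntax	Example
⭐️ startswith	Checks whether the string starts with the given prefix	S.startswith(prefix[, start[, end]]) -> bool	① 'this'.startswith('th') ➡ True ② 'this'.startswith('hi') ➡ False ③ 'this'.startswith('hi', 1) ➡ True
⭐️ endswith	Checks whether the string ends with the given suffix	S.endswith(suffix[, start[, end]]) -> bool	① 'this'.endswith('is') ➡ True ② 'this'.endswith('hi') ➡ False ③ 'this'.endswith('hi', 0, 3) ➡ True
isalnum	Checks that every character is a letter or a digit (for ASCII text, [a-zA-Z0-9]) and the string is not empty	S.isalnum() -> bool	① 'fsj23289'.isalnum() ➡ True ② 'fsj23@289'.isalnum() ➡ False
isalpha	Checks that every character is a letter (for ASCII text, [a-zA-Z]) and the string is not empty	S.isalpha() -> bool	① 'fsj'.isalpha() ➡ True ② 'fsj@'.isalpha() ➡ False
isascii	Checks that every character is ASCII (Python 3.7+)	S.isascii() -> bool	① 'Aa*%'.isascii() ➡ True ② '中国'.isascii() ➡ False
⭐️ isdigit	Checks that every character is a digit (for ASCII text, [0-9])	S.isdigit() -> bool	① '2374812328'.isdigit() ➡ True ② '237,481,328'.isdigit() ➡ False
⭐️ islower	Checks that all cased characters are lowercase and there is at least one cased character	S.islower() -> bool	① 'this string'.islower() ➡ True ② 'This string'.islower() ➡ False
⭐️ isupper	Checks that all cased characters are uppercase and there is at least one cased character	S.isupper() -> bool	① 'THIS STRING'.isupper() ➡ True ② 'THIS sTRING'.isupper() ➡ False
istitle	Checks title case: every word starts with an uppercase letter followed only by lowercase letters ([A-Z][a-z]* per word)	S.istitle() -> bool	① 'This string'.istitle() ➡ False ② 'This String'.istitle() ➡ True
isspace	Checks that every character is whitespace (\s+)	S.isspace() -> bool	① ' \t \n \v'.isspace() ➡ True
isdecimal	Checks that every character is a decimal digit; stricter than isdigit	S.isdecimal() -> bool	① '123'.isdecimal() ➡ True ② '²'.isdecimal() ➡ False
isidentifier	Checks whether the string is a valid Python identifier	S.isidentifier() -> bool	① 'my_var'.isidentifier() ➡ True ② '2var'.isidentifier() ➡ False
isnumeric	Checks that every character is numeric, including fractions and other Unicode numerals	S.isnumeric() -> bool	① '½'.isnumeric() ➡ True ② '1.5'.isnumeric() ➡ False
isprintable	Checks that every character is printable (an empty string also counts)	S.isprintable() -> bool	① 'abc'.isprintable() ➡ True ② 'a\nb'.isprintable() ➡ False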
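
The three numeric predicates in the table are easy to confuse, so a small comparison may help. The sketch below prints their results for a few sample strings; is_plain_integer is a hypothetical helper name introduced only for this illustration, and it assumes Python 3.7+ because it calls isascii.

```python
# Compare isdecimal / isdigit / isnumeric on a few sample strings.
samples = ["123", "²", "½", "1.5", ""]

for s in samples:
    # Columns: the string, then isdecimal, isdigit, isnumeric.
    print(repr(s), s.isdecimal(), s.isdigit(), s.isnumeric())


def is_plain_integer(text):
    """Hypothetical helper: True only for non-empty strings of ASCII digits."""
    # isdecimal() alone would also accept full-width digits such as '１２３',
    # so isascii() is added as an extra restriction.
    return text.isascii() and text.isdecimal()
```

None of the three predicates accepts a sign or a decimal point, so strings such as '-5' or '1.5' need int() or float() inside a try/except instead.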

1.1.2 Searching and counting

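Function	Description	Syntax	Example
⭐️ find	Returns the index of the first occurrence of the substring, or -1 if it is not found	S.find(sub[, start[, end]]) -> int	① 'this is'.find('is') ➡ 2 ② 'this is'.find('not') ➡ -1
index	Returns the index of the first occurrence of the substring; raises ValueError if it is not found	S.index(sub[, start[, end]]) -> int	① 'this is'.index('is') ➡ 2 ② 'this is'.index('not') ➡ ValueError
rfind	Returns the index of the last occurrence of the substring (searching from the right), or -1 if it is not found	S.rfind(sub[, start[, end]]) -> int	① 'this is'.rfind('is') ➡ 5 ② 'this is'.rfind('not') ➡ -1
rindex	Returns the index of the last occurrence of the substring (searching from the right); raises ValueError if it is not found	S.rindex(sub[, start[, end]]) -> int	① 'this is'.rindex('is') ➡ 5 ② 'this is'.rindex('not') ➡ ValueError
⭐️ count	Returns the number of non-overlapping occurrences of the substring	S.count(sub[, start[, end]]) -> int	① 'this is'.count('is') ➡ 2 ② 'this is'.count('not') ➡ 0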
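
The optional start argument makes it possible to walk through every match, not just the first one. The following is a minimal sketch of that idea; find_all is a hypothetical helper, not a str method, and it mirrors count by reporting non-overlapping matches only.

```python
def find_all(text, sub):
    """Hypothetical helper: start indexes of every non-overlapping match."""
    if not sub:
        raise ValueError("sub must not be empty")
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # Resume the search just past the match that was found.
        start = text.find(sub, start + len(sub))
    return positions


print(find_all("this is", "is"))

# index() reports a missing substring with an exception instead of -1.
try:
    "this is".index("not")
except ValueError as exc:
    print("index failed:", exc)
```

When only a yes/no answer is needed, the in operator ('is' in 'this is') is usually clearer than comparing find() with -1.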

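1.2 Modification

1.2.1 Case conversion

Function	Description	Syntax	Example	Returns
⭐️ lower	Converts the string to lowercase	S.lower() -> str	'This string'.lower()	'this string'
⭐️ upper	Converts the string to uppercase	S.upper() -> str	'This string'.upper()	'THIS STRING'
capitalize	Uppercases the first character and lowercases the rest	S.capitalize() -> str	'this string'.capitalize()	'This string'
title	Uppercases the first letter of every word	S.title() -> str	'this string'.title()	'This String'
swapcase	Swaps uppercase and lowercase	S.swapcase() -> str	'This String'.swapcase()	'tHIS sTRING'
casefold	Mostly the same as lower, but more aggressive with special characters; intended for case-insensitive comparison	S.casefold() -> str	"der Fluß".casefold()	'der fluss'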
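
The difference between lower and casefold only shows up in comparisons, so here is a short sketch of a case-insensitive equality check. same_text is a hypothetical helper name used for illustration.

```python
def same_text(a, b):
    """Hypothetical helper: case-insensitive comparison via casefold()."""
    return a.casefold() == b.casefold()


left, right = "der Fluß", "DER FLUSS"

# lower() keeps the 'ß', so the two lowercase forms differ.
print(left.lower() == right.lower())

# casefold() expands 'ß' to 'ss', so the folded forms match.
print(same_text(left, right))
```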

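1.2.2 Padding and stripping

Function	Description	Syntax	Example	Returns
⭐️ ljust	Left-justifies the string, padding on the right with the fill character up to the given width	S.ljust(width, fillchar=' ') -> str	'This'.ljust(20)	'This                '
			'This'.ljust(20, '#')	'This################'
rjust	Right-justifies the string, padding on the left with the fill character up to the given width	S.rjust(width, fillchar=' ') -> str	'This'.rjust(20)	'                This'
			'This'.rjust(20, '-')	'----------------This'
center	Centers the string, padding on both sides with the fill character up to the given width	S.center(width, fillchar=' ') -> str	'This'.center(20)	'        This        '
			'This'.center(20, 'a')	'aaaaaaaaThisaaaaaaaa'
zfill	Pads the string on the left with zeros up to the given width; for unsigned strings it is shorthand for rjust(width, '0'), but a leading + or - sign stays in front of the zeros	S.zfill(width) -> str	'89'.zfill(8) == '89'.rjust(8, '0')	'00000089'

⭐️ strip	Removes whitespace from both ends; if chars is given, removes any of the characters in chars instead	S.strip(chars=None) -> str	' This '.strip()	'This'
			'abcabcThisaaaaaaaa'.strip('abc')	'This'
lstrip	Removes whitespace from the left end; if chars is given, removes any of the characters in chars instead	S.lstrip(chars=None) -> str	' This '.lstrip()	'This '
			'abcabcThisaaaaaaaa'.lstrip('abc')	'Thisaaaaaaaa'
rstrip	Removes whitespace from the right end; if chars is given, removes any of the characters in chars instead	S.rstrip(chars=None) -> str	' This '.rstrip()	' This'
			'abcabcThisaaaaaaaa'.rstrip('abc')	'abcabcThis'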
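
A typical use of the padding methods is lining up columns of text. The sketch below formats a small made-up data set (the rows list is an assumption for illustration) and then shows two details that are easy to miss: the chars argument of strip is a set of characters rather than a prefix or suffix, and zfill keeps a sign in front of the zeros.

```python
# Made-up sample data for the illustration.
rows = [("apple", 3), ("kiwi", 12), ("watermelon", 7)]

# Pad every name to the width of the longest one, right-align the counts.
width = max(len(name) for name, _ in rows)
lines = [name.ljust(width) + " | " + str(count).rjust(3) for name, count in rows]
print("\n".join(lines))

# strip(chars) removes any of the listed characters from both ends.
host = "www.example.com".strip("cmowz.")

# zfill keeps the sign first; rjust treats it as an ordinary character.
signed_zfill = "-89".zfill(8)
signed_rjust = "-89".rjust(8, "0")
print(host, signed_zfill, signed_rjust)
```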

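1.2.3 Replacing and formatting

⭐️ replace: replaces occurrences of a substring with a new string; count limits the number of replacements (the default -1 replaces all of them). S.replace(old, new, count=-1)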

In :'this is my string'.replace('is', 'notytall')
Out: 'thnotytall notytall my string'

In : 'this is my string'.replace('is', 'notytall', 1)
Out: 'thnotytall is my string'

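expandtabs: converts tab characters to spaces. S.expandtabs(tabsize=8). Unlike replace('\t', ' '), each tab is expanded to as many spaces as are needed to reach the next multiple of tabsize, so columns stay aligned.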

In : '\tthis is my \tstring'.expandtabs()
Out: '        this is my      string'

In : '\tthis is my \tstring'.expandtabs(4)
Out: '    this is my  string'
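
To make the tab-stop behaviour concrete, the following sketch compares expandtabs with a plain replace on a small tab-separated header line (the sample text is made up for illustration).

```python
text = "id\tname\tcity"

# expandtabs pads each field up to the next multiple of 8 columns.
aligned = text.expandtabs(8)

# replace inserts a fixed run of 8 spaces regardless of the column.
replaced = text.replace("\t", " " * 8)

print(repr(aligned), len(aligned))
print(repr(replaced), len(replaced))
```

With expandtabs, "name" starts at column 8 and "city" at column 16; with replace, the fields drift because the length of the preceding text is ignored.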

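translate & maketrans: replace individual characters according to a mapping table. str.maketrans builds the table and S.translate(table) applies it; unlike replace, the mapping works character by character rather than on substrings.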

In : tab = str.maketrans('abcde', '12345')
In : 'The match was abandoned because of bad weather'.translate(tab)
Out: 'Th5 m1t3h w1s 121n4on54 2531us5 of 214 w51th5r'
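
str.maketrans also accepts a third argument listing characters to delete, which is a common way to strip punctuation. A minimal sketch, using string.punctuation from the standard library and a made-up sample sentence:

```python
import string

# Third argument: every character listed here is mapped to None (deleted).
remove_punct = str.maketrans("", "", string.punctuation)

cleaned = "Hello, world! (It's 9:30.)".translate(remove_punct)
print(cleaned)

# Mapping and deleting can be combined in a single table.
table = str.maketrans("abc", "123", "-")
print("a-b-c".translate(table))
```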

⭐️ format: replaces the fields enclosed in {} with the given arguments. S.format(*args, **kwargs)

In : 'this is {} {} string'.format('not', 'my')
Out: 'this is not my string'

In : 'this is {} {} string. it belongs to {who}.'.format('not', 'my', who='jason')
Out: 'this is not my string. it belongs to jason.'

format_map: similar to format but takes a single mapping instead of unpacked arguments, so positional arguments are not supported. S.format_map(mapping)

In : mapping = {'a': 'not', 'b': 'my', 'who': 'jason'}  # avoid naming it map, which would shadow the built-in

In : 'this is {a} {b} string. it belongs to {who}.'.format_map(mapping)
Out: 'this is not my string. it belongs to jason.'

In : 'this is {a} {b} string. it belongs to {who}.'.format(**mapping)
Out: 'this is not my string. it belongs to jason.'
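
One practical difference between the two calls is that format_map uses the mapping object directly, so a dict subclass can supply defaults for missing keys. The sketch below assumes a hypothetical KeepMissing class that leaves unknown fields in place; with format(**...) the same data is copied into a plain dict and the missing key raises KeyError.

```python
class KeepMissing(dict):
    """Hypothetical dict subclass: unknown keys are rendered back as {key}."""

    def __missing__(self, key):
        return "{" + key + "}"


template = "this is {a} {b} string. it belongs to {who}."

# format_map consults KeepMissing itself, so __missing__ is honoured.
partial = template.format_map(KeepMissing(a="not", b="my"))
print(partial)

# format(**...) unpacks into an ordinary dict, so the lookup fails.
try:
    template.format(**KeepMissing(a="not", b="my"))
    raised = False
except KeyError:
    raised = True
print(raised)
```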

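1.2.4 Splitting and joining

⭐️ split, rsplit: split the string on the given separator and return a list. S.split(sep=None, maxsplit=-1)

The sep parameter specifies the separator; when it is omitted (None), the string is split on runs of any whitespace. maxsplit sets the maximum number of splits, so the result has at most maxsplit + 1 items.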

In : 'this is not my string'.split()
Out: ['this', 'is', 'not', 'my', 'string']

# Specify the separator
In : 'this_is_not_my_string'.split('_')
Out: ['this', 'is', 'not', 'my', 'string']

# Limit the number of splits
In : 'this is not my string'.split(maxsplit=2)
Out: ['this', 'is', 'not my string']

# rsplit starts splitting from the right
In : 'this is not my string'.rsplit(maxsplit=2)
Out: ['this is not', 'my', 'string']
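
Calling split() with no separator is not the same as split(' '). The sketch below, using made-up sample strings, shows that an explicit separator keeps the empty strings between adjacent separators, while the default form collapses runs of whitespace.

```python
line = "  alpha   beta gamma  "

# No separator: runs of whitespace collapse, no empty strings appear.
by_whitespace = line.split()

# Explicit separator: every single space is a split point.
by_space = line.split(" ")

# The same rule applies to any explicit separator, e.g. empty CSV-like fields.
fields = "a,b,,c".split(",")

print(by_whitespace)
print(by_space)
print(fields)
```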

splitlines: splits the string at line breaks and returns a list; pass keepends=True to keep the line-break characters. S.splitlines(keepends=False)

In : 'this is not my string\n second line'.splitlines()
Out: ['this is not my string', ' second line']

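partition, rpartition: split the string at the first (partition) or last (rpartition) occurrence of the separator and return a 3-tuple (head, sep, tail). S.partition(sep)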

In : 'this is not my string'.partition(' ')
Out: ('this', ' ', 'is not my string')

In : 'this is not my string'.rpartition(' ')
Out: ('this is not my', ' ', 'string')
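
Because partition always returns three items and splits only once, it suits simple key=value parsing. The following is a minimal sketch; parse_setting is a hypothetical helper and the input lines are made up for illustration.

```python
def parse_setting(line):
    """Hypothetical helper: parse 'key = value' into (key, value) or None."""
    key, sep, value = line.partition("=")
    if not sep:
        # The separator was not found: partition returned (line, '', '').
        return None
    return key.strip(), value.strip()


print(parse_setting("name = jason"))
print(parse_setting("expr = a=b"))      # only the first '=' splits
print(parse_setting("no separator"))
```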

⭐️ join: the inverse of split; concatenates the strings of an iterable into a single string, using S as the separator, and returns a str. S.join(iterable)

In : ' '.join(['this', 'is', 'not', 'my', 'string'])
Out: 'this is not my string'

In : '^-^'.join(['this', 'is', 'not', 'my', 'string'])
Out: 'this^-^is^-^not^-^my^-^string'
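
join only accepts strings, so other values have to be converted first. A short sketch with a made-up list of integers:

```python
values = [1, 2, 3]

# Passing non-string items raises TypeError.
try:
    ", ".join(values)
    raised = False
except TypeError:
    raised = True

# Convert each item to str before joining.
joined = ", ".join(str(v) for v in values)
print(raised, joined)
```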

1.2.5 Encoding

encode: converts the string to bytes in the given encoding. S.encode(encoding='utf-8', errors='strict')

In : '你好'.encode('gbk')
Out: b'\xc4\xe3\xba\xc3'

In : '你好'.encode('utf-8')
Out: b'\xe4\xbd\xa0\xe5\xa5\xbd'
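
The bytes produced by encode must be decoded with the same encoding, and the errors parameter controls what happens when a character cannot be represented. A minimal sketch of both points:

```python
text = "你好"

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# Decoding with the matching codec restores the original string.
round_trip = utf8_bytes.decode("utf-8")

# Decoding GBK bytes as UTF-8 fails for this input.
try:
    gbk_bytes.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

# errors= selects a fallback instead of raising UnicodeEncodeError.
ascii_replaced = text.encode("ascii", errors="replace")
ascii_escaped = text.encode("ascii", errors="backslashreplace")

print(round_trip, mismatch_detected, ascii_replaced, ascii_escaped)
```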

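2. The string module

The string module provides the following commonly used character sets as constants. The last three values contain quotes, backslashes, or control characters, so they are shown as Python string literals.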

ascii_letters	abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
ascii_lowercase	abcdefghijklmnopqrstuvwxyz
ascii_uppercase	ABCDEFGHIJKLMNOPQRSTUVWXYZ
digits	0123456789
hexdigits	0123456789abcdefABCDEF
octdigits	01234567
printable	'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
punctuation	'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
whitespace	' \t\n\r\x0b\x0c'
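
These constants are mostly used as alphabets for validation or generation. The sketch below shows one possible use of each; random_token and is_hex are hypothetical helper names, and the choice of the secrets module for randomness is an assumption of this example rather than something the article prescribes.

```python
import secrets
import string

# Alphabet built from two of the constants listed above.
ALPHABET = string.ascii_letters + string.digits


def random_token(length=12):
    """Hypothetical helper: random string drawn from letters and digits."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def is_hex(text):
    """Hypothetical helper: True for non-empty strings of hex digits."""
    return bool(text) and all(ch in string.hexdigits for ch in text)


print(random_token())
print(is_hex("ff00Aa"), is_hex("xyz"))
```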

3. Summary

The most commonly used str methods

① Predicates: startswith, isdigit, isupper

② Searching and counting: find, count

③ Modification: lower, upper, ljust, strip, replace, format, split, join
