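Problem 3: Given a plain-text file in English, count the number of words it contains.
What should we count? Let's try it on Python's Easter egg, import this. (Save the text below as "test.txt".)
>>> import this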
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
1. Analysis (Python's re.findall method returns every matching substring as a list.)
Reference: the re module, re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
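To make the rule about groups concrete, here is a small sketch. The sample string and the variable names are made up for illustration; only re.findall itself comes from the documentation quoted above.

```python
import re

text = "width=10, height=20"

# No groups: each match is returned as a plain string.
no_groups = re.findall(r"[a-z]+=\d+", text)

# One group: only the text captured by the group is returned.
one_group = re.findall(r"[a-z]+=(\d+)", text)

# Two groups: each match becomes a tuple of the captured groups.
two_groups = re.findall(r"([a-z]+)=(\d+)", text)

print(no_groups)   # ['width=10', 'height=20']
print(one_group)   # ['10', '20']
print(two_groups)  # [('width', '10'), ('height', '20')]
```

For word counting, a pattern without groups is the simplest choice, because the result is then a flat list of strings.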
2. Experiments
First, let's get a feel for how re.findall works:
>>> import re
>>> re.findall('cat','dog cat dog')
['cat']
>>> re.findall('3','2,4,6,ss,gg')
[]
>>> re.findall('3','2,4,6,ss,gg,3')
['3']
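As you can see, re.findall returns every substring that matches the pattern, and an empty list when nothing matches.
The next step is to find an expression that represents letters.
Square brackets specify a set or range of characters. For example, the regular expression [A-Za-z] matches any single letter, upper- or lowercase; [A-Za-z][A-Za-z]* matches one letter followed by zero or more letters (upper- or lowercase). The metacharacter + does the same job: [A-Za-z]+ is exactly equivalent to [A-Za-z][A-Za-z]*. Note that not every program that supports regular expressions supports +, although Python's re module does; check the syntax support of the tool you are using.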
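The equivalence of the two forms can be checked directly in Python. This is a minimal sketch; the sample sentence is taken from the Zen of Python and the variable names are illustrative.

```python
import re

sample = "Simple is better than complex."

# One or more letters, written with the + metacharacter.
plus_form = re.findall(r"[A-Za-z]+", sample)

# The same thing spelled out: one letter, then zero or more letters.
star_form = re.findall(r"[A-Za-z][A-Za-z]*", sample)

print(plus_form)               # ['Simple', 'is', 'better', 'than', 'complex']
print(plus_form == star_form)  # True
```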
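[^a-zA-Z] simply means any one non-letter character. It can match any character other than a letter, but only a single one, not several.
To match several non-letter characters, add a quantifier after it, for example:
[^a-zA-Z]+ means one or more non-letter characters
[^a-zA-Z]{5,10} means 5 to 10 non-letter characters
^[a-z] matches text that starts with a lowercase letter; [^a-z] matches any single character that is not a lowercase letter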
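The patterns above can be tried out as follows. This is only a sketch with made-up sample strings; it shows the difference between ^ inside brackets (negation) and ^ outside brackets (start of string).

```python
import re

sample = "one-- and 2 ways"

# A bare negated class matches exactly one non-letter character at a time.
single = re.findall(r"[^a-zA-Z]", sample)

# With +, each run of adjacent non-letters becomes a single match.
runs = re.findall(r"[^a-zA-Z]+", sample)

# {5,10} requires a run of 5 to 10 non-letters.
bounded = re.findall(r"[^a-zA-Z]{5,10}", "a-----b--c")

# Outside brackets, ^ anchors the match to the start of the string.
starts_lower = re.findall(r"^[a-z]", "zen")
starts_upper = re.findall(r"^[a-z]", "Zen")

print(single)        # ['-', '-', ' ', ' ', '2', ' ']
print(runs)          # ['-- ', ' 2 ']
print(bounded)       # ['-----']
print(starts_lower)  # ['z']
print(starts_upper)  # []
```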
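test.txt contains no digits, so we can work with [^a-zA-Z]. Keep in mind that this class matches the non-letter characters between words, not the words themselves.
3. Code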
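import re

def count(filepath):
    # Read the whole file; the with-block closes it automatically.
    with open(filepath, 'r') as f:
        s = f.read()
    # [^a-zA-Z]+ matches runs of non-letter characters, so this counts
    # the separators between words rather than the words themselves.
    words = re.findall(r'[^a-zA-Z]+', s)
    return len(words)

if __name__ == '__main__':
    num = count('test.txt')
    print(num)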
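This version gives 208. I also found several other versions posted by more experienced developers. They are worth a look, but their final results differ.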
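import re

with open('test.txt', 'r') as f:
    data = f.read()
# Split on every single non-letter character; adjacent separators
# leave empty strings in the result.
result = re.split(r"[^a-zA-Z]", data)
# Count only the non-empty pieces.
print(len([x for x in result if x != '']))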
A nicely concise version that uses re.split; the result is 149.
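The filter on empty strings in that version matters. The sketch below, using made-up sample strings, shows where the empty strings come from when splitting on a single-character class.

```python
import re

line = "one-- and only"

# Splitting on single characters leaves an empty string between adjacent separators.
single_split = re.split(r"[^a-zA-Z]", line)

# Splitting on runs avoids that, but a trailing separator still leaves '' at the end.
run_split = re.split(r"[^a-zA-Z]+", "Readability counts.")

# Filtering out empty strings handles both cases.
words = [w for w in single_split if w]

print(single_split)  # ['one', '', '', 'and', 'only']
print(run_split)     # ['Readability', 'counts', '']
print(len(words))    # 3
```

After filtering, what remains are the runs of letters, so this approach counts letter runs rather than separators.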
import re

with open('test.txt', 'r') as f:
    data = f.read()
# Counts runs of non-letter characters (separators), not the words themselves.
result = re.findall(r"[^a-zA-Z]+", data)
print("the number of words in the file is: %s" % len(result))
The result is 150?
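One possible reason the split-based and findall-based counts sit so close together without being equal: the first counts runs of letters, the second counts runs of non-letters, and the two can differ by one depending on how the text begins and ends. The sketch below uses a short made-up text, not test.txt, so it illustrates the effect rather than reproducing the numbers above.

```python
import re

text = "Now is better than never.\nIt's a bad idea."

# Runs of non-letters (what the findall versions count).
separator_runs = re.findall(r"[^a-zA-Z]+", text)

# Runs of letters (what the filtered split version counts).
letter_runs = re.findall(r"[a-zA-Z]+", text)

# Without the final period the text ends on a letter,
# so there is one separator run fewer.
no_final_period = re.findall(r"[^a-zA-Z]+", text.rstrip("."))

print(len(separator_runs))   # 10
print(len(letter_runs))      # 10
print(len(no_final_period))  # 9
```

Note that both counts treat "It's" as two pieces, "It" and "s", so neither matches the nine words a human reader would count here.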
import re

def get_num():
    num = 0
    # Count separator runs line by line; the with-block closes the file.
    with open('test.txt', 'r') as f:
        for line in f:
            num += len(re.findall(r'[^a-zA-Z]+', line))
    return num

if __name__ == '__main__':
    print(get_num())

The result is 174.
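Counting line by line can also shift the total, because a separator run that spans a line break (for example a period, a newline and a blank line) is one match on the whole text but several matches when each line is scanned separately. The sketch below shows this on a made-up sample, and then counts words directly. The helper count_words and its apostrophe-aware pattern are assumptions added for illustration, not part of any of the versions above, and the sample is not test.txt.

```python
import re
from collections import Counter

sample = "Special cases aren't special enough to break the rules.\n\nReadability counts.\n"

# Separator runs counted on the whole text vs. line by line.
whole = len(re.findall(r"[^a-zA-Z]+", sample))
per_line = sum(
    len(re.findall(r"[^a-zA-Z]+", line))
    for line in sample.splitlines(keepends=True)
)

def count_words(text):
    """Hypothetical helper: return (total, frequencies) for the words in text."""
    # Letters, optionally followed by apostrophe-joined letters (aren't, it's).
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    return len(words), Counter(w.lower() for w in words)

total, freq = count_words(sample)

print(whole, per_line)  # 12 13
print(total)            # 11
print(freq["special"])  # 2
```

Matching the words themselves, rather than the gaps between them, makes the count independent of how the file is read, and the Counter gives the per-word frequencies that the problem statement can also be read as asking for.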