Weiwei Laoshi plans to write a Python project that extracts the words from any English text and counts how often each word occurs.

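The first step is word extraction from arbitrary English text. Take the following English text, excerpted from the world-famous young readers' classic The Little Prince: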

Little Prince

Written By Antoine de Saint-Exupery (1900~1944)

To Leon Werth

when he was a little boy

ask the indulgence of the children who may read this book for dedicating it to a grown-up. I have a serious reason: he is the best friend I have in the world. I have another reason: this grown-up understands everything, even books about children. I have a third reason: he lives in France where he is hungry and cold. He needs cheering up. If all these reasons are not enough, I will dedicate the book to the child from whom this grown-up grew. All grown-ups were once children-- although few of them remember it. And so I correct my dedication: 

  It was then that the fox appeared.

  And now six years have already gone by... 

  I have never yet told this story. The companions who met me on my return were well content to see me alive. I was sad, but I told them: "I am tired." 

  Now my sorrow is comforted a little. That is to say-- not entirely. But I know that he did go back to his planet, because I did not find his body at daybreak. It was not such a heavy body... and at night I love to listen to the stars. It is like five hundred million little bells... 

  But there is one extraordinary thing... when I drew the muzzle for the little prince, I forgot to add the leather strap to it. He will never have been able to fasten it on his sheep. So now I keep wondering: what is happening on his planet? Perhaps the sheep has eaten the flower... 

  At one time I say to myself: "Surely not! The little prince shuts his flower under her glass globe every night, and he watches over his sheep very carefully..." Then I am happy. And there is sweetness in the laughter of all the stars. 

  But at another time I say to myself: "At some moment or other one is absent-minded, and that is enough! On some one evening he forgot the glass globe, or the sheep got out, without making any noise, in the night..." And then the little bells are changed to tears... 

  Here, then, is a great mystery. For you who also love the little prince, and for me, nothing in the universe can be the same if somewhere, we do not know where, a sheep that we never saw has-- yes or no?-- eaten a rose... 

  Look up at the sky. Ask yourselves: is it yes or no? Has the sheep eaten the flower? And you will see how everything changes... 

  And no grown-up will ever understand that this is a matter of so much importance! 

  This is, to me, the loveliest and saddest landscape in the world. It is the same as that on the preceding page, but I have drawn it again to impress it on your memory. It is here that the little prince appeared on Earth, and disappeared. 

  Look at it carefully so that you will be sure to recognise it in case you travel some day to the African desert. And, if you should come upon this spot, please do not hurry on. Wait for a time, exactly under the star. Then, if a little man appears who laughs, who has golden hair and who refuses to answer questions, you will know who he is. If this should happen, please comfort me. Send me word that he has come back.

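Now let's use Python to extract every word in the text above. Save the text to a txt file, open it with Python's built-in open function, then detect and filter out spaces, punctuation, and other non-word characters so that the words are extracted automatically, as shown in the figures below: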

[Figure: Python text word extraction and word-frequency counting]

[Figure: Python text word extraction and word-frequency counting]
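The extraction code itself is shown only as an image, so the following is a minimal sketch of the step described above, not the author's implementation. The function names extract_words and save_words are hypothetical, and the choice of a regular expression that treats any run of ASCII letters as a word is an assumption: under it, a hyphenated word such as "grown-up" is split into "grown" and "up".

```python
import re


def extract_words(text):
    """Return the words in text, in order, dropping spaces and punctuation."""
    # Assumption: a "word" is a run of ASCII letters, so "grown-up"
    # yields "grown" and "up", and digits are discarded.
    return re.findall(r"[A-Za-z]+", text)


def save_words(words, path):
    """Write one word per line so a later script can read them back."""
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")


if __name__ == "__main__":
    sample = 'I was sad, but I told them: "I am tired."'
    print(extract_words(sample))
```

To process a whole file, one could read it with open(path, encoding="utf-8").read(), pass the result to extract_words, and hand the list to save_words.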

The extracted words are saved to a txt file. Next, we need a second piece of code that counts how often each word occurs.

    ...
    for line in words:
        # print(line)
        # Each line holds one word, possibly followed by a newline
        if line == 'var\n' or line == 'var':
            varcount += 1
        if line == 'word\n' or line == 'word':
            wordcount += 1
        if line == 'boy\n' or line == 'boy':
            boycount += 1
    ...

As the code above shows, it counts the occurrences of the word 'var', the word 'word', and the word 'boy'. The results are as follows:

[Figure: Python text word extraction and word-frequency counting]

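The results show that in the English text above, 'var' occurs 0 times, 'word' occurs once, and 'boy' occurs once. Checking against the text confirms that these counts are correct.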
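Keeping one counter variable per word gets repetitive as more words are tested. As a sketch only, the same comparison can be wrapped in a helper that takes the target word as a parameter. The name count_word and the short words list are hypothetical, and using strip() is an assumption that goes slightly beyond the original check: it also tolerates '\r\n' line endings and stray spaces.

```python
def count_word(lines, target):
    """Count the lines that equal target once surrounding whitespace is removed."""
    return sum(1 for line in lines if line.strip() == target)


# Hypothetical stand-in for the lines read from the saved word file.
words = ["when\n", "he\n", "was\n", "a\n", "little\n", "boy\n",
         "Send\n", "me\n", "word"]

for target in ("var", "word", "boy"):
    print(target, count_word(words, target))
```

The comparison stays case-sensitive, as in the original snippet, so 'Boy' and 'boy' would be counted separately.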

We can run a few more tests:

    ...
    for line in words:
        if line == 'so\n' or line == 'so':
            socount += 1
    ...

 

[Figure: Python text word extraction and word-frequency counting]

[Figure: Python text word extraction and word-frequency counting]
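To count every word in one pass instead of testing words one at a time, collections.Counter from the standard library is one option. This is a sketch under stated assumptions, not the approach used above: the name word_frequencies and the sample lines are hypothetical, and lowercasing is an optional choice that merges 'So' with 'so', which the case-sensitive snippets above do not do.

```python
from collections import Counter


def word_frequencies(lines, lowercase=True):
    """Return a Counter mapping each word to its number of occurrences."""
    counts = Counter()
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines
        # Assumption: folding case is optional; pass lowercase=False
        # to keep 'So' and 'so' as separate entries.
        counts[word.lower() if lowercase else word] += 1
    return counts


# Hypothetical stand-in for the lines read from the saved word file.
lines = ["so\n", "So\n", "much\n", "\n", "so"]
freq = word_frequencies(lines)
print(freq.most_common(2))  # [('so', 3), ('much', 1)]
```

Looking up a word that never occurs, such as freq['var'], returns 0 rather than raising an error, which matches the zero count reported for 'var' earlier.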