
1. Byte pair encoding (BPE)

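The main purpose of this algorithm is to compress data and to address the problem of out-of-vocabulary (OOV, or "unregistered") words, that is, words that do not appear in the training corpus but do appear at test time. It does so by representing words as sequences of frequent subword units, so an unseen word can still be composed from known pieces.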
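As an illustration, the merge-learning step of BPE can be sketched as follows. This is a minimal sketch, not a production tokenizer: the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`), the `</w>` end-of-word marker, and the toy word frequencies are all assumptions made for this example.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} dict."""
    # Start from single characters plus an end-of-word marker (an assumption).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Toy corpus statistics, made up for illustration.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, 3)
print(merges)
print(vocab)
```

Because merges are learned over subword symbols, a word that never appeared in training can still be segmented into pieces that did, which is how BPE reduces the OOV problem.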


Normalisation is an often overlooked but useful preprocessing method. Its purpose is to transform text into a canonical form, for example transforming "soooo" to "so". This is especially important for noisy text such as social media, because it ensures that variants of the same word end up with similar features. However, there is no standard approach to normalisation, and the method needs to be defined according to the scenario.
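One possible normalisation rule for elongated words like "soooo" is sketched below. It is only one scenario-specific choice among many: the function name `collapse_elongation` and the threshold of three repeated characters are assumptions for this example.

```python
import re

# A character repeated three or more times in a row (assumed threshold).
_ELONGATED = re.compile(r"(\w)\1{2,}")


def collapse_elongation(text):
    """Collapse runs of 3+ identical characters down to a single character."""
    return _ELONGATED.sub(r"\1", text)


print(collapse_elongation("that was soooo good"))
```

The rule is deliberately crude: it leaves "good" untouched, but it would also turn "coool" into "col" rather than "cool", which shows why the normalisation method has to be chosen per scenario.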


The purpose of lemmatisation is to reduce a word to a general base form (its lemma) that still expresses the complete meaning. For example, ‘good’, ‘better’, and ‘best’ are all mapped to ‘good’.
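The idea can be shown with a toy lookup table. This is a sketch only: real lemmatisers rely on a full dictionary and usually on part-of-speech information, while the `LEMMA_TABLE` mapping and the `lemmatise` function here are hypothetical and cover just the example above.

```python
# Hypothetical, hand-written table covering only the example words.
LEMMA_TABLE = {
    "better": "good",
    "best": "good",
}


def lemmatise(word):
    """Return the lemma from the table, or the lowercased word if unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


print([lemmatise(w) for w in ["good", "better", "best"]])
```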


Stemming has something in common with lemmatisation: it is the process of removing affixes to obtain the root of a word. For example, we can extract the root ‘cat’ from ‘cats’, ‘catlike’, and ‘catty’.
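A naive suffix-stripping stemmer that handles the ‘cat’ example might look like this. It is a toy sketch, not the Porter algorithm or any library's stemmer: the suffix list, the minimum stem length, and the name `simple_stem` are assumptions chosen for illustration.

```python
# Assumed suffix list, ordered longest first so the longest match wins.
SUFFIXES = ("like", "ty", "s")
MIN_STEM_LENGTH = 3  # avoid stripping very short words down to nothing


def simple_stem(word):
    """Strip the first matching suffix if a long enough stem remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


print([simple_stem(w) for w in ["cats", "catlike", "catty"]])
```

Unlike the lemmatisation lookup, this works purely on the surface form, so it can produce stems that are not real words.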

Both methods are designed to reduce the influence of inflection, such as singular and plural forms or tense, on the analysis results, but they can also have a negative impact on training: a word may end up with a different part of speech (POS) or meaning after it has been lemmatised or stemmed.

4. Stopword removal

Stopwords are commonly used words in a language, such as "a", "the", and "is" in English. Removing these words helps us focus on more important words during analysis.
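A minimal sketch of stopword filtering is shown below. The `STOPWORDS` set is a small hand-picked list assumed for this example, and whitespace splitting stands in for a proper tokenizer; NLP libraries typically ship much longer lists.

```python
# Small illustrative stopword set (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def remove_stopwords(tokens):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = "The cat is on the mat".split()
print(remove_stopwords(tokens))
```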

5. Punctuation removal

Removing punctuation can reduce the noise introduced by structural characters and make the model more effective.
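For ASCII text, one simple way to do this uses Python's standard `string.punctuation` and `str.translate`. The function name `remove_punctuation` is an assumption for this sketch, and non-ASCII punctuation (such as curly quotes or Chinese punctuation) would need to be added to the table separately.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    """Delete ASCII punctuation characters from the text."""
    return text.translate(_PUNCT_TABLE)


print(remove_punctuation("Hello, world! It's fine."))
```

Note that apostrophes are removed as well, so "It's" becomes "Its"; whether that is acceptable depends on the task.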


In English, lowercasing all text is one of the simplest and most effective preprocessing methods for most natural language processing problems, and it helps improve the consistency of the expected output. Note, however, that some uppercase words carry a different meaning from their lowercase forms, for example "US" versus "us".
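The trade-off can be sketched with a lowercasing helper that optionally protects selected tokens. The `lowercase_tokens` function and its `keep` parameter are hypothetical names for this example, and which tokens to protect is a task-specific decision.

```python
def lowercase_tokens(tokens, keep=frozenset()):
    """Lowercase every token except those listed in `keep`."""
    return [t if t in keep else t.lower() for t in tokens]


tokens = "The US team helped us".split()

# Plain lowercasing merges "US" (the country) with "us" (the pronoun).
plain = lowercase_tokens(tokens)

# Protecting "US" keeps the two meanings apart.
guarded = lowercase_tokens(tokens, keep={"US"})

print(plain)
print(guarded)
```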