What Machine Learning Algorithms Are There for Processing Text

Reposted

karen 2024-07-16 11:32:52

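Contents

Basic Text Processing Skills

Basic Text Processing Pipeline
Characteristics of Chinese and English Text Preprocessing
Text Preprocessing

Reading the Text
Removing Non-Text Content
Word Segmentation
Removing Stopwords
Word Frequency Counts

Language Model

References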

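Basic Text Processing Skills

Basic Text Processing Pipeline

Chinese and English text share the same basic processing pipeline, which mainly consists of segmentation, cleaning, normalization, feature extraction, and modeling.

Characteristics of Chinese and English Text Preprocessing

Although the overall preprocessing pipeline is the same for Chinese and English, there are some essential differences. First, Chinese is not naturally delimited by spaces and punctuation the way English is, so a segmentation algorithm is needed to split a passage into words. English, for its part, has its own problems, such as spelling errors and the need for lemmatization. Lemmatization is needed because an English word takes different surface forms depending on context (for example, run, runs, ran, running). These forms all represent the same word, but because the spelling differs they would otherwise be treated as different words.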
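As a minimal illustration of the segmentation difference described above, the sketch below applies whitespace splitting to two made-up sample strings. It tokenizes the English sentence but leaves the Chinese sentence as a single chunk, which is why a dedicated segmenter is needed.

```python
# Made-up sample strings, used only for illustration.
english = "I love deep learning"
chinese = "我爱深度学习"

# Whitespace splitting is enough to tokenize the English sentence...
english_tokens = english.split()
# ...but the Chinese sentence contains no spaces, so nothing is split.
chinese_tokens = chinese.split()

print(english_tokens)  # ['I', 'love', 'deep', 'learning']
print(chinese_tokens)  # ['我爱深度学习']
```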

Text Preprocessing

The complete code is available on GitHub.

Reading the Text

We use the THUCNews subset from the previous post as an example.

ch_data_file = '../task1/cnews/cnews.train.txt'
with open(ch_data_file, 'r', encoding='utf-8') as f:
    ch_samples = [x.strip().split('\t') for x in f.readlines()]
    
print(len(ch_samples))
print(ch_samples[0])
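The loading code above assumes that every line has the form label, tab, text. A slightly more defensive variant might look like the sketch below. The helper name parse_samples and the inline demo lines are hypothetical, and the original does not say whether cnews.train.txt actually contains blank or malformed lines.

```python
def parse_samples(lines):
    """Split each line into [label, text], skipping lines without a tab."""
    samples = []
    for line in lines:
        line = line.strip()
        if '\t' not in line:
            continue  # blank or malformed line
        # maxsplit=1 keeps any later tabs inside the text field
        label, text = line.split('\t', 1)
        samples.append([label, text])
    return samples

# Hypothetical demo lines standing in for the real file contents.
demo_lines = ["体育\t湖人队今天获胜\n", "\n", "财经\t股市\t小幅上涨\n"]
demo_samples = parse_samples(demo_lines)
print(len(demo_samples))
print(demo_samples[0])
```

With this shape, every sample is guaranteed to have exactly two fields, so later steps can safely index sample[0] and sample[1].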

Removing Non-Text Content

Regular expressions are used to strip the non-text parts of each sample.

import re

def filter_nontext(samples):
    # This pattern fails to filter backslashes, full-width parentheses （）, and a few other symbols
    r1 = u'[a-zA-Z0-9’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}~]+'
    # Users can customize the characters to filter here; this rule does not filter everything either
    r2 = "[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+"
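    # The backslashes filter single and double backslashes, and / filters single and double forward slashes.
    # The first bracket holds ASCII symbols and the second holds Chinese symbols; the | before the
    # second bracket must not be omitted, otherwise the filtering is incomplete.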
    r3 =  "[.!//_,$&%^*()<>+\"'?@#-|:~{}]+|[——！\\\\，。=？、：“”‘’《》【】￥……（）]+" 
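    # Remove 【...】, 《...》 and #...# together with their contents, then any remaining punctuation.
    # A raw string avoids invalid-escape warnings; \" puts a literal double quote in the character class.
    r4 = r"【.*?】+|《.*?》+|#.*?#+|[.!/_,$&%^*()<>+\"'?@|:~{}#]+|[——！\\，。=？、：“”‘’￥……（）《》【】]"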

    clear_samples = []
    for sample in samples:
        sentence = sample[1]
        cleanr = re.compile('<.*?>')
        sentence = re.sub(cleanr, ' ', sentence)  # strip HTML tags
        sentence = re.sub(r4,'',sentence)
        clear_samples.append([sample[0], sentence])
    return clear_samples

clear_samples = filter_nontext(ch_samples)
print(len(clear_samples))
print(clear_samples[0])
print(clear_samples[1])
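The comments in filter_nontext note that a blacklist of symbols never filters everything. One alternative, shown here only as a sketch and not as the author's method, is a whitelist that keeps only characters in the basic CJK block (U+4E00 to U+9FFF). It also drops digits and Latin letters, which may or may not be desirable for a given task. The function name keep_chinese and the sample string are hypothetical.

```python
import re

TAG_RE = re.compile(r'<.*?>')                  # HTML tags
NON_CJK_RE = re.compile(r'[^\u4e00-\u9fff]+')  # runs of anything outside the basic CJK block

def keep_chinese(text):
    """Strip HTML tags, then replace every run of non-CJK characters with one space."""
    text = TAG_RE.sub(' ', text)
    text = NON_CJK_RE.sub(' ', text)
    return text.strip()

cleaned = keep_chinese('<p>【体育】湖人 108:100 胜!</p>')
print(cleaned)  # 体育 湖人 胜
```

Unlike r4 above, this keeps the text inside 【...】 rather than deleting it, so the two approaches are not interchangeable.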

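Word Segmentation

English text can usually be tokenized directly with split().
Chinese text requires a dedicated segmentation algorithm, such as the jieba segmenter.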

import jieba 

def cut_samples(samples):
    new_samples = []
    for sample in samples:
        sentence = sample[1]
        sentence_seg = jieba.cut(sentence)
        new_samples.append([sample[0], list(sentence_seg)])  # keep the label, as filter_nontext does
    return new_samples

seg_samples = cut_samples(clear_samples)
print(len(seg_samples))
print(seg_samples[0])
print(seg_samples[1])
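jieba is the tool used in this post. To give a feel for what a segmentation algorithm does, the sketch below implements forward maximum matching, a much simpler dictionary-based technique than what jieba does internally. The function name fmm_segment, the tiny vocabulary, and the max_len parameter are all assumptions made for this illustration.

```python
def fmm_segment(text, vocab, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Toy vocabulary, not a real dictionary.
vocab = {'深度', '学习', '深度学习'}
segmented = fmm_segment('我爱深度学习', vocab)
print(segmented)  # ['我', '爱', '深度学习']
```

Because the match is greedy, the result depends entirely on the dictionary; a statistical segmenter is generally more robust on ambiguous text.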

Removing Stopwords

For a Chinese stopword list, see "Chinese stopwords"; download the list and load it.

# For English text, NLTK's stopword list could be used instead:
# from nltk.corpus import stopwords
# stop = set(stopwords.words('english'))
with open('ch_stopwords.txt', 'r', encoding='utf-8') as f:
    stop = [x.strip() for x in f.readlines()]
    stop = set(stop)
print(stop)

def filter_stopwords(samples):
    new_samples = []
    for sample in samples:
        sentence = sample[1]
        filter_sentence= [w for w in sentence if w not in stop]
        new_samples.append((sample[0], filter_sentence))  # keep the first field alongside the filtered tokens
    return new_samples

nostop_samples = filter_stopwords(seg_samples)
print(len(nostop_samples))
print(nostop_samples[0])
print(nostop_samples[1])
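A small self-contained sketch of the same filtering step follows. The stopword set and token list are made up for illustration. It additionally drops whitespace-only tokens, on the assumption that the cleaning step (which replaces HTML tags with spaces) may leave such tokens in the segmented output.

```python
# Toy stopword set and token list, for illustration only.
stop_demo = {'的', '了', '是'}
tokens = ['深度', '学习', ' ', '是', '机器', '学习', '的', '分支']

# Keep a token only if it is not blank and not a stopword.
# Using a set makes each membership test O(1) on average.
filtered = [w for w in tokens if w.strip() and w not in stop_demo]
print(filtered)  # ['深度', '学习', '机器', '学习', '分支']
```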

Word Frequency Counts

from collections import Counter

def count(samples):
    cnt = Counter()
    for sample in samples:
        cnt.update(sample[1])  # add this sample's token counts in place
    return cnt

cnt = count(nostop_samples)
print(cnt.most_common(100))
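To make the counting step concrete, here is a toy run on two hand-written token lists (the data is made up; only collections.Counter from the standard library is used).

```python
from collections import Counter

# Two tiny "documents" in the same (label, tokens) shape used above.
toy_samples = [
    ('sports', ['win', 'game', 'win']),
    ('finance', ['game', 'stock', 'win']),
]

toy_cnt = Counter()
for _, toy_tokens in toy_samples:
    toy_cnt.update(toy_tokens)  # accumulate counts across documents

print(toy_cnt.most_common(2))  # [('win', 3), ('game', 2)]
```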

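Language Model

Simply put, a language model is a model that computes the probability of a sentence. An n-gram is a group of n consecutive words, and an n-gram language model represents a text as the collection of such groups: n=1 is called a uni-gram, n=2 a bi-gram, and n=3 a tri-gram. Take the text "I love deep learning" as an example: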

Uni-gram: {I}, {love}, {deep}, {learning}
Bi-gram : {I, love}, {love, deep}, {deep, learning}
Tri-gram : {I, love, deep}, {love, deep, learning}
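The n-grams listed above can be generated with a short sliding-window helper. This is a minimal sketch; the function name ngrams is chosen for this example and is not taken from the original code.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "I love deep learning".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

print(bigrams)   # [('I', 'love'), ('love', 'deep'), ('deep', 'learning')]
print(trigrams)  # [('I', 'love', 'deep'), ('love', 'deep', 'learning')]
```

A sentence of m words yields m - n + 1 n-grams, so the four-word example has four uni-grams, three bi-grams, and two tri-grams.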

Given a text sequence represented as n-grams, $w_1, w_2, \ldots, w_m$, where $w_i$ denotes the $i$-th n-gram in the sequence, the probability of the sentence $S$ can be expressed as: