(The complete source code is provided at the end of this article.)

1. Experiment Requirements

1.1 Objectives
1) Understand the classification task;
2) Examine students' understanding of data preprocessing steps and reinforce the importance of preprocessing;
3) Base models may call existing packages, so that students become familiar with the basic data mining workflow;
4) Learn to evaluate models from multiple perspectives and to discuss their parameters.

1.2 Dataset
1) The news text classification dataset is in Chinese and requires some preprocessing, including word segmentation and stop-word removal; the image dataset may be processed as appropriate.
2) Other issues in the data may be handled at your discretion.
Data note: split the data into train and test sets yourself, typically at a 7:3 ratio (a split sketch is shown below).
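
A minimal sketch of such a 7:3 split, assuming the samples have already been merged into a single pandas DataFrame with a label column named flag (both the frame and the column name are illustrative, not part of the original experiment):

import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame standing in for the real news data
df = pd.DataFrame({'content': ['a b c', 'd e f', 'g h i', 'j k l'] * 25,
                   'flag':    [0, 1, 0, 1] * 25})

# 70/30 split; stratify keeps the class proportions similar in both parts
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42,
                                     stratify=df['flag'])
print(len(train_df), len(test_df))  # 70 30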

1.3 Environment
Development environment: Python 3.7 (jieba, pandas, numpy, sklearn, matplotlib.pyplot)

1.4 Method Requirements
1) Include preprocessing steps tailored to the data, such as stop-word removal and dimensionality reduction;
2) In principle any model is allowed: decision trees, NB, NN, SVM, random forest, and methods beyond these are all acceptable.
3) Text may be represented with BOW, topic models, word vectors, and other representations; image datasets may use LBP, HOG, SURF, and similar features (a short representation sketch follows this list).
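
For the text representations above, a minimal bag-of-words vs. TF-IDF sketch; the three-document corpus is purely illustrative, with tokens space-separated as jieba would produce:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['股票 市场 上涨', '房价 市场 下跌', '球队 比赛 获胜']

bow = CountVectorizer().fit_transform(corpus)      # raw term counts (bag of words)
tfidf = TfidfVectorizer().fit_transform(corpus)    # counts re-weighted by inverse document frequency
print(bow.toarray())
print(tfidf.toarray().round(2))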

1.5 Result Requirements
1) Implement one or more basic classification models and compute evaluation metrics such as accuracy and recall (a short metrics sketch follows this list);
2) Vary the key model parameters (e.g., the stopping criterion for splitting in a decision tree, the number of layers in an NN) over different ranges and discuss their best value ranges;
3) Compare and analyze how different feature representations affect the results;
4) If two or more models are applied to the same data, compare their results to assess each model's suitability for the classification task on this dataset.
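
For the metrics in item 1, a minimal sketch on made-up predictions, using the same accuracy and macro-averaged recall reported later in this experiment:

from sklearn.metrics import accuracy_score, recall_score, classification_report

y_true = [0, 0, 1, 1, 2, 2]   # illustrative ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0]   # illustrative predictions

print(accuracy_score(y_true, y_pred))                 # 4 of 6 correct
print(recall_score(y_true, y_pred, average='macro'))  # mean of the per-class recalls
print(classification_report(y_true, y_pred))          # per-class precision, recall, F1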

2. Experiment Content

2.1 Data Preprocessing
Filter out tokens that appear in the stop-word list:

def pre_treating(para):
    words = jieba.cut(str(para))  # word segmentation
    words = [word for word in words if len(word) > 1]
    words = [word for word in words if word not in stopWords]
    return words
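
As a usage illustration (assuming stopWords has already been loaded from the stop-word file and pre_treating is defined as above), a call might look like the following; the exact tokens depend on the jieba version and the stop-word list:

print(pre_treating('今天的天气真是不错'))
# single-character tokens and stop words are dropped,
# leaving something like ['今天', '天气', '真是', '不错']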

Remove useless tokens from both the training and test sets, keeping only potentially useful ones:
for name in classname:
    data[name]['words'] = data[name]['content'].apply(pre_treating)
    test[name]['words'] = test[name]['content'].apply(pre_treating)

Add the class label as a new column:

i = 0
for name in classname:
    data[name]['flag'] = i
    test[name]['flag'] = i
    i += 1

2.2 Word Frequency Statistics
Concatenate the contents of all class sheets:

result = data[classname[0]]
testdata = test[classname[0]]
for name in classname[1:]:
    result = result.append(data[name])
    testdata = testdata.append(test[name])

Count the 800 most frequent tokens:

topWordNum = 800
items = result['words'].values.tolist()
words = []
for item in items:
    words.extend(item)
wordCount = pd.Series(words).value_counts()[0:topWordNum]
wordCount = wordCount.index.values.tolist()
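
An equivalent way to obtain the top tokens, sketched here with collections.Counter on a toy token list (illustrative data only, not the experiment's corpus):

from collections import Counter

words = ['市场', '股票', '市场', '球队', '市场', '股票']   # toy token list
top = Counter(words).most_common(2)        # [('市场', 3), ('股票', 2)]
topTokens = [w for w, _ in top]            # keep only the tokens, like wordCount above
print(topTokens)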

2.3 Converting Tokens to Vectors
Convert each token list into a vector; each vector component is the number of occurrences of one of the high-frequency words:

def wordsToVec(words):
    vec = map(lambda word: words.count(word), wordCount)
    vec = list(vec)
    return vec

Append the vectors to the combined tables and drop the columns that are no longer needed:

result['vec'] = result['words'].apply(wordsToVec)
result = result.drop(['content'], axis=1)
result = result.drop(['channelName'], axis=1)

testdata['vec'] = testdata['words'].apply(wordsToVec)
testdata = testdata.drop(['content'], axis=1)
testdata = testdata.drop(['channelName'], axis=1)
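
The same counting could also be done with sklearn's CountVectorizer and a fixed vocabulary; the vocabulary and documents below are illustrative stand-ins for the 800 wordCount tokens and the segmented texts, not the original data:

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['市场', '股票', '球队']               # stands in for wordCount
docs = ['市场 股票 市场', '球队 比赛']          # token lists joined with spaces
vec = CountVectorizer(vocabulary=vocab)
print(vec.transform(docs).toarray())          # [[2 1 0] [0 0 1]]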

2.4 Random Forest
Convert the vectors and labels into x and y values:

xTrain = result['vec'].tolist()
yTrain = result['flag'].tolist()
xTest = testdata['vec'].tolist()
yTest = testdata['flag'].tolist()

Random forest scoring functions:

def get_rf_ascore(my_para):
    # accuracy of a forest with my_para trees, evaluated on the test set
    clf = RandomForestClassifier(n_estimators=my_para)
    clf.fit(xTrain, yTrain)

    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)

    score = accuracy_score(y_test, y_pre)
    return score

def get_rf_rscore(my_para):
    # macro-averaged recall of the same model on the test set
    clf = RandomForestClassifier(n_estimators=my_para)
    clf.fit(xTrain, yTrain)

    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)

    score = recall_score(y_test, y_pre, average='macro')
    return score
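
The two functions above fit an identical forest twice for every parameter value. A possible refactor (a sketch, not the original experiment code, reusing the same xTrain, yTrain, xTest, yTest and imports) trains each forest once and returns both scores; fixing random_state also makes the parameter sweep repeatable:

def get_rf_scores(my_para):
    # fit a single forest and evaluate it with both metrics
    clf = RandomForestClassifier(n_estimators=my_para, random_state=0)
    clf.fit(xTrain, yTrain)
    y_pre = clf.predict(xTest)
    return accuracy_score(yTest, y_pre), recall_score(yTest, y_pre, average='macro')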

2.5 Decision Tree
The decision tree is implemented with the sklearn package; the classifier corresponds to the class DecisionTreeClassifier. The experiment code is as follows:

def get_dt_ascore(my_para):
    # accuracy of a tree with the given maximum depth, evaluated on the test set
    clf = tree.DecisionTreeClassifier(max_depth=my_para)
    clf.fit(xTrain, yTrain)

    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)

    score = accuracy_score(y_test, y_pre)
    return score

def get_dt_rscore(my_para):
    # macro-averaged recall of the same model on the test set
    clf = tree.DecisionTreeClassifier(max_depth=my_para)
    clf.fit(xTrain, yTrain)

    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)

    score = recall_score(y_test, y_pre, average='macro')
    return score
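
As an alternative way to study max_depth (a sketch assuming xTrain, yTrain and the tree import from the code above are available), cross-validated grid search tunes the parameter on the training data instead of the test set:

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': list(range(1, 100, 5))}   # coarse grid over the same range
search = GridSearchCV(tree.DecisionTreeClassifier(random_state=0),
                      param_grid, cv=3, scoring='accuracy')
search.fit(xTrain, yTrain)
print(search.best_params_, search.best_score_)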

2.6 Random Forest Results

rfaScore = []
rfrScore = []
estimator = np.arange(1, 20, 1)

for i in estimator:
    temp_ascore = get_rf_ascore(i)
    temp_rscore = get_rf_rscore(i)
    rfaScore.append(temp_ascore)
    rfrScore.append(temp_rscore)

plt.plot(rfaScore, color='red')
plt.plot(rfrScore, color='green')
plt.xlabel('estimator')
plt.ylabel('score')
plt.show()

2.7 Decision Tree Results

dtaScore = []
dtrScore = []
testDepth = np.arange(1, 100, 1)

for i in testDepth:
    temp_ascore = get_dt_ascore(i)
    temp_rscore = get_dt_rscore(i)
    dtaScore.append(temp_ascore)
    dtrScore.append(temp_rscore)

plt.plot(dtaScore, color='yellow')
plt.plot(dtrScore, color='blue')
plt.ylabel('score')
plt.xlabel('testDepth')
plt.show()

3. Analysis and Summary

3.1 Analysis

[Figure: random forest accuracy (red) and recall (green) versus number of trees]


The number of trees n_estimators was varied from 1 to 19, and the resulting scores are shown in the figure above. Within this range the score rises as the number of trees grows, i.e. both the classification accuracy and the recall improve. Whether the trend continues beyond 20 trees would require additional experiments to verify (red: accuracy; green: recall).

[Figure: decision tree accuracy (yellow) and recall (blue) versus maximum tree depth]


The maximum tree depth max_depth was varied from 1 to 99, and the resulting scores are shown in the figure above. Within this range the score rises as the maximum depth increases, i.e. the classification accuracy improves with deeper trees. Whether the trend continues for depths beyond 100 would require additional experiments to verify (yellow: accuracy; blue: recall).
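
To check the trend beyond the ranges used here, a wider sweep could be run. The sketch below (assuming get_rf_ascore from Section 2.4 and the np/plt imports are available) evaluates the forest with up to 200 trees:

wider = np.arange(20, 201, 20)                  # 20, 40, ..., 200 trees
scores = [get_rf_ascore(int(n)) for n in wider]

plt.plot(wider, scores, color='red')
plt.xlabel('n_estimators')
plt.ylabel('accuracy')
plt.show()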

3.2 Summary

Through this experiment I became familiar with several of the models in the sklearn package, which are widely used in academic research. By looking up material online I learned how to preprocess Chinese text, namely how to filter out stop words. The experiment also deepened my understanding of the random forest algorithm. Implementing training and prediction with sklearn gave me a better grasp of data mining and of Python, made me more comfortable with the language, and once again showed me how convenient Python is.

Source Code

Naive Bayes (NB)

import pandas as pd
import numpy as np
import re
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

#read the training and test sets
data = pd.read_excel("train.xlsx", encoding = 'utf-8')
test = pd.read_excel("test.xlsx", encoding = 'utf-8')

#drop samples with missing labels or titles
index = data['channelName'].notnull()
data = data[index]
index = data['title'].notnull()
data = data[index]
index = test['channelName'].notnull()
test = test[index]
#print(news)

#strip punctuation
re_obj = re.compile(r"['~`!#$%^&*()_+-=|\';:/.,?><~·!@#¥%……&*()——+-=“:’;、。,?》《{}':【】《》‘’“”\s]+")
def get_stopword():
    s = set()
    with open('中文停用词表.txt', encoding = 'utf-8') as f:
        for line in f:
            s.add(line.strip())
    return s
stopword = get_stopword()

def remove_stopword(words):
    return [word for word in words if word not in stopword]
def Data_preprocessing(text):
    text = re_obj.sub("", text)
    text = jieba.lcut(text)
    text = remove_stopword(text)
    return " ".join(text)
    
data['title'] = data['title'].apply(Data_preprocessing)
test['title'] = test['title'].apply(Data_preprocessing)

#map class labels to integer ids
dic = {'财经' : 0, '房产' : 1, '教育' : 2, '科技' : 3, '军事' : 4, '汽车' : 5, '体育' : 6, '游戏' : 7, '娱乐' : 8, '养生健康' : 9, '历史' : 10, '搞笑' : 11, '旅游' : 12, '母婴' : 13}
data['channelName'] = data['channelName'].map(dic)
test['channelName'] = test['channelName'].map(dic)
#print(news['channelName'].value_counts())

x_train = data['title']
y_train = data['channelName']
x_test = test['title']
y_test = test['channelName']

#ngram_range: use unigrams and bigrams; keep the top 30000 features; sublinear_tf applies log scaling to term frequencies
vectorizer = TfidfVectorizer(ngram_range=(1,2), analyzer='word', max_features=30000, sublinear_tf=True)
vectorizer.fit(x_train)
#learn the vocabulary of all tokens in the training documents
model = MultinomialNB(alpha=0.001)
model.fit(vectorizer.transform(x_train), y_train)

yPred = model.predict(vectorizer.transform(x_test))
yTest = np.array(y_test)
yPred = np.array(yPred)

print(accuracy_score(yTest, yPred))
print(recall_score(yTest, yPred, average='macro'))

Random Forest

import pandas as pd
import jieba
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
import matplotlib.pyplot as plt

with open('中文停用词表.txt', 'r', encoding='utf-8') as f:
    stopWords = [line.strip('\n') for line in f.readlines()]
    stopWords += '\n'
    
data = pd.read_excel('train.xlsx',sheet_name=None)
test = pd.read_excel('test.xlsx',sheet_name=None)
classname = ['财经','房产','教育','科技','军事','汽车','体育','游戏','娱乐','养生健康','历史','搞笑','旅游','母婴']

#filter out tokens that appear in the stop-word list
def pre_treating(para):
    words = jieba.cut(str(para))  #word segmentation
    words = [word for word in words if len(word) > 1]
    words = [word for word in words if word not in stopWords]
    return words

#remove useless tokens from the training and test sets, keeping only potentially useful ones
for name in classname:
    data[name]['words'] = data[name]['content'].apply(pre_treating)
    test[name]['words'] = test[name]['content'].apply(pre_treating)

#add the class label as a new column
i = 0
for name in classname:
    data[name]['flag'] = i
    test[name]['flag'] = i
    i += 1
    #print(data[name])

result = data[classname[0]]
testdata = test[classname[0]]
for name in classname[1:]:
    result = result.append(data[name])
    testdata = testdata.append(test[name])
#print(result)

topWordNum = 800
items = result['words'].values.tolist()
words = []
for item in items:
    words.extend(item)
#count the topWordNum most frequent tokens
wordCount = pd.Series(words).value_counts()[0:topWordNum]
#drop the counts and keep only the tokens themselves
wordCount = wordCount.index.values.tolist()


def wordsToVec(words):
    #convert a token list into a vector of counts of the high-frequency words
    vec = map(lambda word:words.count(word),wordCount)
    vec = list(vec)
    return vec

result['vec'] = result['words'].apply(wordsToVec)
result = result.drop(['content'],axis=1)
result = result.drop(['channelName'],axis=1)

testdata['vec'] = testdata['words'].apply(wordsToVec)
testdata = testdata.drop(['content'],axis=1)
testdata = testdata.drop(['channelName'],axis=1)
                         
xTrain = result['vec'].tolist()
yTrain = result['flag'].tolist()
xTest = testdata['vec'].tolist()
yTest = testdata['flag'].tolist()

def get_rf_ascore(my_para):  #random forest: accuracy on the test set
    
    clf = RandomForestClassifier(n_estimators=my_para)
    clf.fit(xTrain, yTrain)
    
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    
    score = accuracy_score(y_test, y_pre)
    return score

def get_rf_rscore(my_para):
    
    clf = RandomForestClassifier(n_estimators=my_para)
    clf.fit(xTrain, yTrain)
    
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    
    score = recall_score(y_test, y_pre,average = 'macro')
    return score

#parameter sweep and result display
rfaScore = [ ]
rfrScore = [ ]
estimator = np.arange(1, 20, 1)
              
for i in estimator:
    temp_ascore = get_rf_ascore(i)
    temp_rscore = get_rf_rscore(i)
    rfaScore.append(temp_ascore)
    rfrScore.append(temp_rscore)

plt.plot(rfaScore,color='red')
plt.plot(rfrScore,color='green')
plt.xlabel('estimator')
plt.ylabel('score')
plt.show()

Decision Tree

import pandas as pd
import jieba
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
import matplotlib.pyplot as plt

with open('中文停用词表.txt', 'r', encoding='utf-8') as f:
    stopWords = [line.strip('\n') for line in f.readlines()]
    stopWords += '\n'
    
data = pd.read_excel('train.xlsx',sheet_name=None)
test = pd.read_excel('test.xlsx',sheet_name=None)
classname = ['财经','房产','教育','科技','军事','汽车','体育','游戏','娱乐','养生健康','历史','搞笑','旅游','母婴']

#filter out tokens that appear in the stop-word list
def pre_treating(para):
    words = jieba.cut(str(para))  #word segmentation
    words = [word for word in words if len(word) > 1]
    words = [word for word in words if word not in stopWords]
    return words

#remove useless tokens from the training and test sets, keeping only potentially useful ones
for name in classname:
    data[name]['words'] = data[name]['content'].apply(pre_treating)
    test[name]['words'] = test[name]['content'].apply(pre_treating)

#add the class label as a new column
i = 0
for name in classname:
    data[name]['flag'] = i
    test[name]['flag'] = i
    i += 1
    #print(data[name])

result = data[classname[0]]
testdata = test[classname[0]]
for name in classname[1:]:
    result = result.append(data[name])
    testdata = testdata.append(test[name])
#print(result)

topWordNum = 800
items = result['words'].values.tolist()
words = []
for item in items:
    words.extend(item)
#count the topWordNum most frequent tokens
wordCount = pd.Series(words).value_counts()[0:topWordNum]
#drop the counts and keep only the tokens themselves
wordCount = wordCount.index.values.tolist()


def wordsToVec(words):
    #convert a token list into a vector of counts of the high-frequency words
    vec = map(lambda word:words.count(word),wordCount)
    vec = list(vec)
    return vec

result['vec'] = result['words'].apply(wordsToVec)
result = result.drop(['content'],axis=1)
result = result.drop(['channelName'],axis=1)

testdata['vec'] = testdata['words'].apply(wordsToVec)
testdata = testdata.drop(['content'],axis=1)
testdata = testdata.drop(['channelName'],axis=1)
                         
xTrain = result['vec'].tolist()
yTrain = result['flag'].tolist()
xTest = testdata['vec'].tolist()
yTest = testdata['flag'].tolist()

def get_dt_ascore(my_para):

    clf = tree.DecisionTreeClassifier(max_depth=my_para)
    clf.fit(xTrain, yTrain)
    
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    
    score = accuracy_score(y_test, y_pre)
    return score

def get_dt_rscore(my_para):

    clf = tree.DecisionTreeClassifier(max_depth=my_para)
    clf.fit(xTrain, yTrain)
    
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    
    score = recall_score(y_test, y_pre,average = 'macro')
    return score

dtaScore = [ ]
dtrScore = [ ]
testDepth = np.arange(1, 100, 1)

for i in testDepth:
    temp_ascore = get_dt_ascore(i)
    temp_rscore = get_dt_rscore(i)
    dtaScore.append(temp_ascore)
    dtrScore.append(temp_rscore)

plt.plot(dtaScore,color='yellow')
plt.plot(dtrScore,color='blue')
plt.ylabel('score')
plt.xlabel('testDepth')
plt.show()