Official tutorial: Create Simple Text Model for Classification
1. Class frequency histogram:
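How data.event_type changes before and after running data.event_type = categorical(data.event_type);
The double quotes disappear: the values are converted from strings to categorical, and categorical values are displayed without quotes.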
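The conversion can be checked on a tiny example. This is only a sketch: the three values are made up and stand in for the real event_type column.

```matlab
% Made-up labels standing in for data.event_type (assumption, not the real data).
event_type = ["Hail"; "Flood"; "Hail"];
classBefore = class(event_type);        % 'string'

event_type = categorical(event_type);   % same call as in the tutorial
classAfter = class(event_type);         % 'categorical'

% The distinct labels become the categories of the array.
cats = categories(event_type);
disp(event_type)
```

Displaying the variable before and after the conversion shows the quotes present on the string array and absent on the categorical array.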
data.event_type = categorical(data.event_type);
figure
h = histogram(data.event_type);
xlabel("Class")
ylabel("Frequency")
title("Class Distribution")
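This plots the histogram of class frequencies. Of course, for NLP you also have to remove stop words and words that are too long or too short.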
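A minimal sketch of that cleanup with Text Analytics Toolbox functions follows. The sentence is made up, and the length thresholds 2 and 15 are illustrative choices rather than values taken from these notes.

```matlab
% Made-up sentence; any string array of raw text is handled the same way.
textData = "The storm produced an extraordinarily large hailstone in a field";

documents = tokenizedDocument(textData);      % split the text into words
documents = removeStopWords(documents);       % drop stop words such as "the", "an", "in", "a"
documents = erasePunctuation(documents);      % strip punctuation
documents = removeShortWords(documents,2);    % drop words with 2 or fewer characters
documents = removeLongWords(documents,15);    % drop words with 15 or more characters

words = string(documents);                    % remaining words as a string array
disp(words)
```

Wrapping these calls in one function makes it easy to apply the same preprocessing to both the training and the test text.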
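2. Remove low-frequency classes
Some classes occur too rarely, so use the histogram counts to find and remove them.
(Object-oriented view: h is a histogram object, and BinCounts and Categories are its properties.)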
classCounts = h.BinCounts;
classNames = h.Categories;
idxLowCounts = classCounts < 10;
infrequentClasses = classNames(idxLowCounts);
idxInfrequent = ismember(data.event_type,infrequentClasses);
data(idxInfrequent,:) = [];
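The same filtering can also be sketched without drawing a histogram, using countcats and categories directly. The table below is toy data, and the final removecats call is an extra step assumed here so that the deleted classes no longer appear as empty categories.

```matlab
% Toy table standing in for the real data (all values are made up).
event_type = categorical([repmat("Hail",12,1); repmat("Flood",11,1); repmat("Tsunami",3,1)]);
event_narrative = "report " + (1:26)';
data = table(event_type, event_narrative);

classNames = categories(data.event_type);      % class names
classCounts = countcats(data.event_type);      % number of rows per class
infrequentClasses = classNames(classCounts < 10);

% Delete rows of rare classes, then drop the now-unused categories.
data(ismember(data.event_type, infrequentClasses), :) = [];
data.event_type = removecats(data.event_type);
```

Here only the "Tsunami" rows fall below the threshold of 10, so 23 rows remain.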
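The same idea can be applied at the word level on a bag-of-words model (bag): remove words that appear 2 times or fewer, then remove any documents left empty together with their labels.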
bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
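To see what these three lines do, here is a small sketch on made-up documents. The texts and labels are assumptions for illustration only.

```matlab
% Four made-up training documents and their labels.
documents = tokenizedDocument([
    "hail damaged the roof"
    "hail damaged crops"
    "tornado"
    "hail and flood damaged roads"]);
YTrain = categorical(["Hail"; "Hail"; "Tornado"; "Flood"]);

bag = bagOfWords(documents);

% Keep only words that appear more than 2 times in total.
bag = removeInfrequentWords(bag,2);

% Documents with no remaining words are removed; idx holds their indices,
% which are used to delete the matching labels.
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
```

Only "hail" and "damaged" appear more than twice, so the third document becomes empty and is removed along with its label.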
3. Split into training and test sets
Hold out 10% of the data as the test set and use the remaining 90% as the training set.
cvp = cvpartition(data.event_type,'Holdout',0.1);
dataTrain = data(cvp.training,:);
dataTest = data(cvp.test,:);
Extract the text data and the labels
textDataTrain = dataTrain.event_narrative;
textDataTest = dataTest.event_narrative;
YTrain = dataTrain.event_type;
YTest = dataTest.event_type;
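The split and extraction can be tried end to end on toy data. In this sketch the table contents are made up, rng(0) is added only to make the random split repeatable, and the documented function forms training(cvp) and test(cvp) are used for the logical indices.

```matlab
rng(0)  % assumption: fixed seed so the random split is reproducible

% Toy table with 100 rows and two balanced classes (values are made up).
event_type = categorical([repmat("Hail",50,1); repmat("Flood",50,1)]);
event_narrative = "narrative " + (1:100)';
data = table(event_type, event_narrative);

% Holdout partition: 10% test, 90% training, stratified by class.
cvp = cvpartition(data.event_type,'Holdout',0.1);
dataTrain = data(training(cvp),:);
dataTest = data(test(cvp),:);

% Text and labels for each set.
textDataTrain = dataTrain.event_narrative;
textDataTest = dataTest.event_narrative;
YTrain = dataTrain.event_type;
YTest = dataTest.event_type;
```

Because the class labels are passed to cvpartition, each class is represented in both sets in roughly the same proportion as in the full data.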
Then train a supervised learning model on the training data.
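One possible way to do this, sketched under assumptions: encode the text as bag-of-words counts and fit a multiclass linear classifier with fitcecoc. The nine training sentences, the test sentence, and the choice of linear learners are illustrative, not taken from these notes.

```matlab
% Made-up training text and labels, three classes.
textDataTrain = [
    "large hail damaged cars"
    "hail dented the roof"
    "golf ball sized hail fell"
    "river flood closed the road"
    "flood water covered streets"
    "flash flood washed out a bridge"
    "tornado destroyed a barn"
    "tornado touched down briefly"
    "a tornado uprooted trees"];
YTrain = categorical(["Hail";"Hail";"Hail";"Flood";"Flood";"Flood"; ...
    "Tornado";"Tornado";"Tornado"]);

% Encode the documents as word counts (one row per document).
documents = tokenizedDocument(lower(textDataTrain));
bag = bagOfWords(documents);
XTrain = bag.Counts;

% Fit a multiclass error-correcting output codes model with linear learners.
mdl = fitcecoc(XTrain,YTrain,'Learners','linear');

% Encode new text with the same vocabulary and predict its class.
documentsTest = tokenizedDocument(lower("hail broke windows"));
XTest = encode(bag,documentsTest);
YPred = predict(mdl,XTest);
```

With real data, the predictions for the test set can be compared against YTest to estimate accuracy, for example with mean(YPred == YTest).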