Python -- A Programmer's Guide to Data Mining: Getting Started with Recommendation Systems - Classification - 007

In the previous chapter we introduced the nearest-neighbor classification algorithm. Let's now review it through a few examples.

The raw data consists of twenty top-ranked female athletes from the 2008 and 2012 Olympics. The basketball players competed in the WNBA; the track athletes ran the marathon at the 2012 Olympics. Although the dataset is small, we can still apply some data-mining algorithms to it.

We also have a list of athletes whose sport we need to predict. Let's build a classifier!

I'll train the classifier on the data in the first file, then evaluate it on the data in the test file.

The file format is roughly as follows: each line is tab-separated, holding the athlete's name, class, and two numeric measurements.
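Judging from the parsing code and the printed output below, the lines presumably look like this (reconstructed for illustration, not copied from the actual file):

Asuka Teramoto	Gymnastics	54	66
Brittainey Raven	Basketball	72	162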

import numpy as np
import pandas as pd

First, read the data from the text file:

def read_data_from_file(filename):
    result = {}
    with open(filename, 'r', encoding='utf-8-sig') as f:
        for line in f:
            line = line.split('\t')
            # columns: name, class, then two numeric measurements
            # (line[2] feeds 'w' and line[3] feeds 'h'); the class field
            # is cleaned of stray '\u' escape debris
            result[line[0]] = {'class': line[1].replace(r'\u', ''),
                               'h': int(line[3].strip()),
                               'w': int(line[2])}
    # build a DataFrame keyed by athlete name (names become the columns)
    result = pd.DataFrame(result)
    return result
train_data = read_data_from_file('./datamining/7/athletesTrainingSet.txt')
train_data = train_data.T  # transpose so each row is one athlete
print(train_data.head())
                       class    h   w
Asuka Teramoto    Gymnastics   66  54
Brittainey Raven  Basketball  162  72
Chen Nan          Basketball  204  78
Gabby Douglas     Gymnastics   90  49
Helalia Johannes       Track   99  65

Now let's standardize the data:

def normalize(data):
    # z-score standardization: (value - mean) / standard deviation
    data_mean = data[['h','w']].mean()
    data_std = data[['h','w']].std()
    data = (data[['h','w']] - data_mean) / data_std
    return data

nor_train_data = normalize(train_data)
print(nor_train_data.head())
                         h         w
Asuka Teramoto   -1.29515  -1.46938
Brittainey Raven  0.939073  0.881631
Chen Nan          1.91655   1.6653
Gabby Douglas    -0.736597 -2.12244
Helalia Johannes -0.527138 -0.032653
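As a quick sanity check, we can recompute one cell directly from the definition of the z-score, here for Asuka Teramoto's 'h' value of 66:

# recompute a single standardized value by hand
z = (66 - train_data['h'].mean()) / train_data['h'].std()
print(z)  # matches the -1.29515 shown above for Asuka Teramoto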

Next, let's look at the modified standard score, which uses the median and the absolute standard deviation instead of the mean and standard deviation:

def correctNormalize(data):
    # modified standard score: (value - median) / absolute standard deviation
    data_mean = data[['h','w']].median()
    data_std = (data[['h','w']] - data_mean).abs().mean()
    data = (data[['h','w']] - data_mean) / data_std
    return data

cor_train_data = correctNormalize(train_data)
print(cor_train_data.head())
                         h          w
Asuka Teramoto   -1.21842   -1.93277
Brittainey Raven  1.63447    1.09244
Chen Nan          2.88262    2.10084
Gabby Douglas    -0.505201  -2.77311
Helalia Johannes -0.237741  -0.0840336
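In formula form, the modified standard score of a value x is (x − median) / asd, where asd = (1/n) · Σ|xᵢ − median| is the absolute standard deviation. Because the median and the absolute deviation are far less sensitive to extreme values than the mean and standard deviation, a single unusually tall or heavy athlete distorts the scaled features much less.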

Next, compute the distances:

# Manhattan distance
def manhattan(v1, v2):
    temp = (v1 - v2).abs()
    # when v2 is a whole DataFrame, return one distance per row
    if len(temp.shape) > 1:
        return temp.sum(axis=1)
    return temp.sum()

# Euclidean distance
def Euclidean(v1, v2):
    temp = (v1 - v2)**2
    if len(temp.shape) > 1:
        return np.sqrt(temp.sum(axis=1))
    return np.sqrt(np.sum(temp))
print(manhattan(train_data.loc['Asuka Teramoto'][['h','w']], train_data.loc['Brittainey Raven'][['h','w']]))
114
print(Euclidean(train_data.loc['Asuka Teramoto'][['h','w']], train_data.loc['Brittainey Raven'][['h','w']]))
97.67292357659824
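These figures are easy to verify by hand: with (h, w) = (66, 54) for Asuka Teramoto and (162, 72) for Brittainey Raven, the Manhattan distance is |66 − 162| + |54 − 72| = 96 + 18 = 114, and the Euclidean distance is √(96² + 18²) = √9540 ≈ 97.67.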

Next, find the nearest neighbors:

# return the k nearest neighbors
def nearestNeighbor(v1, data, k, method='m'):
    """Return the k nearest neighbors of v1 in data."""
    if method == 'm':
        return manhattan(v1, data).sort_values()[:k]
    return Euclidean(v1, data).sort_values()[:k]

# Manhattan distance
print(nearestNeighbor(pd.Series({'h':66, 'w':54}), train_data[['h','w']], 3))
# Euclidean distance
print(nearestNeighbor(pd.Series({'h':66, 'w':54}), train_data[['h','w']], 3, 'e'))
Asuka Teramoto     0.0
Linlin Deng        2.0
Rebecca Tunney    15.0
dtype: float64
Asuka Teramoto     0.0000
Linlin Deng        2.0000
Rebecca Tunney    11.7047
dtype: float64
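In both cases the query (h = 66, w = 54) is exactly Asuka Teramoto's own measurements, which is why she appears at distance 0; Linlin Deng and Rebecca Tunney are the next-closest athletes under either metric.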

All that's left is to write a classifier:

def classifier(data, v1, method):
    # use only the single nearest element (k = 1)
    k = 1
    result = nearestNeighbor(v1, data, k, method=method)
    return result

near = classifier(train_data[['h','w']], pd.Series({'h':68, 'w':52}), method='m')
print(train_data.loc[near.index]['class'].values)
['Gymnastics']
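The classifier above always uses a single neighbor (k = 1). For larger k, a common extension is to take a majority vote among the k nearest neighbors. Here is a minimal sketch of that idea, reusing nearestNeighbor (knn_classifier is my own helper, not from the original post):

# hypothetical majority-vote variant of the classifier
def knn_classifier(data, labels, v1, k=3, method='m'):
    near = nearestNeighbor(v1, data, k, method=method)
    # look up the neighbors' classes and return the most frequent one
    return labels.loc[near.index].value_counts().idxmax()

print(knn_classifier(train_data[['h','w']], train_data['class'], pd.Series({'h':68, 'w':52})))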

Finally, let's use the test dataset and see what the accuracy is:

test_data = read_data_from_file('./datamining/7/athletesTestSet.txt')
test_data = test_data.T
print(len(test_data))
20
near = classifier(train_data[['h','w']], test_data.loc['Aly Raisman'][['h','w']], method='m')
print(train_data.loc[near.index]['class'].values)
['Track']
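Note that this particular prediction is actually wrong: Aly Raisman is a gymnast, not a track athlete. Misses like this are what keep the accuracy below 100% in the next step.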

Let's write a function that tallies the accuracy over the test set:

def test(test_data, train_data, method='m'):
    i = 0
    for item in test_data.iterrows():
        near = classifier(train_data[['h','w']], item[1][['h','w']], method=method)
        pre = train_data.loc[near.index]['class'].values[0]
        if pre == item[1]['class']:
            i += 1
    print('Accuracy: %s%%' % (i/len(test_data)*100))

test(test_data, train_data, method='m')

Accuracy: 80.0%

Let's also quickly check the accuracy with Euclidean distance:

test(test_data, train_data, method='e')
Accuracy: 80.0%

Because the dataset is small and low-dimensional, it's hard to see any difference between the two distance metrics.

Next, let's check the accuracy on standardized data. Note in particular that when standardizing the test set, you must use the training set's mean and standard deviation, so that no information about the test set leaks into the scaling:
data_mean = train_data[['h','w']].mean()
data_std = train_data[['h','w']].std()
nor_test_data = (test_data[['h','w']] - data_mean) / data_std
# re-insert the original class column
nor_test_data.insert(0, 'class', test_data['class'])
print(nor_test_data.head())
nor_train_data.insert(0, 'class', train_data['class'])
print(nor_train_data.head())
                        class         h         w
Aly Raisman        Gymnastics -0.154767 -0.424489
Crystal Langhorne  Basketball  1.59072   1.14285
Diana Taurasi      Basketball  0.962347  0.881631
Erin Thorn         Basketball  0.520156  0.489795
Hannah Whelan      Gymnastics -0.10822  -0.293877
                        class         h         w
Asuka Teramoto     Gymnastics -1.29515  -1.46938
Brittainey Raven   Basketball  0.939073  0.881631
Chen Nan           Basketball  1.91655   1.6653
Gabby Douglas      Gymnastics -0.736597 -2.12244
Helalia Johannes        Track -0.527138 -0.032653
test(nor_test_data, nor_train_data, method='e')

Accuracy: 80.0%

The accuracy is unchanged. We should also check the accuracy with the modified standard score:

data_mean = train_data[['h','w']].median()
data_std = (train_data[['h','w']] - data_mean).abs().mean()
cor_test_data = (test_data[['h','w']] - data_mean) / data_std
# re-insert the original class column
cor_test_data.insert(0, 'class', test_data['class'])
print(cor_test_data.head())
cor_train_data.insert(0, 'class', train_data['class'])
print(cor_train_data.head())
                        class        h          w
Aly Raisman        Gymnastics  0.237741 -0.588235
Crystal Langhorne  Basketball  2.46657   1.42857
Diana Taurasi      Basketball  1.66419   1.09244
Erin Thorn         Basketball  1.09955   0.588235
Hannah Whelan      Gymnastics  0.297177 -0.420168
                        class        h          w
Asuka Teramoto     Gymnastics -1.21842  -1.93277
Brittainey Raven   Basketball  1.63447   1.09244
Chen Nan           Basketball  2.88262   2.10084
Gabby Douglas      Gymnastics -0.505201 -2.77311
Helalia Johannes        Track -0.237741 -0.0840336
test(cor_test_data, cor_train_data, method='m')

Accuracy: 80.0%

The Iris dataset

We can also test on the Iris dataset, which is quite well known in the data-mining world.

Each sample records four measurements (sepal length, sepal width, petal length, petal width), and the column we want to predict is Species.

The Iris dataset can be obtained through the sklearn library.

from sklearn.datasets import load_iris
iris = load_iris()
print(type(iris))
print(len(iris['data']))
150

There are 150 samples in total, which need to be split into a training set and a test set.

You can write your own function to randomly draw the training and test sets, or use a ready-made helper.
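For illustration, a hand-rolled split could look like the sketch below (manual_split is my own name; the post itself uses sklearn's train_test_split, shown right after):

# hypothetical manual train/test split via a shuffled index
def manual_split(x, y, test_ratio=0.2, seed=0):
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(x))        # shuffled positions 0..n-1
    n_test = int(len(x) * test_ratio)    # number of test samples
    return x[idx[n_test:]], x[idx[:n_test]], y[idx[n_test:]], y[idx[:n_test]]
# usage: x_tr, x_te, y_tr, y_te = manual_split(iris['data'], iris['target'])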

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=0, test_size=0.2)
print(len(x_train), len(x_test))
120 30
print(x_train[:10])
[[6.4 3.1 5.5 1.8]
 [5.4 3.  4.5 1.5]
 [5.2 3.5 1.5 0.2]
 [6.1 3.  4.9 1.8]
 [6.4 2.8 5.6 2.2]
 [5.2 2.7 3.9 1.4]
 [5.7 3.8 1.7 0.3]
 [6.  2.7 5.1 1.6]
 [5.9 3.  4.2 1.5]
 [5.8 2.6 4.  1.2]]

Likewise, standardize the data, basing the statistics on the training set:

mean = x_train.mean(axis=0)  # per-feature means
# note: std() without an axis returns a single scalar over the whole array;
# per-feature scaling would use x_train.std(axis=0)
std = x_train.std()
x_train_nor = (x_train - mean) / std
print(x_train_nor[:10])
x_test_nor = (x_test - mean) / std
[[ 0.26146462 0.02350244 0.84818618 0.28622611]
[-0.24215904 -0.02685993 0.34456252 0.13513902]
[-0.34288378 0.2249519 -1.16630846 -0.51957174]
[ 0.11037752 -0.02685993 0.54601199 0.28622611]
[ 0.26146462 -0.12758466 0.89854855 0.48767558]
[-0.34288378 -0.17794703 0.04238832 0.08477665]
[-0.09107195 0.376039 -1.06558373 -0.46920938]
[ 0.06001515 -0.17794703 0.64673672 0.18550138]
[ 0.00965279 -0.02685993 0.19347542 0.13513902]
[-0.04070958 -0.22830939 0.09275069 -0.01594808]]

Next, compute the distances:

def get_distance(v1, v2, method='e'):
    # Manhattan distance
    if method == 'm':
        temp = abs(v1 - v2)
        # for a 2-D array, return one distance per row
        if len(temp.shape) > 1:
            return temp.sum(axis=1)
        return temp.sum()
    # Euclidean distance (the default)
    temp = (v1 - v2)**2
    if len(temp.shape) > 1:
        return np.sqrt(temp.sum(axis=1))
    return np.sqrt(np.sum(temp))

a = np.array([6.8, 2.5, 5, 2.8])
print(get_distance(x_train, a, method='m'))
[2.5 3.7 8.7 2.3 1.9 4.3 8.2 2.3 3.5 3.7 1.7 9.1 1.3 8.4 8.7 5.6 2.3 1.9
2.4 1.9 4. 3. 2.8 4.3 1.8 2.2 3.5 1.4 2.6 1.8 3.4 8.6 2.2 3.7 4.5 3.8
3. 3.1 8.8 9.6 1.8 4. 8.2 8.9 2.1 8.9 2.5 5.4 9.2 2.7 2.9 3.8 8.8 2.4
2. 2.2 2. 9. 8.9 1.9 2.3 8.7 2.4 8.6 1.9 1.8 9.4 9.1 2.9 8.8 9.2 8.5
4. 2.1 3.6 8.9 8.6 8.7 3.4 2.7 8.9 8.1 3. 8.2 3.4 5.5 4.6 2.7 9.4 2.5
8.9 3.7 9. 8.8 2.9 8.8 2.4 3.8 3.9 4.5 1.9 1.3 1.9 1.7 8.3 4.5 1.7 2.2
8.7 2.5 4.1 2.8 2.8 8.6 8.8 8.7 2.5 3.9 4.5 9.1]
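For example, against the first training sample [6.4 3.1 5.5 1.8], the Manhattan distance to a is |6.8 − 6.4| + |2.5 − 3.1| + |5 − 5.5| + |2.8 − 1.8| = 0.4 + 0.6 + 0.5 + 1.0 = 2.5, the first entry in the array above.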

Next, build a classifier:

def nearestNeighbor(v1, data, y, method='m'):
    """Return the label of v1's nearest neighbor (redefined for numpy arrays)."""
    dis = get_distance(v1, data, method=method)
    # index of the smallest distance
    index = np.argmin(dis)
    return y[index]

predict = nearestNeighbor(a, x_train, y_train, method='e')
print('This iris belongs to class ' + str(predict))

This iris belongs to class 2

Finally, let's look at the accuracy:

def test(x_train, x_test, y_train, y_test, method='e'):
    num = 0
    for i in range(len(x_test)):
        item = x_test[i]
        predict = nearestNeighbor(item, x_train, y_train, method=method)
        if y_test[i] == predict:
            num += 1
    print('Accuracy: %s%%' % (num/len(x_test)*100))

Accuracy with Euclidean distance:

test(x_train, x_test, y_train, y_test, method='e')

Accuracy: 100.0%

Accuracy with Manhattan distance:

test(x_train, x_test, y_train, y_test, method='m')

Accuracy: 96.66666666666667%

Accuracy on standardized data with Euclidean distance (the labels are never standardized, so we pass y_train and y_test unchanged):

test(x_train_nor, x_test_nor, y_train, y_test, method='e')

Accuracy: 100.0%

Accuracy on standardized data with Manhattan distance:

test(x_train_nor, x_test_nor, y_train, y_test, method='m')

Accuracy: 96.66666666666667%

It seems the accuracy is unchanged before and after standardization here.

When different features are measured on inconsistent scales, you need to standardize them so that they vary over comparable ranges; otherwise the distance is dominated by whichever feature takes the largest values, as the tiny example below shows.
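A small illustration of the problem, with invented numbers: if one feature is height in centimetres and another lives on a 0-1 scale, the raw distance is driven almost entirely by height:

# hypothetical samples: (height_cm, score_on_0_to_1_scale)
p = np.array([180.0, 0.9])
q = np.array([165.0, 0.1])
print(get_distance(p, q, method='m'))  # 15.8: the 15 cm gap dwarfs the 0.8 gap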

Let's compare the accuracy obtained with different standardization methods:
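The post breaks off here. As a sketch of how that comparison might be wired up on the iris split, assuming "different methods" means the z-score versus the modified standard score used earlier (normalize_zscore and normalize_modified are my own helpers):

# hypothetical side-by-side comparison of the two scaling schemes
def normalize_zscore(train, other):
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (other - mean) / std

def normalize_modified(train, other):
    med = np.median(train, axis=0)
    asd = np.mean(np.abs(train - med), axis=0)  # absolute standard deviation
    return (train - med) / asd, (other - med) / asd

for name, fn in [('z-score', normalize_zscore), ('modified', normalize_modified)]:
    tr, te = fn(x_train, x_test)
    print(name)
    test(tr, te, y_train, y_test, method='e')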