Datasets

  • Dataset1.txt
    Height, weight, and sex data for 328 students (78 female, 250 male)
  • Dataset2.txt
    Data for 124 students (40 female, 84 male)
  • Dataset3.txt
    Data for 90 students (16 female, 74 male)
    Baidu Netdisk, extraction code: 8plu

Classification with the condensed nearest neighbor method

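The condensed nearest neighbor method works as follows:

  1. Split the initial training set R into two parts, A and B. A starts out empty.
  2. Pick one sample from R at random and put it in A; put all other samples in B. Use A as the nearest-neighbor reference set to classify each sample in B. If sample i is classified correctly, it stays in B; otherwise it is moved into A.
  3. Repeat the pass over B until every sample remaining in B is classified correctly.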
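The three steps above can be sketched as a small function. This is a minimal illustration, not the script used in this post: the name `condense`, the `seed` parameter, and the toy data are assumptions made for the example, and it uses the 1-nearest-neighbor rule with Euclidean distance.

```python
import numpy as np

def condense(X, y, seed=0):
    """Return the indices of the samples kept in A (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    in_A = np.zeros(len(X), dtype=bool)
    in_A[int(rng.integers(len(X)))] = True   # step 2: one random sample seeds A
    changed = True
    while changed:                           # step 3: stop when a full pass moves nothing
        changed = False
        for i in np.flatnonzero(~in_A):      # samples that are in B at the start of the pass
            A_idx = np.flatnonzero(in_A)     # A may have grown during this pass
            d = np.linalg.norm(X[A_idx] - X[i], axis=1)
            nearest = A_idx[np.argmin(d)]
            if y[nearest] != y[i]:           # misclassified by A: move the sample into A
                in_A[i] = True
                changed = True
    return np.flatnonzero(in_A)

# Toy data (made up for illustration): two well-separated clusters
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
kept = condense(X_demo, y_demo)
print(len(kept), "of", len(X_demo), "samples kept")
```

On well-separated clusters like these, one sample per class is enough to classify everything else, so A stays very small; samples near the class boundary are the ones that tend to be kept on real data.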
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 19 22:27:40 2020

@author: lihuanyu
"""
#%% Data preprocessing
import numpy as np
from gain_xy import gain_xy  # helper module (not shown here) that returns the train/test splits
from sklearn.metrics import accuracy_score
import csv

X_train,y_train,X_test1,y_test1,X_test2,y_test2 = gain_xy()
#%%
import numpy as np
from math import sqrt
from collections import Counter
class KNNClassifier:
    # k is the number of nearest neighbors that vote; it must be at least 1
    def __init__(self,k):
        self.k = k
        self.x_train_fit = None
        self.y_train_fit = None 
    def fit(self,x_train,y_train):
        self.x_train_fit = x_train
        self.y_train_fit = y_train
        return self
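    # Predict one sample: compute its distance to every training point,
    # then take a majority vote among the k nearest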
    def _predict(self,x):
        distance = []
        for x_train in self.x_train_fit:
            distance.append(sqrt(np.sum((x_train - x) ** 2)))
        nearst = np.argsort(distance)
        topk_y = []
        for j in nearst[:self.k]:
            topk_y.append(self.y_train_fit[j])
        votes = Counter(topk_y)
        result = votes.most_common(1)[0][0]
        return result
    # Batch prediction: classify every row of X_test and print the accuracy against y_test
    def predict(self,X_test,y_test):
        y_predict = [self._predict(i) for i in X_test]
        print("Accuracy:",accuracy_score(y_test,np.array(y_predict)))
        
#%% k-NN prediction
knn = KNNClassifier(k=5)
knn.fit(X_train,y_train)
knn.predict(X_test1,y_test1)
knn.predict(X_test2,y_test2)
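#%% Condensing
k=1  # the condensing step uses the 1-nearest-neighbor rule

Store = [X_train[0]]  # condensed sample set (A), seeded with the first training sample
Store_y = [y_train[0]]
Grabbag = [i for i in X_train[1:]]  # remaining samples (B)
Grabbag_y = [i for i in y_train[1:]]
# Single pass over Grabbag: classify each sample with the current Store
for x_t,y_t in zip(Grabbag,Grabbag_y):
    distance = []
    for x in Store:
        distance.append(sqrt(np.sum((x - x_t) ** 2)))
    nearst = np.argsort(distance)
    topk_y = [Store_y[t] for t in nearst[:k]]
    votes = Counter(topk_y)
    result = votes.most_common(1)[0][0]
    # Move the sample into Store only if the current Store misclassifies it
    if result != y_t:
        Store.append(x_t)
        Store_y.append(y_t)
        print(len(Store),len(Store_y))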
#%% Results
X_train1 = np.array(Store)
y_train1 = np.array(Store_y)
import matplotlib.pyplot as plt
plt.scatter(X_train1[y_train1==0,0],X_train1[y_train1==0,1],color='red')
plt.scatter(X_train1[y_train1==1,0],X_train1[y_train1==1,1],color='blue')


knn = KNNClassifier(k=5)
knn.fit(X_train1,y_train1)
knn.predict(X_test1,y_test1)
knn.predict(X_test2,y_test2)
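The distance-and-vote logic in `KNNClassifier._predict` loops over the training points one at a time. As a hedged alternative, the same Euclidean k-NN vote can be written with NumPy broadcasting. The function name `knn_predict` and the toy arrays below are assumptions for illustration and are not part of the script above.

```python
import numpy as np
from collections import Counter

def knn_predict(X_fit, y_fit, X_query, k=5):
    """Majority vote among the k nearest training points (hypothetical helper)."""
    X_fit = np.asarray(X_fit, dtype=float)
    y_fit = np.asarray(y_fit)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise differences via broadcasting: shape (n_query, n_fit, n_features)
    diff = X_query[:, None, :] - X_fit[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))      # shape (n_query, n_fit)
    nearest = np.argsort(dist, axis=1)[:, :k]    # indices of the k closest points per query
    return np.array([Counter(y_fit[row].tolist()).most_common(1)[0][0]
                     for row in nearest])

# Toy data (made up for illustration)
X_fit_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_fit_demo = np.array([0, 0, 0, 1, 1, 1])
pred_demo = knn_predict(X_fit_demo, y_fit_demo, [[0.2, 0.2], [5.5, 5.5]], k=3)
print(pred_demo)
```

The broadcasted difference array holds n_query × n_fit × n_features values, which is fine for data sets of a few hundred students but would need chunking for much larger ones.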

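Results

test1, k-nearest neighbor accuracy: 0.9032258064516129
test2, k-nearest neighbor accuracy: 0.8444444444444444
test1, condensed nearest neighbor accuracy: 0.8548387096774194
test2, condensed nearest neighbor accuracy: 0.8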
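To make the accuracies easier to compare, they can be converted back into counts of correctly classified students. The sketch below assumes that test1 is Dataset2.txt (124 students) and test2 is Dataset3.txt (90 students). The post does not state this mapping; it is inferred from the fact that the reported accuracies are exact fractions of those sizes.

```python
# Assumed test-set sizes (inferred, not stated in the post)
sizes = {"test1": 124, "test2": 90}

# Accuracies as reported above
reported = {
    ("test1", "k-NN"): 0.9032258064516129,
    ("test2", "k-NN"): 0.8444444444444444,
    ("test1", "condensed"): 0.8548387096774194,
    ("test2", "condensed"): 0.8,
}

correct = {}
for (test_set, method), acc in reported.items():
    n = sizes[test_set]
    correct[(test_set, method)] = round(acc * n)  # nearest whole number of students
    print(f"{test_set} {method}: {correct[(test_set, method)]}/{n} correct")
```

Under this assumption, training on the condensed set costs 6 correct predictions on test1 and 4 on test2 compared with training on the full set.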

The distribution of the sample points after condensing is shown in the figure below:

[Figure: sample distribution after condensing]