Advanced Machine Learning, Section 1, Lesson 11


我是小白呀, 2020-12-10 14:47:50. Blog category: Python Machine Learning (Advanced)


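© Copyright belongs to the author. This is an original work by 51CTO blog author 我是小白呀. Contact the author for permission before reproducing it; unauthorized reproduction may be pursued legally.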

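k-Nearest Neighbors: A Case Study

Overview
Implementation
Loading the Iris dataset and its description
Splitting the Iris dataset
Standardizing the features
Complete code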

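Overview

This case study uses the well-known Iris dataset. Fisher used it in a classic paper, and it now ships with Scikit-learn as a textbook sample dataset.

Implementation

Loading the Iris dataset and its description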

from sklearn.datasets import load_iris
# Load the data with the loader and store it in the variable iris
iris = load_iris()
# Check the size of the data
print(iris.data.shape)
# Read the dataset description (a good habit)
print(iris.DESCR)
Output:
(150, 4)
Iris plants dataset
--------------------
**Data Set Characteristics:**
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Summary Statistics:
    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================
    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988
    [The rest of iris.DESCR is truncated in the source. The surviving fragments note that the data set contains 3 classes of 50 instances each and cite Fisher's 1936 paper in Annual Eugenics, among other references.]
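
The summary statistics in the description can be cross-checked against the data itself. The following is a minimal sketch, assuming NumPy is available alongside Scikit-learn; the variable names are illustrative, and the class-correlation column is not reproduced. Values may differ from the table in the last decimal place, depending on rounding and on the copy of the data bundled with your Scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data  # shape (150, 4): one row per flower, one column per feature

# Column-wise statistics, in the same order as iris.feature_names
mins = data.min(axis=0)
maxs = data.max(axis=0)
means = data.mean(axis=0)
stds = data.std(axis=0)

# Print one row per feature: name, min, max, mean, standard deviation
for name, lo, hi, mean, sd in zip(iris.feature_names, mins, maxs, means, stds):
    print(f"{name:<20} {lo:4.1f} {hi:4.1f} {mean:6.2f} {sd:6.2f}")
```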

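From the checks above and the dataset's own description, we learn that Iris contains 150 samples spread evenly over 3 classes (three kinds of iris, 50 samples each). Each sample is described by 4 shape features: the length and width of the sepal and of the petal. Because the dataset has no designated test set, we follow convention and split it at random: 75% of the samples are used to train the model and the remaining 25% to test it.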
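
The class balance and the feature layout described here can be confirmed directly. This is a small sketch, assuming NumPy is installed; `class_counts` and `names` are names chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many samples carry each class label (0, 1, 2)
class_counts = np.bincount(iris.target)

# Convert the class names to plain Python strings for clean printing
names = [str(n) for n in iris.target_names]
print(dict(zip(names, class_counts.tolist())))
# → {'setosa': 50, 'versicolor': 50, 'virginica': 50}

# The four features, in the column order of iris.data
print(iris.feature_names)
```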

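We cannot assume the samples are stored in random order. They may be arranged class by class, in which case simply cutting off the first part of the data would give an unbalanced training set. This is why we split with train_test_split, which shuffles the samples by default before splitting.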
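
To make the ordering concern concrete, the sketch below checks whether the labels are stored class by class and then looks at what a shuffled split produces. The `stratify` variant at the end is an optional addition that the article itself does not use; it is shown only as one way to keep the class proportions nearly equal in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# If the label sequence never decreases, the samples are stored class by class
labels_sorted = bool(np.all(np.diff(iris.target) >= 0))
print(labels_sorted)  # → True

# shuffle=True is the default; random_state makes the shuffle reproducible
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)
print(x_train.shape, x_test.shape)  # → (112, 4) (38, 4)
print(np.bincount(y_train), np.bincount(y_test))

# Optional variation (an assumption, not part of the article's code):
# stratify keeps the class proportions as equal as possible in both parts
_, _, ys_train, ys_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42, stratify=iris.target
)
print(np.bincount(ys_train), np.bincount(ys_test))
```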

Splitting the Iris dataset

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load the data with the loader and store it in the variable iris
iris = load_iris()
# Check the size of the data
print(iris.data.shape)
# Read the dataset description (a good habit)
print(iris.DESCR)
# Split the dataset: 75% for training, 25% for testing
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=42)

Standardizing the features

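from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the data with the loader and store it in the variable iris
iris = load_iris()
# Check the size of the data
print(iris.data.shape)
# Read the dataset description (a good habit)
print(iris.DESCR)
# Split the dataset: 75% for training, 25% for testing
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=42)
# Standardize: fit the scaler on the training set only, then apply it to the test set
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)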
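
The standardization step named in this section's heading can be sketched as follows. The scaler learns each feature's mean and standard deviation from the training set and reuses them on the test set, so both parts are measured on the same scale. The names `scaler`, `x_train_std` and `x_test_std` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

scaler = StandardScaler()
# fit_transform: learn per-feature mean and std from the training data, then scale it
x_train_std = scaler.fit_transform(x_train)
# transform: reuse the training statistics; do not re-fit on the test data
x_test_std = scaler.transform(x_test)

# After scaling, every training feature has mean ~0 and standard deviation ~1
print(x_train_std.mean(axis=0).round(6))
print(x_train_std.std(axis=0).round(6))
# The test set is only approximately centered, because it uses the training statistics
print(x_test_std.mean(axis=0).round(2))
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics, so the same raw measurement would map to different values in the two sets.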

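K-nearest neighbors is a very intuitive machine learning model. It has no parameter-training phase: no learning algorithm analyzes the training data. Instead, each test sample is classified directly from where it falls among the training samples. For this reason, K-nearest neighbors is one of the simplest non-parametric models.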
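
To illustrate why there is nothing to train, here is a minimal from-scratch sketch of the idea using NumPy. It is not Scikit-learn's implementation, which can use faster neighbor-search structures; the function `knn_predict` and the toy data are hypothetical and exist only for this example. Ties between classes are resolved by whichever label `Counter` encounters first.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_query, k=3):
    """Label each query row by majority vote among its k nearest training rows."""
    predictions = []
    for query in x_query:
        # Euclidean distance from this query to every stored training sample
        distances = np.sqrt(((x_train - query) ** 2).sum(axis=1))
        # Indices of the k closest training samples
        nearest = np.argsort(distances)[:k]
        # The most common label among those neighbors wins
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


# Toy data: two well-separated clusters in 2-D
x_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_query = np.array([[0.5, 0.5], [5.5, 5.5]])

print(knn_predict(x_train, y_train, x_query, k=3))  # → [0 1]
```

All the work happens at prediction time: "fitting" amounts to keeping `x_train` and `y_train` around.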

Complete code

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier


def knniris():
    """
    Classify iris flowers with k-nearest neighbors.
    :return: None
    """
    # Load the data with the loader and store it in the variable iris
    iris = load_iris()
    # Check the size of the data
    print(iris.data.shape)
    # Read the dataset description (a good habit)
    print(iris.DESCR)
    # Split the dataset: 75% for training, 25% for testing
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=42)
    # Standardize: fit the scaler on the training set only, then apply it to the test set
    ss = StandardScaler()
    x_train = ss.fit_transform(x_train)
    x_test = ss.transform(x_test)
    # Create the estimator
    knn = KNeighborsClassifier()
    # Fit the model
    knn.fit(x_train, y_train)
    # Predict on the test set and compute the accuracy
    y_predict = knn.predict(x_test)
    score = knn.score(x_test, y_test)
    # Grid search over candidate values of n_neighbors
    param = {"n_neighbors": [3, 5, 7]}
    gs = GridSearchCV(knn, param_grid=param, cv=10)
    # Fit with 10-fold cross-validation
    gs.fit(x_train, y_train)
    # print(gs)
    # Accuracy of the best model on the test set
    print(gs.score(x_test, y_test))
    # Per-class precision and recall (needs: from sklearn.metrics import classification_report)
    # print("Precision and recall per class:", classification_report(y_test, y_predict, target_names=iris.target_names))
    return None


if __name__ == "__main__":
    knniris()
Output reported by the author (final line only; the dataset shape and description are printed first):
1.0
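
As a possible refinement, the scaler and the classifier can be wrapped in a Pipeline so that each cross-validation fold re-fits the scaler on its own training portion only. This is a sketch of that variation rather than the article's code. The step names "scale" and "knn" are arbitrary labels chosen here, and the exact scores depend on the split and on the library version, so none are quoted.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

# Scaling and classification as one estimator; the raw features go in
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"knn__n_neighbors": [3, 5, 7]}
gs = GridSearchCV(pipe, param_grid=param_grid, cv=10)
gs.fit(x_train, y_train)

best_k = gs.best_params_["knn__n_neighbors"]  # value of k chosen by cross-validation
cv_accuracy = gs.best_score_                  # mean cross-validated accuracy for that k
test_accuracy = gs.score(x_test, y_test)      # accuracy on the held-out test set
print(best_k, round(cv_accuracy, 3), round(test_accuracy, 3))
```

Inspecting `best_params_` and `best_score_` alongside the test accuracy shows which value of k was selected and how the cross-validated estimate compares with the held-out result.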