Dataset

The dataset contains a variety of car-related attributes, such as engine displacement, weight, and acceleration, which we will use to predict where a car comes from: North America, Europe, or Asia. Unlike the earlier binary classification problems, this one has three class labels.

  • The data comes as a plain .txt file rather than a CSV: there is no header row of column names, only the data itself. We therefore use the generic read_table() function to load it and supply the column names ourselves. The columns, in order, are:

mpg – Miles per gallon, Continuous.
cylinders – Number of cylinders in the motor, Integer, Ordinal, and Categorical.
displacement – Size of the motor, Continuous.
horsepower – Horsepower produced, Continuous.
weight – Weights of the car, Continuous.
acceleration – Acceleration, Continuous.
year – Year the car was built, Integer and Categorical.
origin – 1=North America, 2=Europe, 3=Asia. Integer and Categorical
car_name – Name of the Car, will not be needed in this analysis.

  • After reading the data with read_table(), the returned auto is a DataFrame object:
import pandas
import numpy as np

# Filename
auto_file = "auto.txt"

# Column names, not included in file
names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 
         'year', 'origin', 'car_name']

# Read in file 
# Delimited by an arbitrary number of whitespaces 
auto = pandas.read_table(auto_file, delim_whitespace=True, names=names)

# Show the first 5 rows of the dataset
print(auto.head())
'''
  mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0   18          8           307      130.0    3504          12.0    70   
1   15          8           350      165.0    3693          11.5    70   
2   18          8           318      150.0    3436          11.0    70   
3   16          8           304      150.0    3433          12.0    70   
4   17          8           302      140.0    3449          10.5    70   

   origin                   car_name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino  
'''
print(auto.describe())
'''
              mpg   cylinders  displacement       weight  acceleration  \
count  398.000000  398.000000    398.000000   398.000000    398.000000   
mean    23.514573    5.454774    193.425879  2970.424623     15.568090   
std      7.815984    1.701004    104.269838   846.841774      2.757689   
min      9.000000    3.000000     68.000000  1613.000000      8.000000   
25%     17.500000    4.000000    104.250000  2223.750000     13.825000   
50%     23.000000    4.000000    148.500000  2803.500000     15.500000   
75%     29.000000    8.000000    262.000000  3608.000000     17.175000   
max     46.600000    8.000000    455.000000  5140.000000     24.800000   

             year      origin  
count  398.000000  398.000000  
mean    76.010050    1.572864  
std      3.697627    0.802055  
min     70.000000    1.000000  
25%     73.000000    1.000000  
50%     76.000000    1.000000  
75%     79.000000    2.000000  
max     82.000000    3.000000  
'''

Clean Dataset

  • auto contains missing values and an irrelevant column, so the data needs cleaning first. car_name carries no predictive information, and horsepower did not appear in the describe() summary above, which hints at missing values; inspecting the data confirms that missing entries are marked with '?'. (A quick check below shows where the marker appears.)
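A minimal check, run on the auto frame as loaded above, shows which columns carry the '?' marker:

# Count '?' markers per column; numeric columns simply compare unequal and report 0
print((auto == "?").sum())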
# Delete the irrelevant car_name column
del auto["car_name"]
# Remove rows where horsepower is missing (marked with '?')
auto = auto[auto["horsepower"] != '?']
# horsepower was read as strings because of the '?' markers; convert it to numeric
auto["horsepower"] = auto["horsepower"].astype(float)

Categorical Variables

The attributes mix continuous and categorical data; the class label {1, 2, 3}, for example, is categorical. When a feature itself is categorical we have to be careful. Suppose we predict a ball's size from its colour, with colours {red, green, blue}. Simply coding red = 1, green = 2, blue = 3 is wrong, because it implies that green is "twice" red, a relationship that does not exist in the original data. Instead we define two new attributes, {red, green}: a red ball becomes [1, 0], a green ball [0, 1], and a blue ball [0, 0]. These are called dummy variables and are widely used in practice.
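As a concrete illustration, pandas can build such columns directly; a minimal sketch with a made-up colour column (the dropped reference level may differ from the red/green choice above, which does not change the information encoded):

import pandas

# A toy categorical feature: ball colour
balls = pandas.DataFrame({"color": ["red", "green", "blue", "green"]})

# One 0/1 column per colour, dropping one level so the remaining columns fully encode the variable
print(pandas.get_dummies(balls["color"], prefix="color", drop_first=True))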

Using Dummy Variables

In this example cylinders, year and origin are categorical (enumerated) variables, so cylinders and year cannot be fed into the model as they stand. year is the model year of the car; since we have no reason to treat it as an arbitrary numeric quantity, it is safer to regard it as categorical, which means dummy variables are needed. cylinders, for instance, takes 5 values {3, 4, 5, 6, 8}, so we create 4 dummy variables:

  • cylinders_3 – Does the car have 3 cylinders? either a 0 or a 1
  • cylinders_4 – Does the car have 4 cylinders? either a 0 or a 1
  • cylinders_5 – Does the car have 5 cylinders? either a 0 or a 1
  • cylinders_6 – Does the car have 6 cylinders? either a 0 or a 1
  • No dummy is created for 8 cylinders, because that case is already encoded by the other four variables as [0, 0, 0, 0].
# Build dummy variables from a column of categorical values
def create_dummies(var):
    # Get the unique values of the column and sort them
    var_unique = var.unique()
    var_unique.sort()

    dummy = pandas.DataFrame()

    # The last value does not get a dummy variable
    for val in var_unique[:-1]:
        # d is a boolean Series, e.g. True for every row with 3 cylinders
        d = var == val
        # Name the dummy after the original column, e.g. cylinders_3;
        # astype(int) turns True/False into 1/0
        dummy[var.name + "_" + str(val)] = d.astype(int)

    # Return a DataFrame holding the dummy variables
    return dummy

# lets make a copy of our auto dataframe to modify with dummy variables
modified_auto = auto.copy()

# make dummy variables from the cylinder categories
cylinder_dummies = create_dummies(modified_auto["cylinders"])

# merge dummy variables into our dataframe
modified_auto = pandas.concat([modified_auto, cylinder_dummies], axis=1)

# delete cylinders column as we have now explained it with dummy variables
del modified_auto["cylinders"]

# make dummy variables from the year categories
year_dummies = create_dummies(modified_auto["year"])

# merge dummy variables into our dataframe
modified_auto = pandas.concat([modified_auto, year_dummies], axis=1)

# delete year column as we have now explained it with dummy variables
del modified_auto["year"]

Multiclass Classification

Multiclass classification techniques fall into the following categories:

  • One-versus-rest (1-v-r). During training, the samples of one class are treated as the positive class and all remaining samples as the negative class, which turns the multiclass problem into a binary one. With k classes this yields k classifiers. At prediction time an unknown sample is assigned to the class whose classifier produces the largest decision value.

Suppose there are four classes (labels) to separate: A, B, C and D. When building the training sets, we take the vectors of A as positives and those of B, C, D as negatives; then B as positives and A, C, D as negatives; then C as positives and A, B, D as negatives; then D as positives and A, B, C as negatives. Training on these four sets produces four models. At test time a sample is scored by all four models, giving f1(x), f2(x), f3(x), f4(x), and the final prediction is the class with the largest of these values.
P.S.: a drawback of this approach is that each training set is imbalanced (roughly 1:M positives to negatives), which can bias the classifiers.

  • One-versus-one (1-v-1). A binary classifier is built for every pair of classes, so k classes require k(k-1)/2 classifiers. To classify an unknown sample, all classifiers vote and the class collecting the most votes wins.
  • Split the dataset into a training set and a test set:
# Get all columns to be used as features; remove 'origin' because it is the class label
features = np.delete(modified_auto.columns, modified_auto.columns == 'origin')

# shuffle data
shuffled_rows = np.random.permutation(modified_auto.index)

# Select 70% of the dataset to be training data
highest_train_row = int(modified_auto.shape[0] * .70)
# Take the first 70% of the shuffled rows as the training set
train = modified_auto.loc[shuffled_rows[:highest_train_row], :]

# Select 30% of the dataset to be test data
test = modified_auto.loc[shuffled_rows[highest_train_row:], :]
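The same 70/30 split can also be produced with scikit-learn's helper; a minimal sketch (random_state is an arbitrary seed added here only for reproducibility):

from sklearn.model_selection import train_test_split

# Shuffle and split in one call: 70% training, 30% test
train, test = train_test_split(modified_auto, train_size=0.7, random_state=0)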

Training A Multiclass Logistic Regression

from sklearn.linear_model import LogisticRegression

# find the unique origins
unique_origins = modified_auto["origin"].unique()
unique_origins.sort()
# Three classes, so the one-versus-rest scheme needs three models
models = {}

for origin in unique_origins:
    models[origin] = LogisticRegression()

    X_train = train[features]
    # Relabel the training targets on each pass: 1 for the current class, 0 for every other class
    y_train = (train["origin"] == origin).astype(int)

    models[origin].fit(X_train, y_train)

# testing_probs collects the predicted probabilities from each classifier
testing_probs = pandas.DataFrame(columns=unique_origins)
for origin in unique_origins:
    X_test = test[features]   
    # Each test sample is scored by all three models; predict_proba column [:, 1] is the probability of the positive class (1), column [:, 0] of the negative class (0)
    testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]

Choose The Origin

  • For each test sample, find the column label with the largest probability:
predicted_origins = testing_probs.idxmax(axis=1)
'''
Series (<class 'pandas.core.series.Series'>)
0     1
1     1
2     1
3     1
4     1
'''
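For comparison, scikit-learn bundles the same one-versus-rest scheme, so the three-model loop and the idxmax step above can be reproduced in a few lines; a minimal sketch, assuming train, test and features from above:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Fit one logistic regression per class and pick the highest-scoring class at prediction time
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(train[features], train["origin"])
predicted_ovr = ovr.predict(test[features])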

Confusion Matrix

The code below fills a confusion matrix whose rows are the predicted origins and whose columns are the observed origins; cell (pred, obs) counts the test samples predicted as pred whose true origin is obs.

# Remove pandas indices
predicted_origins = predicted_origins.values
origins_observed = test['origin'].values

# fill in this confusion matrix
confusion = pandas.DataFrame(np.zeros(shape=(unique_origins.shape[0], unique_origins.shape[0]), dtype=int), 
                             index=unique_origins, columns=unique_origins)
# Each unique prediction
for pred in unique_origins:
    # Each unique observation
    for obs in unique_origins:
        # Check if pred was predicted
        t_pred = predicted_origins == pred
        # Check if obs was observed
        t_obs = origins_observed == obs
        # True if both pred and obs 
        t = (t_pred & t_obs)
        # Count of the number of observations with pred and obs
        confusion.loc[pred, obs] = sum(t)
print(confusion)
'''
    1   2   3
1  74   5   0
2   0  13   0
3   0   1  25
'''
  • The earlier post on K-means clustering of U.S. congressional party affiliation (美国议员党派——K均值聚类) introduced a simpler way to build this table; the result is identical, only the details differ:
# predicted_origins and origins_observed are already plain numpy arrays (converted above)
# Build the same confusion matrix with a single call
confusion = pandas.crosstab(predicted_origins, origins_observed)
print(confusion)
'''
col_0   1   2   3
row_0            
1      74   5   0
2       0  13   0
3       0   1  25
'''
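sklearn.metrics.confusion_matrix produces the same counts; note that its convention is rows = observed (true) labels and columns = predicted labels, i.e. the transpose of the tables above. A minimal sketch:

from sklearn.metrics import confusion_matrix

# Rows are the true origins, columns the predicted origins (transposed relative to the table above)
print(confusion_matrix(test["origin"], predicted_origins, labels=unique_origins))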

Confusion Matrix Cont

  • Viewed from class 1, the confusion matrix is:
'''
col_0   1   0 
row_0            
1      74   5  
0      0    39
'''
  • Viewed from class 2, the confusion matrix is:
'''
col_0   2   0 
row_0            
2      13   0  
0      6    99
'''
  • Viewed from class 3, the confusion matrix is:
'''
col_0   3   0 
row_0            
3      25   1  
0       0  92
'''
  • Compute the false positives (FP) for class 2, i.e. samples predicted as 2 whose true origin is 1 or 3:
fp2 = confusion.loc[2, [1, 3]].sum()
print(fp2)
'''
0
'''

Average Accuracy

  • The following formula computes the average accuracy of a multiclass problem, where l is the number of classes:

$\text{Average Accuracy} = \frac{1}{l}\sum_{i=1}^{l}\frac{TP_i + TN_i}{TP_i + FN_i + FP_i + TN_i}$

# The confusion DataFrame is in memory
# The total number of observations in the test set
n = test.shape[0]
# Variable to store true predictions
sumacc = 0
# Loop over each origin
for i in confusion.index:
    # True Positives
    tp = confusion.loc[i, i]
    # True negatives
    # i.e. the sum of every cell outside row i and column i
    tn = confusion.loc[unique_origins[unique_origins != i], unique_origins[unique_origins != i]]
    # Add the sums
    sumacc += tp + tn.sum().sum()

# Compute average accuracy
denominator = n*unique_origins.shape[0]
avgacc = sumacc/denominator
'''
avgacc :0.96610169491525422
'''

Precision And Recall

  • Precision for a multiclass problem:

    $\text{Precision} = \frac{1}{l}\sum_{i=1}^{l}\frac{TP_i}{TP_i + FP_i}$

  • Recall for a multiclass problem:

    $\text{Recall} = \frac{1}{l}\sum_{i=1}^{l}\frac{TP_i}{TP_i + FN_i}$

# Variable to add all precisions
ps = 0
# Loop through each origin (class)
for j in confusion.index:
    # True positives
    tps = confusion.loc[j, j]
    # Positively predicted for that origin 
    positives = confusion.loc[j, :].sum()
    # Add to precision
    ps += tps/positives

# divide ps by the number of classes to get precision 
precision = ps/confusion.shape[0]
print('Precision = {0}'.format(precision))
'''
Precision = 0.9660824407659852
'''

rcs = 0
for j in confusion.index:
    # Current number of true positives
    tps = confusion.loc[j, j]
    # True positives and false negatives
    origin_count = confusion.loc[:, j].sum()
    # Add recall
    rcs += tps/origin_count

# Compute recall
recall = rcs/confusion.shape[0]
'''
0.89473684210526316
'''

F-Score

The post on bank credit-card approval and model evaluation with ROC & AUC (银行信用卡批准——模型评估ROC&AUC) plotted the relationship between precision and recall: as recall rises, precision tends to fall, yet we want both to be as large as possible. We therefore need a single measure that balances the two, which is what the F measure provides. F ranges from 0 to 1, and F = 1 corresponds to a perfect model. The formulas are as follows:

  • Compute an F value for each class i:

    $F_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$

  • Then compute the overall F value as the mean of the per-class scores:

    $F = \frac{1}{l}\sum_{i=1}^{l} F_i$

# List to collect the per-class F scores
scores = []
# Loop through each origin (class)
for j in confusion.index:
    # True positives
    tps = confusion.loc[j, j]
    # Positively predicted for that origin 
    positives = confusion.loc[j, :].sum()
    # True positives and false negatives
    origin_count = confusion.loc[:, j].sum()
    # Compute precision
    precision = tps / positives
    # Compute recall
    recall = tps / origin_count
    # Append F_i score
    fi = 2*precision*recall / (precision + recall)
    scores.append(fi)
fscore = np.mean(scores)
'''
fscore : 0.92007080610021796
'''

Metrics With Sklearn

Up to now these metrics have been computed by hand, but sklearn provides built-in functions for them, such as precision_score, recall_score and f1_score. Each takes two required arguments, the true labels and the predicted labels, followed by optional parameters; the one to pay attention to is average:

  • average='micro': compute the metric globally from the total counts of true positives, false positives and false negatives.
  • average='macro': compute the metric for each class and take the unweighted mean (this is what the manual calculations above did).
  • average='weighted': compute the metric for each class and take the mean weighted by each class's support (its number of true instances).

# Import metric functions from sklearn
from sklearn.metrics import precision_score, recall_score, f1_score

# Compute precision score with micro averaging
pr_micro = precision_score(test["origin"], predicted_origins, average='micro')
pr_weighted = precision_score(test["origin"], predicted_origins, average='weighted')
rc_weighted = recall_score(test["origin"], predicted_origins, average='weighted')
f_weighted = f1_score(test["origin"], predicted_origins, average='weighted')
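classification_report collects the per-class precision, recall and F1 together with their macro and weighted averages in a single call; a minimal sketch:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 plus macro and weighted averages in one report
print(classification_report(test["origin"], predicted_origins))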