Table of Contents
- Data import and preliminary analysis
- CSV import and overview
- Initial exploratory analysis
- Pclass
- Sex
- Name
- SibSp
- Parch
- Embarked
- Fare
- Visual analysis
- Age
- Age & Sex
- Pclass & Age
- Pclass & Sex & Embarked
- Embarked & Sex & Fare
- Data wrangling
- PassengerId
- Title
- Converting categorical values to numbers for modeling and prediction
- Title -> 123
- drop Name
- Sex -> 123
- Age & Sex & Pclass
- Band_Age
- Families
- Alone
- drop SibSp, Parch, Families
- Age x Pclass
- Embarked->123
- Fare
- Band_Fare
- Fare->123
- drop Band_Fare
- Correlation heatmap of all variables
- Predictive models
- Logistic regression
- Support vector machines
- k-nearest neighbors
- Naive Bayes classifier
- Perceptron
- Linear SVC
- Stochastic gradient descent
- Decision tree
- Random forest
- Model accuracy summary
- Open issues
- Reference solutions on Kaggle
The sinking of the Titanic is one of history's most famous maritime disasters; thousands of passengers were aboard. Their personal backgrounds varied widely, and so did their fates. While the loss of life was tragic, the disaster also left behind a rich data sample for analysis and prediction.
The problem statement and datasets come from https://www.kaggle.com/c/titanic. We analyze the given training set and use it to predict the survival of the passengers in the test set. This post is a summary of practice notes; errors are inevitable and will be corrected and improved over time.
Variable descriptions:
Survival (survived or not): 0 = No, 1 = Yes
Pclass (ticket class): a proxy for socio-economic status, 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)
Sex: male/female
Age: fractional if less than 1; an age greater than 1 with a fractional part is an estimate
Sibsp: number of siblings/spouses aboard
Parch: number of parents/children aboard; some children traveled only with a nanny, so their Parch = 0
ticket: ticket number
Fare: passenger fare
Cabin: cabin number
Embarked (port of embarkation): C = Cherbourg (France, the second port of call), Q = Queenstown (Ireland, now Cobh, the last port before the crossing), S = Southampton (England, the port of departure)
Data import and preliminary analysis
Load the training and test sets from CSV and take a first look at each field.
CSV import and overview
Import the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
# Load the training and test sets
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
# Keep both sets in one list so transformations can be applied to each
data_combined = [data_train, data_test]
# Print the key information of each dataset
for d in data_combined:
    print('Head and tail of the table:')
    print(d.head())
    print('_'*80)
    print(d.tail())
    print('Summary statistics for numeric columns:')
    print(d.describe())
    print('_'*80)
    print('Summary statistics for categorical columns:')
    print(d.describe(include=['O']))
    print('_'*80)
    print('Table info:')
    print(d.info())
    print('='*80)
# Print missing-value counts for Age and Cabin
for d in data_combined:
    print('Age has {0} missing values ({1} of the column)'.format(d.Age.isnull().sum(), d.Age.isnull().sum()/len(d.Age)))
    print('Cabin has {0} missing values ({1} of the column)'.format(d.Cabin.isnull().sum(), d.Cabin.isnull().sum()/len(d.Cabin)))
    print('_'*60)
The full output of this step is too long to reproduce here; below is the result of the missing-value check for Age and Cabin:
Age has 177 missing values (0.19865319865319866 of the column)
Cabin has 687 missing values (0.7710437710437711 of the column)
Age has 86 missing values (0.20574162679425836 of the column)
Cabin has 327 missing values (0.7822966507177034 of the column)
- Age and Cabin have a large share of missing values
- Embarked (training set) and Fare (test set) have a few missing values
- Non-numeric columns such as Name, Sex, Ticket, Cabin and Embarked need to be converted to numbers for analysis
Initial exploratory analysis
Take a first look at each field.
Pclass
# Group by Pclass to get a quick view of its relation to Survived
data_train[['Pclass', 'Survived']].groupby(by=['Pclass'], as_index=False).mean().sort_values(by=['Survived'], ascending=False)
Pclass | Survived | |
0 | 1 | 0.629630 |
1 | 2 | 0.472826 |
2 | 3 | 0.242363 |
- At first glance, the higher the ticket class, the higher the survival rate
Sex
# Group by Sex to get a quick view of its relation to Survived
data_train[['Sex', 'Survived']].groupby(by=['Sex'], as_index=False).mean().sort_values(by=['Survived'], ascending=False)
Sex | Survived | |
0 | female | 0.742038 |
1 | male | 0.188908 |
- At first glance, women survived at a much higher rate
Name
The Name field can yield features such as name length, initial, and title.
Titles include Mr, Miss, Master, etc., and take the form "xxx."
No processing is done at this stage.
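As a minimal sketch of the title-extraction idea (the names below are made up, in the dataset's "Last, Title. First" format):

```python
import pandas as pd

# Hypothetical names in the dataset's "Last, Title. First" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Hewlett, Mrs. (Mary D Kingcome)',
])

# A title follows a space and ends with a period, e.g. "Mr." or "Miss."
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Mrs']
```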
SibSp
# Group by SibSp to get a quick view of its relation to survival
data_train[['SibSp', 'Survived']].groupby(by=['SibSp'], as_index=False).mean().sort_values(by=['Survived'], ascending=False)
SibSp | Survived | |
1 | 1 | 0.535885 |
2 | 2 | 0.464286 |
0 | 0 | 0.345395 |
3 | 3 | 0.250000 |
4 | 4 | 0.166667 |
5 | 5 | 0.000000 |
6 | 8 | 0.000000 |
- The fewer siblings/spouses aboard, the higher the survival rate
- Passengers with none aboard had a slightly above-average rate
Parch
# Group by Parch to get a quick view of its relation to survival
data_train[['Parch', 'Survived']].groupby(by=['Parch'], as_index=False).mean().sort_values(by=['Survived'], ascending=False)
Parch | Survived | |
3 | 3 | 0.600000 |
1 | 1 | 0.550847 |
2 | 2 | 0.500000 |
0 | 0 | 0.343658 |
5 | 5 | 0.200000 |
4 | 4 | 0.000000 |
6 | 6 | 0.000000 |
- Passengers with few parents/children aboard (1 to 3) survived at higher rates than those with many (4 to 6)
- Passengers with none aboard had a middling rate
Embarked
# Group by Embarked to get a quick view of its relation to survival
data_train[['Embarked', 'Survived']].groupby(by=['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Embarked | Survived | |
0 | C | 0.553571 |
1 | Q | 0.389610 |
2 | S | 0.336957 |
- Passengers who embarked at C survived at a rate above one half; S passengers at roughly one third; Q passengers sit in between at just under 40%
Fare
# Summary statistics for records with Fare > 0
data_train.Fare[data_train.Fare > 0].describe()
count 876.000000
mean 32.755650
std 49.936826
min 4.012500
25% 7.925000
50% 14.500000
75% 31.275000
max 512.329200
Name: Fare, dtype: float64
- Some fares are recorded as 0; whether these were specially arranged tickets is unclear, so they are excluded from this summary
- Fares vary widely (roughly 4 to 512), likely tied to the port of embarkation and the passenger's socio-economic status
- Such a large spread could distort survival counts and rates, so fares may need to be split into ranges before further statistics
Visual analysis
FacetGrid draws univariate distributions or multivariate relationships across a dataset; its three dimensions are row, col and hue, i.e. the two axes plus color.
Age
# Plot the age distributions of survivors and victims separately
fg = sns.FacetGrid(data_train, col='Survived', size=3, aspect=1.5)
fg.map(plt.hist, 'Age', bins=20)
<seaborn.axisgrid.FacetGrid at 0x17137208>
- Most passengers were young to middle-aged adults (15~35)
- Underage passengers survived at a comparatively high rate overall
- Very many passengers aged 15~30 perished
Age & Sex
Examine survival against Age and Sex together.
# Plot an age-vs-survival regression for each sex
generations = [10, 20, 40, 60, 80]
sns.lmplot('Age', 'Survived',data_train, hue='Sex', x_bins=generations)
D:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1633: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
<seaborn.axisgrid.FacetGrid at 0x17162048>
- Women survived at a higher rate than men
- For women, survival rose with age; for men, it fell
Below is the same plot with a different age banding, [15, 30, 40, 80], which may show the trend more clearly:
Pclass & Age
# Plot the age distributions of survivors and victims for each ticket class
fg = sns.FacetGrid(data_train, row='Pclass', col='Survived', size=3, aspect=1.5)
fg.map(plt.hist,'Age', bins=20)
fg.add_legend()
<seaborn.axisgrid.FacetGrid at 0x174f3128>
- First-class passengers survived at a rate above 50%; second class was close to 50% with a mid-sized passenger count (slightly below first class); third class had the most passengers and more than half perished
- Young passengers in first and second class almost all survived
- Because the age distribution shifts with Pclass, the pair is useful for model training
Below is a regression plot of survival against age band ([15, 30, 40, 80]) for each ticket class, which may be more intuitive:
- The higher the ticket class, the higher the survival rate
- Within each ticket class, survival falls with age
Pclass & Sex & Embarked
# Point plot of survival vs. ticket class per port of embarkation, split by sex
fg = sns.FacetGrid(data_train, col='Embarked', size=3, aspect=1.5)
fg.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='dark')
fg.add_legend()
<seaborn.axisgrid.FacetGrid at 0x1a812278>
- Women who embarked at S and Q survived at a much higher rate than men; men who embarked at C were more likely to survive than women
Embarked & Sex & Fare
# Bar chart of fares by sex for survivors and victims at each port
fg = sns.FacetGrid(data_train, row='Embarked', col='Survived', size=3, aspect=1.5)
fg.map(sns.barplot, 'Sex', 'Fare', alpha=0.75, ci=None)
fg.add_legend()
<seaborn.axisgrid.FacetGrid at 0x1c4c5588>
- Fares range widely, from about 4 to over 512, so raw fares cannot be used directly to tally survival; later analysis may need fare ranges
- At C and S, survivors paid far higher average fares than victims; at Q, surviving men paid slightly more while perishing women paid slightly more
Data wrangling
PassengerId
# PassengerId is only an index column with no statistical meaning, so drop it
data_train = data_train.drop(['PassengerId'], axis=1)
# Ticket and Cabin show no obvious correlation with survival
data_train = data_train.drop(['Ticket', 'Cabin'], axis=1)
data_test = data_test.drop(['Ticket', 'Cabin'], axis=1)
# Print the change in dataset dimensions
print('Dataset dimensions:\n\tbefore:\n\t\ttrain:\t{0}\ttest:\t{1}\n\tafter:\n\t\ttrain:\t{2}\ttest:\t{3}'.format(data_combined[0].shape, data_combined[1].shape, data_train.shape, data_test.shape))
# Rebuild the combined list for further analysis
data_combined = [data_train, data_test]
Dataset dimensions:
before:
train: (891, 12) test: (418, 11)
after:
train: (891, 9) test: (418, 9)
Title
# Extract the salutation from Name into a Title field
for d in data_combined:
    d['Title'] = d.Name.str.extract(' ([A-Z]\w+)\.', expand=False)
pd.crosstab(data_train['Title'], data_train['Sex'])
Sex | female | male |
Title | ||
Capt | 0 | 1 |
Col | 0 | 2 |
Countess | 1 | 0 |
Don | 0 | 1 |
Dr | 1 | 6 |
Jonkheer | 0 | 1 |
Lady | 1 | 0 |
Major | 0 | 2 |
Master | 0 | 40 |
Miss | 182 | 0 |
Mlle | 2 | 0 |
Mme | 1 | 0 |
Mr | 0 | 517 |
Mrs | 125 | 0 |
Ms | 1 | 0 |
Rev | 0 | 6 |
Sir | 0 | 1 |
- Create a Title column from the salutations
- Title gives a rough view of each passenger's socio-economic status:
Basic meanings of the Title values:
Capt.: captain (ship's master or army rank)
Col.: colonel
Countess.: countess
Don.: honorific for men in Italy, Spain, Portugal, Latin America and the Philippines (there should be a matching Dona)
Dr.: physician or doctorate holder
Jonkheer.: Dutch lower nobility
Lady.: noblewoman
Major.: army major
Master.: title for boys and young males (also a merchant ship's captain, a head of a college, etc.)
Miss.: unmarried woman
Mlle.: abbreviation of the French Mademoiselle
Mme.: abbreviation of the French Madame
Mr.: mister
Mrs.: married woman
Ms.: woman, marital status unspecified
Rev.: reverend
Sir.: knight
Mlle, Ms and Miss are equivalent, as are Mme and Mrs; those groups are large enough to carry statistical weight. The other titles are each too rare to matter individually, so they are merged into one category (arguably Don should map to Mr and Dona to Miss).
# Consolidate the Title values
for d in data_combined:
    d.Title = d.Title.replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    d.Title = d.Title.replace('Mlle', 'Miss')
    d.Title = d.Title.replace('Ms', 'Miss')
    d.Title = d.Title.replace('Mme', 'Mrs')
# Group by Title to view its relation to survival
data_train[['Title', 'Survived']].groupby(by=['Title'], as_index=False).mean().sort_values(by=['Survived'], ascending=False)
Title | Survived | |
3 | Mrs | 0.793651 |
1 | Miss | 0.702703 |
0 | Master | 0.575000 |
4 | Rare | 0.347826 |
2 | Mr | 0.156673 |
- Mrs and Miss, the female titles, correspond to high survival rates (about 80% and 70% respectively)
- Master, the title for boys and young males, corresponds to a rate near 60%
- Mr, ordinary adult men, corresponds to barely over 15%
- Rare, the uncommon titles, sits below average at under 35%, but these passengers are few and their titles varied, so the figure may mean little
Converting categorical values to numbers for modeling and prediction
Title -> 123
# Map Title to numeric codes
title_map = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5}
for d in data_combined:
    d.Title = d.Title.map(title_map)
    d.Title = d.Title.fillna(0)
drop Name
# Drop the Name column
data_train = data_train.drop(['Name'], axis=1)
data_test = data_test.drop(['Name'], axis=1)
data_combined = [data_train, data_test]
Sex -> 123
# Map Sex to numeric codes
for d in data_combined:
    d.Sex = d.Sex.map({'male': 1, 'female': 0})
Age & Sex & Pclass
- Age correlates with Sex, Pclass and other variables
# Plot the age distribution of each sex within each ticket class
fg = sns.FacetGrid(data_train, row='Pclass', col='Sex', size=3, aspect=1.5)
fg.map(plt.hist, 'Age', alpha=0.75, bins=20)
fg.add_legend()
<seaborn.axisgrid.FacetGrid at 0x1cca1160>
# Pre-allocate an array for the guessed age of each Sex x Pclass combination
ages_guessed = np.zeros((2, 3))
# Fill missing ages with the median age of the matching sex and ticket class
for d in data_combined:
    for i in range(0, 2):
        for j in range(0, 3):
            age_guessed = d[(d.Sex == i) & (d.Pclass == j+1)].Age.dropna().median()
            # ages_guessed[i, j] = round(age_guessed)
            # truncate after adding 0.25, so that x.5 medians round down
            ages_guessed[i, j] = int(age_guessed + 0.25)
    for i in range(0, 2):
        for j in range(0, 3):
            d.loc[(d.Sex == i) & (d.Pclass == j+1) & d.Age.isnull(), 'Age'] = ages_guessed[i, j]
    d.Age = d.Age.astype(int)
data_train.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
0 | 0 | 3 | 1 | 22 | 1 | 0 | 7.2500 | S | 1 |
1 | 1 | 1 | 0 | 38 | 1 | 0 | 71.2833 | C | 3 |
2 | 1 | 3 | 0 | 26 | 0 | 0 | 7.9250 | S | 2 |
3 | 1 | 1 | 0 | 35 | 1 | 0 | 53.1000 | S | 3 |
4 | 0 | 3 | 1 | 35 | 0 | 0 | 8.0500 | S | 1 |
Band_Age
# Cut Age into 5 bands based on its distribution
data_train['Band_Age'] = pd.cut(data_train.Age, 5)
data_train[['Band_Age', 'Survived']].groupby(by=['Band_Age'], as_index=False).mean().sort_values(by=['Survived'], ascending=False)
Band_Age | Survived | |
0 | (-0.08, 16.0] | 0.550000 |
3 | (48.0, 64.0] | 0.434783 |
2 | (32.0, 48.0] | 0.412037 |
1 | (16.0, 32.0] | 0.337374 |
4 | (64.0, 80.0] | 0.090909 |
- Passengers aged 0~16 had a better-than-even survival rate
- Next come the 48~64, 32~48 and 16~32 bands, with rates from roughly 43% down to 34%
- Passengers aged 64~80 had the lowest rate, under 10%
# Band_Age was only needed to pick the cut points; drop it
data_train = data_train.drop('Band_Age', axis=1)
data_combined = [data_train, data_test]
# Assign each age to its band using the cut boundaries
for d in data_combined:
    d.loc[d.Age <= 16, 'Age'] = 0
    d.loc[(d.Age > 16) & (d.Age <= 32), 'Age'] = 1
    d.loc[(d.Age > 32) & (d.Age <= 48), 'Age'] = 2
    d.loc[(d.Age > 48) & (d.Age <= 64), 'Age'] = 3
    d.loc[d.Age > 64, 'Age'] = 4
# Plot the banded age counts of victims and survivors
fg = sns.FacetGrid(data_train, col='Survived')
fg.map(plt.hist, 'Age', bins=4)
<seaborn.axisgrid.FacetGrid at 0x1cfb0278>
Families
# Families counts relatives aboard: siblings, spouse, parents and children
for d in data_combined:
    d['Families'] = d.SibSp + d.Parch
data_combined = [data_train, data_test]
data_train.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | Families | |
0 | 0 | 3 | 1 | 1 | 1 | 0 | 7.2500 | S | 1 | 1 |
1 | 1 | 1 | 0 | 2 | 1 | 0 | 71.2833 | C | 3 | 1 |
2 | 1 | 3 | 0 | 1 | 0 | 0 | 7.9250 | S | 2 | 0 |
3 | 1 | 1 | 0 | 2 | 1 | 0 | 53.1000 | S | 3 | 1 |
4 | 0 | 3 | 1 | 2 | 0 | 0 | 8.0500 | S | 1 | 0 |
data_combined[0].head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | Families | |
0 | 0 | 3 | 1 | 1 | 1 | 0 | 7.2500 | S | 1 | 1 |
1 | 1 | 1 | 0 | 2 | 1 | 0 | 71.2833 | C | 3 | 1 |
2 | 1 | 3 | 0 | 1 | 0 | 0 | 7.9250 | S | 2 | 0 |
3 | 1 | 1 | 0 | 2 | 1 | 0 | 53.1000 | S | 3 | 1 |
4 | 0 | 3 | 1 | 2 | 0 | 0 | 8.0500 | S | 1 | 0 |
# Relation between Families and Survived
data_train[['Families', 'Survived']].groupby(by=['Families'], as_index=False).mean().sort_values(by=['Survived'], ascending=False)
Families | Survived | |
3 | 3 | 0.724138 |
2 | 2 | 0.578431 |
1 | 1 | 0.552795 |
6 | 6 | 0.333333 |
0 | 0 | 0.303538 |
4 | 4 | 0.200000 |
5 | 5 | 0.136364 |
7 | 7 | 0.000000 |
8 | 10 | 0.000000 |
- Passengers with 1 to 3 family members aboard survived at better-than-even rates, from about 72% down to 55%
- Larger parties fared worse: 6, 4 and 5 members correspond to roughly 1/3, 1/5 and 3/20
- Nobody traveling with 7 or 10 family members survived
- Passengers traveling alone survived at just under 1/3, which motivates an Alone flag
Alone
# Create the Alone flag
for d in data_combined:
    d['Alone'] = 0
    d.loc[d.Families == 0, 'Alone'] = 1
data_train.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | Families | Alone | |
0 | 0 | 3 | 1 | 1 | 1 | 0 | 7.2500 | S | 1 | 1 | 0 |
1 | 1 | 1 | 0 | 2 | 1 | 0 | 71.2833 | C | 3 | 1 | 0 |
2 | 1 | 3 | 0 | 1 | 0 | 0 | 7.9250 | S | 2 | 0 | 1 |
3 | 1 | 1 | 0 | 2 | 1 | 0 | 53.1000 | S | 3 | 1 | 0 |
4 | 0 | 3 | 1 | 2 | 0 | 0 | 8.0500 | S | 1 | 0 | 1 |
# Survival rate grouped by whether the passenger traveled alone
data_train[['Alone', 'Survived']].groupby(by=['Alone'], as_index=False).mean()
Alone | Survived | |
0 | 0 | 0.505650 |
1 | 1 | 0.303538 |
drop SibSp, Parch, Families
# Drop SibSp, Parch and Families in favor of the Alone flag
data_train = data_train.drop(['SibSp', 'Parch', 'Families'], axis=1)
data_test = data_test.drop(['SibSp', 'Parch', 'Families'], axis=1)
data_combined = [data_train, data_test]
data_train.head()
Survived | Pclass | Sex | Age | Fare | Embarked | Title | Alone | |
0 | 0 | 3 | 1 | 1 | 7.2500 | S | 1 | 0 |
1 | 1 | 1 | 0 | 2 | 71.2833 | C | 3 | 0 |
2 | 1 | 3 | 0 | 1 | 7.9250 | S | 2 | 1 |
3 | 1 | 1 | 0 | 2 | 53.1000 | S | 3 | 0 |
4 | 0 | 3 | 1 | 2 | 8.0500 | S | 1 | 1 |
Age x Pclass
# Create an Age*Pclass interaction feature
for d in data_combined:
    d['Age*Pclass'] = d.Age * d.Pclass
data_train[['Age', 'Pclass', 'Age*Pclass', 'Survived']].head()
Age | Pclass | Age*Pclass | Survived | |
0 | 1 | 3 | 3 | 0 |
1 | 2 | 1 | 2 | 1 |
2 | 1 | 3 | 3 | 1 |
3 | 2 | 1 | 2 | 1 |
4 | 2 | 3 | 6 | 0 |
Embarked->123
print('Embarked has {} missing values in the training set'.format(data_train.Embarked.isnull().sum()))
Embarked has 2 missing values in the training set
# Fill the few missing Embarked values with the mode
for d in data_combined:
    d.Embarked = d.Embarked.fillna(data_train.Embarked.dropna().mode()[0])
# Group by port of embarkation to view its relation to survival
data_train[['Embarked', 'Survived']].groupby(by=['Embarked'], as_index=False).mean().sort_values(by=['Survived'], ascending=False)
Embarked | Survived | |
0 | C | 0.553571 |
1 | Q | 0.389610 |
2 | S | 0.339009 |
# Map the port of embarkation to numeric codes
for d in data_combined:
    d.Embarked = d.Embarked.map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
Fare
print('Fare has {} missing value(s) in the test set'.format(data_test.Fare.isnull().sum()))
Fare has 1 missing value(s) in the test set
# Fill the single missing Fare with the median
for d in data_combined:
    d.Fare.fillna(d.Fare.dropna().median(), inplace=True)
Band_Fare
# Plot the fare distributions of victims and survivors
fg = sns.FacetGrid(data_train, col='Survived', size=3, aspect=1.5)
fg.map(plt.hist, 'Fare', bins=10)
<seaborn.axisgrid.FacetGrid at 0x1cfddef0>
# pandas.qcut bins by equal sample counts; pandas.cut bins by equal value intervals
data_train['Band_Fare'] = pd.qcut(data_train.Fare, 4)
data_train[['Band_Fare', 'Survived']].groupby(by=['Band_Fare'], as_index=False).mean().sort_values(by=['Survived'], ascending=False)
Band_Fare | Survived | |
3 | (31.0, 512.329] | 0.581081 |
2 | (14.454, 31.0] | 0.454955 |
1 | (7.91, 14.454] | 0.303571 |
0 | (-0.001, 7.91] | 0.197309 |
- The higher the fare band, the higher the survival rate
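To illustrate the earlier comment contrasting pd.qcut and pd.cut, here is a toy sketch (the fare values below are made up):

```python
import pandas as pd

# Skewed toy "fares": mostly small values plus one large outlier
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# cut makes equal-width bins, so the outlier pushes almost everything into bin 0
width_bins = pd.cut(fares, 2, labels=False)
# qcut makes equal-frequency bins, so each bin holds half the samples
freq_bins = pd.qcut(fares, 2, labels=False)

print(width_bins.tolist())  # [0, 0, 0, 0, 0, 0, 0, 1]
print(freq_bins.tolist())   # [0, 0, 0, 0, 1, 1, 1, 1]
```

This is why qcut suits a heavily skewed variable like Fare: every band keeps a comparable sample size.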
Fare->123
# Assign each fare to its quartile band
for d in data_combined:
    d.loc[d.Fare <= 7.91, 'Fare'] = 0
    d.loc[(d.Fare > 7.91) & (d.Fare <= 14.454), 'Fare'] = 1
    d.loc[(d.Fare > 14.454) & (d.Fare <= 31.0), 'Fare'] = 2
    d.loc[d.Fare > 31.0, 'Fare'] = 3
    d.Fare = d.Fare.astype(int)
drop Band_Fare
data_train = data_train.drop(['Band_Fare'], axis=1)
data_train.head()
Survived | Pclass | Sex | Age | Fare | Embarked | Title | Alone | Age*Pclass | |
0 | 0 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 3 |
1 | 1 | 1 | 0 | 2 | 3 | 1 | 3 | 0 | 2 |
2 | 1 | 3 | 0 | 1 | 1 | 0 | 2 | 1 | 3 |
3 | 1 | 1 | 0 | 2 | 3 | 0 | 3 | 0 | 2 |
4 | 0 | 3 | 1 | 2 | 1 | 0 | 1 | 1 | 6 |
data_test.head()
PassengerId | Pclass | Sex | Age | Fare | Embarked | Title | Alone | Age*Pclass | |
0 | 892 | 3 | 1 | 2 | 0 | 2 | 1 | 1 | 6 |
1 | 893 | 3 | 0 | 2 | 0 | 0 | 3 | 0 | 6 |
2 | 894 | 2 | 1 | 3 | 1 | 2 | 1 | 1 | 6 |
3 | 895 | 3 | 1 | 1 | 1 | 0 | 1 | 1 | 3 |
4 | 896 | 3 | 0 | 1 | 1 | 0 | 3 | 0 | 3 |
Correlation heatmap of all variables
# A quick look at pairwise correlations
colormap = plt.cm.jet
plt.figure(figsize=(10, 10))
sns.heatmap(data_train.astype(float).corr(), linewidth=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1e359da0>
Predictive models
The training set includes the survival outcome and each test record must be assigned a class, so this is a classification task under supervised learning.
Candidate models:
+ Logistic Regression
+ k-Nearest Neighbors
+ Support Vector Machines
+ Naive Bayes classifier
+ Decision Tree
+ Random Forest
+ Perceptron
+ Linear SVC
+ Stochastic Gradient Descent
+ Artificial neural network
+ Relevance Vector Machine
x_train = data_train.drop(['Survived'], axis=1)
y_train = data_train.Survived
x_test = data_test.drop(['PassengerId'], axis=1)
x_train.shape, y_train.shape, x_test.shape
((891, 8), (891,), (418, 8))
Logistic regression
Logistic regression fits when the dependent variable is categorical (here 0 or 1 for survival); it measures the relationship between the dependent variable and one or more independent variables via estimated probabilities.
# Instantiate the logistic regression model
lgrg = LogisticRegression()
# Fit on the training set
lgrg.fit(x_train, y_train)
# Predict on the test set
y_predict = lgrg.predict(x_test)
# Estimate accuracy on the training set
acc_lgrg = round(lgrg.score(x_train, y_train) * 100, 2)
print('Logistic Regression:\n\tAccuracy:\t{}'.format(acc_lgrg))
Logistic Regression:
Accuracy: 80.92
corr_ftr = pd.DataFrame(x_train.columns)
corr_ftr.columns = ['Features']
corr_ftr['Correlations'] = pd.Series(lgrg.coef_[0])
corr_ftr.sort_values(by=['Correlations'], ascending=False)
Features | Correlations | |
5 | Title | 0.440642 |
6 | Alone | 0.377805 |
4 | Embarked | 0.292706 |
3 | Fare | 0.061489 |
7 | Age*Pclass | -0.138891 |
2 | Age | -0.217198 |
0 | Pclass | -0.885901 |
1 | Sex | -2.119842 |
- Sex (male=1, female=0) has the strongest negative coefficient, matching the earlier observation that men survived at lower rates; Pclass behaves similarly
- Among positive coefficients, Title, Alone and Embarked correlate most strongly
Support vector machines
A support vector machine is a supervised learning model used for classification and regression. Given labeled training samples from two classes, the training algorithm builds a model that assigns new instances to one class or the other, making it a non-probabilistic binary linear classifier.
svc = SVC()
svc.fit(x_train, y_train)
y_predict = svc.predict(x_test)
acc_svc = round(svc.score(x_train, y_train) * 100, 2)
print('Support Vector Machines:\n\tAccuracy:\t{}'.format(acc_svc))
Support Vector Machines:
Accuracy: 83.5
k-nearest neighbors
k-nearest neighbors is a non-parametric method for classification and regression: a sample is predicted from the values of its nearest neighbors in feature space.
# Search for the best n_neighbors value
range_k = range(3, 9)
list_acc = []
for k in range_k:
    knc = KNeighborsClassifier(n_neighbors=k)
    knc.fit(x_train, y_train)
    y_predict = knc.predict(x_test)
    list_acc.append(round(knc.score(x_train, y_train) * 100, 2))
plt.plot(range_k, list_acc)
plt.xlabel('k for KNN')
plt.ylabel('Accuracy')
Text(0,0.5,'Accuracy')
- The best k is 3
print('k-Nearest Neighbors:\n\tAccuracy:\t{}'.format(list_acc[0]))
k-Nearest Neighbors:
Accuracy: 84.06
Naive Bayes classifier
The naive Bayes classifier assumes each feature is independent of the others; it is highly scalable and well suited to large datasets.
gnb = GaussianNB()
gnb.fit(x_train, y_train)
y_predict = gnb.predict(x_test)
acc_gnb = round(gnb.score(x_train, y_train) * 100, 2)
print('Naive Bayes classifier:\n\tAccuracy:\t{}'.format(acc_gnb))
Naive Bayes classifier:
Accuracy: 76.88
Perceptron
The perceptron is a simple linear binary classifier that learns by updating its weights on misclassified samples.
pct = Perceptron()
pct.fit(x_train, y_train)
y_predict = pct.predict(x_test)
acc_pct = round(pct.score(x_train, y_train) * 100, 2)
print('Perceptron:\n\tAccuracy:\t{}'.format(acc_pct))
Perceptron:
Accuracy: 78.45
D:\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.perceptron.Perceptron'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
"and default tol will be 1e-3." % type(self), FutureWarning)
Linear SVC
Linear SVC is a support vector classifier with a linear kernel; it is often best suited to problems such as text classification.
lsvc = LinearSVC()
lsvc.fit(x_train, y_train)
y_predict = lsvc.predict(x_test)
acc_lsvc = round(lsvc.score(x_train, y_train) * 100, 2)
print('Linear Supported Vector classifier:\n\tAccuracy:\t{}'.format(acc_lsvc))
Linear Supported Vector classifier:
Accuracy: 79.57
Stochastic gradient descent
Stochastic gradient descent iteratively updates the weights along the gradient of the loss function until a minimum is reached. Unlike batch gradient descent, it does not compute each iteration's gradient over the whole dataset; instead it picks data points at random and steps along their gradients.
sgdc = SGDClassifier()
sgdc.fit(x_train, y_train)
y_predict = sgdc.predict(x_test)
acc_sgdc = round(sgdc.score(x_train, y_train) * 100, 2)
print('Stochastic Gradient Descent:\n\tAccuracy:\t{}'.format(acc_sgdc))
Stochastic Gradient Descent:
Accuracy: 80.36
D:\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
"and default tol will be 1e-3." % type(self), FutureWarning)
Decision tree
A decision tree is a tree structure in which each internal node tests an attribute, each branch carries one test outcome, and each leaf node holds a class label.
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
y_predict = dtc.predict(x_test)
acc_dtc = round(dtc.score(x_train, y_train) * 100, 2)
print('Decision Tree:\n\tAccuracy:\t{}'.format(acc_dtc))
Decision Tree:
Accuracy: 86.64
Random forest
A random forest is an ensemble method for classification, regression and other tasks. It trains many decision trees and outputs the class chosen by the majority of the individual trees.
rfc = RandomForestClassifier(n_estimators=10)
rfc.fit(x_train, y_train)
y_predict = rfc.predict(x_test)
acc_rfc = round(rfc.score(x_train, y_train) * 100, 2)
print('Random Forest classifier:\n\tAccuracy:\t{}'.format(acc_rfc))
Random Forest classifier:
Accuracy: 86.64
# Get the feature importances
importances_feature = rfc.feature_importances_
# Sort by importance
index_sorted = np.argsort(importances_feature)
positions = np.arange(index_sorted.shape[0])
plt.figure(figsize=(8, 6))
plt.barh(positions, importances_feature[index_sorted], align='center')
plt.yticks(positions, x_train.columns[index_sorted])
plt.title('Feature importance')
plt.show()
Model accuracy summary
acc_model = pd.DataFrame({'model': ['Logistic Regression', 'k-Nearest Neighbors', 'Support Vector Machines',\
'Naive Bayes classifier', 'Decision Tree', 'Random Forest', 'Perceptron', 'Linear SVC', \
'Stochastic Gradient Descent'], \
'acc': [acc_lgrg, list_acc[0], acc_svc, acc_gnb, acc_dtc, acc_rfc, acc_pct, acc_lsvc, acc_sgdc]})
acc_model.sort_values(by=['acc'], ascending=False)
acc | model | |
4 | 86.64 | Decision Tree |
5 | 86.64 | Random Forest |
1 | 84.06 | k-Nearest Neighbors |
2 | 83.50 | Support Vector Machines |
0 | 80.92 | Logistic Regression |
7 | 79.57 | Linear SVC |
6 | 78.45 | Perceptron |
8 | 77.10 | Stochastic Gradient Descent |
3 | 76.88 | Naive Bayes classifier |
index_sorted = np.argsort(acc_model.acc)
positions = np.arange(index_sorted.shape[0])
plt.figure(figsize=(8, 5))
plt.barh(positions, acc_model.acc[index_sorted], align='center')
plt.yticks(positions, acc_model.model[index_sorted])
plt.title('Model accuracy')
plt.show()
- Decision tree and random forest tie on training accuracy, but a single decision tree tends to overfit the training data, so the random forest is the better choice
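Training-set accuracy flatters models that can memorize, which is the overfitting risk noted above. A minimal sketch of how cross-validation gives a fairer comparison (the data here is synthetic, standing in for the engineered training set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered features: 8 small integer columns
rng = np.random.RandomState(0)
X = rng.randint(0, 4, size=(200, 8))
y = (X[:, 0] + rng.randint(0, 2, size=200) > 2).astype(int)

for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=10, random_state=0)):
    # 5-fold cross-validation scores on held-out folds give a less
    # optimistic estimate than accuracy on the training data itself
    scores = cross_val_score(model, X, y, cv=5)
    train_acc = model.fit(X, y).score(X, y)
    print(type(model).__name__, round(train_acc, 3), round(scores.mean(), 3))
```

On the real x_train/y_train this would typically show the trees' training accuracy well above their cross-validated accuracy.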
Open issues
- Ticket, Cabin, and the length and initial of Name were not included as features
- Some analyses could use more intuitive plots (e.g. overlaying survivor and victim counts against age on one chart to compare the survival share of each age band)
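The overlay idea in the last bullet could be sketched like this (synthetic Age/Survived data stands in for data_train, and the figure is written to a hypothetical age_by_survival.png):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for data_train with Age and Survived columns
rng = np.random.RandomState(1)
df = pd.DataFrame({
    'Age': rng.uniform(0, 80, 300),
    'Survived': rng.randint(0, 2, 300),
})

# Overlay survivor and victim age histograms on one axis so the
# survival share of each age band can be compared directly
fig, ax = plt.subplots(figsize=(8, 5))
bins = np.linspace(0, 80, 17)
for label, grp in df.groupby('Survived'):
    ax.hist(grp.Age, bins=bins, alpha=0.5, label='Survived={}'.format(label))
ax.set_xlabel('Age')
ax.set_ylabel('Count')
ax.legend()
fig.savefig('age_by_survival.png')
```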
Reference solutions on Kaggle
https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
https://www.kaggle.com/startupsci/titanic-data-science-solutions