参考:

  1. https://github.com/apachecn/kaggle/tree/master/competitions/getting-started/titanic
  2. https://www.kaggle.com/goldens/classification-81-3-with-simple-model-nested-cv

1. 问题描述

  • 泰坦尼克号的沉没是历史上最臭名昭著的沉船事件之一。1912年4月15日,泰坦尼克号在处女航中撞上冰山沉没,2224名乘客和船员中1502人遇难。这一耸人听闻的悲剧震惊了国际社会,并导致了更好的船舶安全条例。
  • 沉船造成如此巨大人员伤亡的原因之一是没有足够的救生艇来容纳乘客和船员。虽然在沉船事件中幸存下来也有一些运气的因素,但有些人比其他人更有可能幸存下来,比如妇女、儿童和上层阶级。
  • 在这个挑战中,我们要求你完成对可能存活下来的人的分析。我们特别要求你们运用机器学习工具来预测哪些乘客在灾难中幸存下来。
    【注】
  • 数据来源:https://www.kaggle.com/c/titanic/data
  • 问题定位:二分类问题

2. 主要分析过程

  • 探索性数据分析
  • 特征变量选取
  • 模型选择

3. 探索性数据分析

3.1 变量说明

变量

含义

变量

含义

Survival

生存

Pclass

票类别

Sex

性别

Age

年龄

Sibsp

兄弟姐妹/配偶的数量

Parch

父母/孩子的数量

Ticket

票号

Fare

票价

Cabin

客舱号码

Embarked

登船港口

3.2 读取数据

# 加载包
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import warnings

# 读取数据
train_data = pd.read_csv('train_titanic.csv')
test_data = pd.read_csv('test_titanic.csv')
titanic = pd.concat([train_data, test_data], sort=False)

3.3 数据预览

# 数据探索
print(train_data.info())  
print('-'*30)
print(train_data.head())

输出:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
------------------------------
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

可以看到,数据中有缺失值, 为确定数据缺失的情况,输入如下:

print(titanic.isnull().sum()[titanic.isnull().sum() > 0])

输出:

Survived     418
Age          263
Fare           1
Cabin       1014
Embarked       2
dtype: int64

由于变量 Cabin 缺失较多,这里选择不做处理,直接放弃。对变量 Fare 的缺失值取其均值作为替换,对变量 Embarked 的缺失值取其众数作为替换,对变量 Age 的缺失值按照 Name 和 Sex 的分组 去均值作为替换。

# Fare
train_data['Fare'].fillna(train_data['Fare'].mean(), inplace=True)
test_data['Fare'].fillna(test_data['Fare'].mean(), inplace=True)
# Embarked 为登陆港口,有少量的缺失值,可以先观察 Embarked 的取值情况
print(train_data['Embarked'].value_counts())
# 发现S港口的人较多,因此将缺失值都设为'S'
train_data['Embarked'].fillna('S', inplace=True)
test_data['Embarked'].fillna('S', inplace=True)
#  Cabin 为船舱,有大量的缺失值,无法补齐
# Age
train_data['title'] = train_data.Name.apply(
    lambda x: x.split('.')[0].split(',')[1].strip())
test_data['title'] = test_data.Name.apply(
    lambda x: x.split('.')[0].split(',')[1].strip())
newtitles = {
    "Capt":       "Officer",
    "Col":        "Officer",
    "Major":      "Officer",
    "Jonkheer":   "Royalty",
    "Don":        "Royalty",
    "Sir":       "Royalty",
    "Dr":         "Officer",
    "Rev":        "Officer",
    "the Countess": "Royalty",
    "Dona":       "Royalty",
    "Mme":        "Mrs",
    "Mlle":       "Miss",
    "Ms":         "Mrs",
    "Mr":        "Mr",
    "Mrs":       "Mrs",
    "Miss":      "Miss",
    "Master":    "Master",
    "Lady":      "Royalty"}
train_data['title'] = train_data.title.map(newtitles)
test_data['title'] = test_data.title.map(newtitles)
train_data.groupby(['title', 'Sex']).Age.mean()


def newage(cols):
    title = cols[0]
    Sex = cols[1]
    Age = cols[2]
    if pd.isnull(Age):
        if title == 'Master' and Sex == "male":
            return 4.57
        elif title == 'Miss' and Sex == 'female':
            return 21.8
        elif title == 'Mr' and Sex == 'male':
            return 32.37
        elif title == 'Mrs' and Sex == 'female':
            return 35.72
        elif title == 'Officer' and Sex == 'female':
            return 49
        elif title == 'Officer' and Sex == 'male':
            return 46.56
        elif title == 'Royalty' and Sex == 'female':
            return 40.50
        else:
            return 42.33
    else:
        return Age


train_data.Age = train_data[['title', 'Sex', 'Age']].apply(newage, axis=1)
test_data.Age = test_data[['title', 'Sex', 'Age']].apply(newage, axis=1)

3.4 变量分析

# 变量相关分析
warnings.filterwarnings(action="ignore")
plt.figure(figsize=[10, 30])
plt.subplot(3, 3, 1)
sns.barplot('Pclass', 'Survived', data=train_data)
plt.subplot(3, 3, 2)
sns.barplot('SibSp', 'Survived', data=train_data)
plt.subplot(3, 3, 3)
sns.barplot('Parch', 'Survived', data=train_data)
plt.subplot(3, 3, 4)
sns.barplot('Sex', 'Survived', data=train_data)
plt.subplot(3, 3, 5)
sns.barplot('Ticket', 'Survived', data=train_data)
plt.subplot(3, 3, 6)
sns.barplot('Embarked', 'Survived', data=train_data)
plt.subplot(3, 3, 7)
sns.distplot(train_data[train_data.Survived == 1].Age,
             color='green', kde=False)
sns.distplot(train_data[train_data.Survived == 0].Age,
             color='orange', kde=False)
plt.subplot(3, 3, 8)
sns.distplot(train_data[train_data.Survived == 1].Fare,
             color='green', kde=False)
sns.distplot(train_data[train_data.Survived == 0].Fare,
             color='orange', kde=False)
# 由于变量 Sibsp 和变量 Parch 与 Survival 的相关性不明显,故考虑将这两个变量合并
train_data['Relatives'] = train_data.SibSp+train_data.Parch
test_data['Relatives'] = test_data.SibSp+test_data.Parch
plt.subplot(3, 3, 9)
sns.distplot(train_data[train_data.Survived ==
                        1].Relatives, color='green', kde=False)
sns.distplot(train_data[train_data.Survived ==
                        0].Relatives, color='orange', kde=False)
plt.show()

输出:

泰坦尼克幸存者r语言研究 泰坦尼克幸存者多少人_泰坦尼克

4. 特征选择

# 特征选择
features = ['Pclass', 'Sex', 'Age', 'Relatives', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]

# 将分类变量转换成数值类型
dvec = DictVectorizer(sparse=False)
train_features = dvec.fit_transform(
    train_features.to_dict(orient='record'))  # 将特征向量转化为特征值矩阵
print(dvec.feature_names_)
test_features = dvec.transform(test_features.to_dict(orient='record'))

5. 模型选取

这里主要考虑两种方法:决策树和随机森林。

5.1 决策树

# 决策树模型
# 构造 ID3 决策树
clf = DecisionTreeClassifier(criterion='entropy')
# 决策树训练
clf.fit(train_features, train_labels)


# 用训练集的数据做评估
# 得到决策树准确率
acc_decision_tree = round(
    clf.score(train_features, train_labels), 6)
print(u'score 准确率为 %.4lf' % acc_decision_tree)  # %4lf为宽度为4的double型双精度

# 使用 K 折交叉验证决策树准确率
# 使用 K 折交叉验证 统计决策树准确率
print(u'cross_val_score 准确率为 %.4lf' %
      np.mean(cross_val_score(clf, train_features, train_labels, cv=10)))

输出:

['Age', 'Embarked=C', 'Embarked=Q', 'Embarked=S', 'Fare', 'Pclass', 'Relatives', 'Sex=female', 'Sex=male']
score 准确率为 0.9820
cross_val_score 准确率为 0.7869

5.2 随机森林

# 随机森林
rf = RandomForestClassifier(
    n_estimators=150, min_samples_leaf=2, max_depth=6, oob_score=True)
rf.fit(train_features, train_labels)
print(u'cross_val_score 准确率为 %.4lf' %
      np.mean(cross_val_score(rf, train_features, train_labels, cv=10)))

输出:

cross_val_score 准确率为 0.833

经比较,随机森林的预测效果要优于决策树法。

# 预测
RF_lables = rf.predict(test_features)