Downloading the Titanic dataset in Python and predicting Titanic passenger survival

Reposted

mob6454cc762e37 2023-09-15 16:03:15

Tags: Python Titanic dataset download, visualization, python, data analysis, fields. Category: Python, backend development

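Table of contents

Titanic Passenger Survival Prediction

1. Dataset

1.1 Obtaining the data
1.2 Data preview and main field descriptions

2. Data preprocessing

2.1 Reading the data
2.2 Viewing a data summary
2.3 Selecting the fields to keep
2.4 Problems and solutions
2.5 Finding fields with null values
2.6 Filling null values
2.7 Encoding conversion
2.8 Dropping the name field
2.9 Shuffling the data
2.10 Separating features and labels
2.11 Normalizing the feature values
2.12 The complete preprocessing function

3. Building and applying the model

3.1 Splitting into training and test sets
3.2 Building a multilayer neural network model

3.2.1 Model structure
3.2.2 Model configuration
3.2.3 Model training
3.2.4 Visualizing the training process
3.2.5 Model evaluation

3.3 Applying the model

3.3.1 Adding data to predict
3.3.2 Making predictions
3.3.3 Viewing the prediction results

Other articles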

Titanic Passenger Survival Prediction

1. Dataset

1.1 Obtaining the data

http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
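
The download step can be scripted. The following is a minimal sketch using only the standard library; the function name `download_titanic` and the default file name are illustrative choices, and whether the URL is still reachable is not verified here.

```python
import os
import urllib.request

# URL as given in the article; availability is not guaranteed.
TITANIC_URL = "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls"


def download_titanic(dest="titanic3.xls", url=TITANIC_URL):
    """Download the spreadsheet to dest unless a file is already there."""
    if not os.path.isfile(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

Calling `download_titanic()` a second time does nothing, because the existing local file is reused.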

1.2 Data preview and main field descriptions


2. Data preprocessing

Preprocessing is done with pandas.

2.1 Reading the data

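
Reading the spreadsheet might look like the sketch below. It assumes the file has been saved locally as `titanic3.xls`; the helper name `load_titanic` is hypothetical.

```python
import pandas as pd


def load_titanic(path="titanic3.xls"):
    """Read the Titanic spreadsheet into a DataFrame."""
    # Reading legacy .xls files requires the xlrd package to be installed.
    return pd.read_excel(path)
```

A typical call would be `df_data = load_titanic()`, where `df_data` is just an illustrative variable name.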

2.2 Viewing a data summary

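
A summary of the numeric columns can be produced with `describe()`. The sketch below uses a small made-up DataFrame in place of the real data, so the values are purely illustrative.

```python
import pandas as pd

# Toy stand-in for the loaded data; the values are made up.
df_data = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "age": [29.0, None, 2.0, 30.0],
    "fare": [211.34, 151.55, 151.55, None],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
# The count row excludes null values, which hints at where data is missing.
summary = df_data.describe()
print(summary)
```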

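2.3 Selecting the fields to keep

Not every field in the Excel file is needed for modeling. Only the required feature fields are selected; fields such as ticket and cabin are dropped.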

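
The selection can be sketched as follows. The column list is an assumption: name, sex, age, fare, embarked and survived are used later in the article, while pclass, sibsp and parch are inferred from the dataset's usual columns. `survived` is placed first because the later code takes column 0 as the label once name has been dropped.

```python
import pandas as pd

# Toy stand-in with the same column names as the spreadsheet; values are made up.
df_data = pd.DataFrame({
    "pclass": [1, 3],
    "survived": [1, 0],
    "name": ["Passenger A", "Passenger B"],
    "sex": ["female", "male"],
    "age": [29.0, None],
    "sibsp": [0, 1],
    "parch": [0, 0],
    "ticket": ["T1", "T2"],
    "fare": [211.34, 7.25],
    "cabin": ["B5", None],
    "embarked": ["S", "Q"],
})

# Assumed selection: the label first, then name and the feature fields.
selected_cols = ["survived", "name", "pclass", "sex", "age",
                 "sibsp", "parch", "fare", "embarked"]
selected_df_data = df_data[selected_cols]
print(selected_df_data.columns.tolist())
```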

2.4 Problems and solutions


2.5 Finding fields with null values

The isnull() method checks every element and produces a boolean matrix over all the data; an element that is null or NA shows as True.

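
A minimal illustration of the element-level check, using a made-up three-row DataFrame instead of the real data:

```python
import pandas as pd

# Toy data with one missing age and one missing fare.
df = pd.DataFrame({
    "age": [29.0, None, 2.0],
    "fare": [211.34, 151.55, None],
    "sex": ["female", "male", "male"],
})

# Same shape as df; True marks a null/NA element.
null_mask = df.isnull()
print(null_mask)
```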

Chaining the any() method gives a column-level check: a column is True as soon as it contains one null or NA element.

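
The column-level check can be sketched on the same kind of toy data (values made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [29.0, None, 2.0],
    "fare": [211.34, 151.55, None],
    "sex": ["female", "male", "male"],
})

# any() reduces each column of the boolean matrix to a single True/False.
has_null = df.isnull().any()
print(has_null)
```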

To find out how many elements in each column are empty, count them with the sum() method.

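
Counting works because True is treated as 1 when summed. A short sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [29.0, None, 2.0],
    "fare": [211.34, 151.55, None],
    "sex": ["female", "male", "male"],
})

# Number of null elements per column.
null_counts = df.isnull().sum()
print(null_counts)
```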

Select the records that contain missing values.

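
One way to select such records is a row-level `any(axis=1)` used as a boolean filter. This is a sketch on made-up data and may differ from the expression the original screenshot used.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [29.0, None, 2.0],
    "fare": [211.34, 151.55, None],
    "sex": ["female", "male", "male"],
})

# any(axis=1) is True for a row if any of its columns is null.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```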

2.6 Filling null values

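
The filling rules used by the article's complete preprocessing function (mean for age and fare, 'S' for embarked) can be tried out on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "fare": [10.0, 30.0, None],
    "embarked": ["S", None, "C"],
})

# mean() skips nulls, so the fill value is the mean of the known entries.
df["age"] = df["age"].fillna(df["age"].mean())
df["fare"] = df["fare"].fillna(df["fare"].mean())
# embarked is categorical, so a fixed value is used instead of a mean.
df["embarked"] = df["embarked"].fillna("S")
print(df)
```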

2.7 Encoding conversion

Text values are converted to numeric codes.


View the data.

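
A short sketch of the conversion on made-up data, using the same mappings as the article's complete preprocessing function. Any value missing from the mapping (including null) becomes NaN, and `astype(int)` then raises an error, which is why null values are filled first.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["female", "male", "female"],
    "embarked": ["C", "Q", "S"],
})

# map() replaces each string by its code; astype(int) fixes the dtype.
df["sex"] = df["sex"].map({"female": 0, "male": 1}).astype(int)
df["embarked"] = df["embarked"].map({"C": 0, "Q": 1, "S": 2}).astype(int)
print(df)
```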

2.8 Dropping the name field

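
Dropping the column can be sketched as follows on made-up data. Note that `drop()` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "survived": [1, 0],
    "name": ["Passenger A", "Passenger B"],
    "age": [29.0, 30.0],
})

# axis=1 means "drop a column" rather than a row.
dropped = df.drop(["name"], axis=1)
print(dropped.columns.tolist())
```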

2.9 Shuffling the data

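
Shuffling can be done by sampling 100% of the rows. In this sketch on made-up data, `random_state=0` is an addition for reproducibility; the article's own function calls `sample(frac=1)` without it.

```python
import pandas as pd

df = pd.DataFrame({
    "survived": [1, 0, 1, 0, 1],
    "age": [29.0, 30.0, 2.0, 25.0, 48.0],
})

# frac=1 draws every row exactly once, in random order.
shuffled = df.sample(frac=1, random_state=0)
print(shuffled)
```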

2.10 Separating features and labels

survived is the label; the other fields are features.


Features


Labels

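
The separation relies on survived being the first column. A minimal sketch on made-up data, using the same slicing as the article's complete preprocessing function:

```python
import pandas as pd

df = pd.DataFrame({
    "survived": [1, 0, 1],
    "pclass": [1, 3, 2],
    "age": [29.0, 30.0, 2.0],
})

# Convert to a NumPy array, then slice by column position.
ndarray_data = df.values
features = ndarray_data[:, 1:]  # every column except the first
label = ndarray_data[:, 0]      # the first column (survived)
print(features.shape, label.shape)
```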

2.11 Normalizing the feature values

sklearn is used to rescale each feature into the range 0 to 1 (min-max scaling).


Before normalization


After normalization

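
The effect of min-max scaling is easiest to see on a tiny made-up array. Each column is rescaled independently as (x - min) / (max - min):

```python
import numpy as np
from sklearn import preprocessing

# Two toy feature columns with different ranges.
features = np.array([
    [1.0, 20.0],
    [2.0, 30.0],
    [3.0, 60.0],
])

minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
norm_features = minmax_scale.fit_transform(features)
print(norm_features)
# [[0.   0.  ]
#  [0.5  0.25]
#  [1.   1.  ]]
```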

2.12 The complete preprocessing function

The code from the steps above is combined and wrapped in a function.

# sklearn.preprocessing is the module dedicated to preprocessing
from sklearn import preprocessing

def prepare_data(selected_df_data):
    # Work on a copy so the caller's DataFrame is not modified
    # (this also avoids pandas' SettingWithCopyWarning)
    selected_df_data = selected_df_data.copy()

    # fillna() only fills null values
    # Fill the age field with its mean
    age_mean_value = selected_df_data['age'].mean()
    selected_df_data['age'] = selected_df_data['age'].fillna(age_mean_value)

    # Fill the fare field with its mean
    fare_mean_value = selected_df_data['fare'].mean()
    selected_df_data['fare'] = selected_df_data['fare'].fillna(fare_mean_value)

    # Fill the embarked field with 'S'
    selected_df_data['embarked'] = selected_df_data['embarked'].fillna('S')

    # Convert the sex field from strings to numbers
    # astype() converts the column's data type
    selected_df_data['sex'] = selected_df_data['sex'].map({'female': 0, 'male': 1}).astype(int)

    # Convert the embarked field from strings to numbers
    selected_df_data['embarked'] = selected_df_data['embarked'].map({'C': 0, 'Q': 1, 'S': 2}).astype(int)

    # drop() returns a new DataFrame
    # axis=1 means a column is dropped
    selected_df_data = selected_df_data.drop(['name'], axis=1)

    # Shuffle the rows in preparation for training
    # sample() draws a random sample; frac is the fraction of rows to draw,
    # so frac=1 returns all rows in random order
    shuffled_df_data = selected_df_data.sample(frac=1)

    # Convert to an ndarray
    ndarray_data = shuffled_df_data.values

    # Extract the features (every column except the first)
    features = ndarray_data[:, 1:]

    # Extract the labels (the first column, survived)
    label = ndarray_data[:, 0]

    # Normalize the features
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))

    # The norm_ prefix marks the features as already normalized
    norm_features = minmax_scale.fit_transform(features)

    return norm_features, label

3. Building and applying the model

3.1 Splitting into training and test sets

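
Because the data has already been shuffled, a split can be a simple slice. The sketch below assumes an 80/20 ratio, which is not stated in the text, and uses random arrays as stand-ins for the output of `prepare_data`.

```python
import numpy as np

# Stand-ins for the preprocessed data: 10 samples with 7 features each.
rng = np.random.default_rng(0)
norm_features = rng.random((10, 7))
label = rng.integers(0, 2, size=10)

# Assumed ratio: the first 80% of rows for training, the rest for testing.
train_size = int(len(norm_features) * 0.8)
x_train, y_train = norm_features[:train_size], label[:train_size]
x_test, y_test = norm_features[train_size:], label[train_size:]
print(x_train.shape, x_test.shape)
```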

3.2 Building a multilayer neural network model

3.2.1 Model structure


Adding layers

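
A multilayer network for this task could be sketched with `tf.keras.Sequential` as shown here. The layer sizes and activations are assumptions, since the original architecture is not recoverable from the text. The input size of 7 assumes seven feature columns remain after preprocessing, and the single sigmoid output gives a survival probability.

```python
import tensorflow as tf

model = tf.keras.Sequential()

# Assumed input: 7 normalized features per passenger.
model.add(tf.keras.Input(shape=(7,)))

# Hidden layers; the sizes 64 and 32 are illustrative choices.
model.add(tf.keras.layers.Dense(64, activation="relu"))
model.add(tf.keras.layers.Dense(32, activation="relu"))

# One sigmoid unit for binary classification (survived or not).
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model.summary()
```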

3.2.2 Model configuration


API reference: https://www.tensorflow.org/versions/r1.10/api_docs/python/tf/keras/Model

3.2.3 Model training


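The history attribute of train_history holds 4 entries, one for each metric monitored during training and validation. These can be used to plot training loss against validation loss and training accuracy against validation accuracy.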
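
To see where the four entries come from, here is a small sketch that compiles and trains a tiny model on random stand-in data. The optimizer, loss, epochs, batch size and validation split are all assumptions rather than the article's settings. One accuracy metric plus the loss, each recorded for training and validation, gives four keys.

```python
import numpy as np
import tensorflow as tf

# Random stand-in data: 40 samples, 7 features, binary labels.
rng = np.random.default_rng(0)
x_train = rng.random((40, 7)).astype("float32")
y_train = rng.integers(0, 2, size=40).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(7,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Assumed settings for a binary classifier.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# validation_split holds out the last 20% of the samples for validation.
train_history = model.fit(x_train, y_train, validation_split=0.2,
                          epochs=2, batch_size=8, verbose=0)

# One list per monitored metric, with one value per epoch.
print(sorted(train_history.history.keys()))
```

The exact key names depend on the TensorFlow version and on the string passed to `metrics`: older releases report `acc` and `val_acc`, while TF 2.x with `metrics=["accuracy"]` reports `accuracy` and `val_accuracy`.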

3.2.4 Visualizing the training process

Define the visualization function.

import matplotlib.pyplot as plt

# train_history      the training history object returned by model.fit()
# train_metric       training metric to plot, e.g. 'acc' or 'loss'
# validation_metric  validation metric to plot, e.g. 'val_acc' or 'val_loss'
# Note: in TF 2.x with metrics=['accuracy'], the keys are 'accuracy' and 'val_accuracy'
def visu_train_history(train_history, train_metric, validation_metric):
    plt.plot(train_history.history[train_metric])
    plt.plot(train_history.history[validation_metric])
    plt.title('Train History')
    plt.ylabel(train_metric)
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper right')
    plt.show()

View the training accuracy and validation accuracy.

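
Since the accuracy key differs between TensorFlow versions, a small helper can pick whichever name is present before plotting. The helper `pick_metric_key` is hypothetical, and the history dictionary below is a made-up stand-in for `train_history.history`.

```python
def pick_metric_key(history, candidates):
    """Return the first candidate key that exists in the history dict."""
    for key in candidates:
        if key in history:
            return key
    raise KeyError(f"none of {candidates} found in {sorted(history)}")


# Illustrative stand-in for train_history.history; the numbers are made up.
history = {
    "loss": [0.70, 0.60],
    "accuracy": [0.60, 0.70],
    "val_loss": [0.70, 0.65],
    "val_accuracy": [0.55, 0.65],
}

train_key = pick_metric_key(history, ["acc", "accuracy"])
val_key = pick_metric_key(history, ["val_acc", "val_accuracy"])
print(train_key, val_key)  # accuracy val_accuracy
```

With a real training run, the plot would then be drawn with `visu_train_history(train_history, train_key, val_key)`.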