Predicting Titanic Survival with Logistic Regression
Loading the data
import pandas as pd
import keras
from keras import layers
import numpy as np
Using TensorFlow backend.
data = pd.read_csv("./data/tt_train.csv")
data.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
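Analysis
PassengerId certainly has no effect on survival, so drop it. Name is of little use, so drop it. Ticket is just the ticket ID and likewise has no effect, so drop it. Cabin is the cabin number; it probably matters, but too many values are missing (data.info() shows only 204 of 891 entries are non-null), so drop it as well. Embarked is the port of embarkation (S, C or Q), which may also affect survival, so keep it.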
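The "too many missing values" argument can be made concrete by computing the fraction of missing values per column. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not the real data); on the real data the same expression would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic training data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() gives True/False per cell; the mean of booleans is the
# fraction of missing values in each column. Sort largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

From the data.info() output above, Cabin is missing in 687 of 891 rows (about 77%), while Age is missing in 177 rows (about 20%), which is why Cabin is dropped and Age is filled in later.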
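Data preprocessing
Drop the unwanted features, convert non-numeric features to numbers, and handle missing values.
# Show all column names, handy for copying the ones we need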
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
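Now select the useful columns. The DataFrame x holds the features, but for now it also keeps the target column Survived: preprocessing might drop some rows, and keeping the target next to the features guarantees the two stay aligned. Once preprocessing is done, take y out and delete the target column from x.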
x = data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp',
'Parch', 'Fare', 'Embarked']]
x.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
# Now handle Embarked (port of embarkation); first check how many distinct values it has
x.Embarked.unique()
array(['S', 'C', 'Q', nan], dtype=object)
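There are 3 distinct values (not counting the missing value), so use a 3-dimensional one-hot encoding.
Pay attention to the trick used here to build the one-hot encoding.
# Comparing the column against one category gives True/False for every sample
(x.Embarked=='S').head() # too long, so show only the first 5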
0 True
1 False
2 True
3 True
4 True
Name: Embarked, dtype: bool
# Convert to 1/0
(x.Embarked=='S').astype('int').head()
0 1
1 0
2 1
3 1
4 1
Name: Embarked, dtype: int32
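# Copy before adding columns, otherwise pandas emits a warning
x=x.copy()
# Add the one-hot columns as shown above; the encoding is 3-dimensional, so add three feature columns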
x.loc[:,'Embarked_S']=(x.Embarked=='S').astype('int')
x.loc[:,'Embarked_C']=(x.Embarked=='C').astype('int')
x.loc[:,'Embarked_Q']=(x.Embarked=='Q').astype('int')
x.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Embarked_S | Embarked_C | Embarked_Q |
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | 1 | 0 | 0 |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 0 | 1 | 0 |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | 1 | 0 | 0 |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | 1 | 0 | 0 |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | 1 | 0 | 0 |
Note that under this one-hot encoding a missing value (NaN) is encoded as [0,0,0].
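The all-zero encoding of NaN follows from the fact that comparing NaN with any string is False. A small sketch with made-up values shows this, together with one possible alternative (not used in this notebook): passing `dummy_na=True` to `pd.get_dummies` so that missing values get their own indicator column.

```python
import numpy as np
import pandas as pd

# Made-up Embarked values; the third one is missing.
embarked = pd.Series(["S", "C", np.nan, "Q"], name="Embarked")

# NaN == 'S' (or 'C', 'Q') is always False, so the missing row
# is encoded as 0, 0, 0.
manual = pd.DataFrame({
    "Embarked_S": (embarked == "S").astype("int"),
    "Embarked_C": (embarked == "C").astype("int"),
    "Embarked_Q": (embarked == "Q").astype("int"),
})
print(manual.iloc[2].tolist())  # [0, 0, 0]

# Alternative: give missing values an explicit indicator column.
with_nan = pd.get_dummies(embarked, prefix="Embarked", dummy_na=True)
print(with_nan.columns.tolist())
```

Whether the explicit NaN column helps depends on the data; with only 2 missing Embarked values out of 891, the all-zero encoding is a reasonable choice here.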
# Delete the original Embarked column
del x['Embarked']
x.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_S | Embarked_C | Embarked_Q |
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 0 |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | 0 | 1 | 0 |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 0 |
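The only column that is still non-numeric is Sex. It needs one-hot encoding too, and this is a good chance to learn another way: call pd.get_dummies() directly.
By default this function one-hot encodes every non-numeric column of the DataFrame passed to it.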
x=pd.get_dummies(x)
x.head()
| Survived | Pclass | Age | SibSp | Parch | Fare | Embarked_S | Embarked_C | Embarked_Q | Sex_female | Sex_male |
0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 0 | 0 | 1 |
1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 1 | 0 | 1 | 0 |
2 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 1 | 0 |
3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 1 | 0 |
4 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 0 | 0 | 1 |
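Think about it: when can a two-valued field be made numeric without one-hot encoding? When it has no missing values. In the current sample set x, Sex indeed has no missing values, so it could be converted directly to 0/1. But can we guarantee that the validation set, the test set, and all future samples have no missing data either? (For the Titanic dataset we actually can, because the event is in the past and no new samples will ever be added, but that is only because of this dataset's special background.)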
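To illustrate the point, here is a sketch comparing a single 0/1 column with one-hot encoding when a missing value does show up. The values are made up, and the mapping dict is just one possible choice of coding.

```python
import numpy as np
import pandas as pd

# Made-up Sex values; the third one is missing.
sex = pd.Series(["male", "female", np.nan, "female"], name="Sex")

# Single 0/1 column: only works while every value is 'male' or 'female'.
# map() turns anything not found in the dict (including NaN) into NaN.
single = sex.map({"male": 0, "female": 1})
print(single.isnull().sum())  # rows that could not be encoded

# One-hot encoding: the missing row becomes all zeros instead of NaN.
onehot = pd.get_dummies(sex, prefix="Sex")
print(onehot.iloc[2].tolist())
```

With the single column, the NaN has to be handled separately before the data can be fed to the model; with one-hot encoding the missing row is still a valid numeric input.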
x.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked_S 891 non-null int32
Embarked_C 891 non-null int32
Embarked_Q 891 non-null int32
Sex_female 891 non-null uint8
Sex_male 891 non-null uint8
dtypes: float64(2), int32(3), int64(4), uint8(2)
memory usage: 54.0 KB
# Handle missing Age values: fill them with the mean
x['Age'] = x.Age.fillna(x.Age.mean())
x.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked_S 891 non-null int32
Embarked_C 891 non-null int32
Embarked_Q 891 non-null int32
Sex_female 891 non-null uint8
Sex_male 891 non-null uint8
dtypes: float64(2), int32(3), int64(4), uint8(2)
memory usage: 54.0 KB
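A note on mean filling: the test-set preprocessing later in this notebook fills Age with the test set's own mean. A common alternative, sketched below with made-up numbers, is to compute the fill value once on the training data and reuse it everywhere, so that training and test samples are transformed identically. The names `train_age`, `test_age` and `age_mean` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up ages; NaN marks a missing value.
train_age = pd.Series([22.0, np.nan, 26.0, 36.0], name="Age")
test_age = pd.Series([np.nan, 47.0], name="Age")

# Compute the fill value once, on the training data only.
# Series.mean() skips NaN by default.
age_mean = train_age.mean()

# Reuse the same value for both sets.
train_filled = train_age.fillna(age_mean)
test_filled = test_age.fillna(age_mean)
print(age_mean, test_filled.tolist())
```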
x.head()
| Survived | Pclass | Age | SibSp | Parch | Fare | Embarked_S | Embarked_C | Embarked_Q | Sex_female | Sex_male |
0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 0 | 0 | 1 |
1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 1 | 0 | 1 | 0 |
2 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 1 | 0 |
3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 1 | 0 |
4 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 0 | 0 | 1 |
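Now look at the Pclass column, which is the ticket class. Its values are numeric, but they carry no real ratio or additive relationship (try explaining to a computer that a third-class ticket is not "three times" a first-class ticket, even though 3 is three times 1). In essence it is still a categorical feature, so one-hot encode it as well.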
x.loc[:,'P1'] = (x.Pclass==1).astype('int')
x.loc[:,'P2'] = (x.Pclass==2).astype('int')
x.loc[:,'P3'] = (x.Pclass==3).astype('int')
del x['Pclass']
x.head()
| Survived | Age | SibSp | Parch | Fare | Embarked_S | Embarked_C | Embarked_Q | Sex_female | Sex_male | P1 | P2 | P3 |
0 | 0 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
2 | 1 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
3 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
4 | 0 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
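The same encoding of Pclass can also be obtained from `pd.get_dummies` by naming the column explicitly, since numeric columns are not encoded by default. This is a sketch on a made-up frame; `prefix="P"` and `prefix_sep=""` are chosen only so that the generated names match the P1/P2/P3 used above.

```python
import pandas as pd

# Made-up rows with a numeric class column.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Fare": [7.25, 71.2833, 7.925, 13.0],
})

# columns= forces one-hot encoding of the numeric Pclass column;
# the original Pclass column is removed automatically.
encoded = pd.get_dummies(toy, columns=["Pclass"], prefix="P", prefix_sep="")
print(encoded.columns.tolist())  # ['Fare', 'P1', 'P2', 'P3']
```

One caveat: get_dummies only creates columns for the values it actually sees, so a subset that happens to contain no second-class passengers would have no P2 column. The explicit comparisons used above always produce all three columns.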
# Preprocessing is done: take the target out of x, then delete it from x
y = x.Survived
del x['Survived']
x.shape, y.shape
((891, 12), (891,))
Building and training the model
# Sequential model
model = keras.Sequential()
# Fully connected layer: 1 output, 12 inputs, sigmoid activation
model.add(layers.Dense(1, input_dim=12, activation='sigmoid'))
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 1) 13
=================================================================
Total params: 13
Trainable params: 13
Non-trainable params: 0
_________________________________________________________________
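The 13 parameters in the summary are the 12 weights plus 1 bias of a single sigmoid unit, which is exactly logistic regression. The NumPy sketch below spells out the forward pass; the random weights and inputs are made up for illustration and have nothing to do with the trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features = 12

w = rng.normal(size=n_features)  # 12 weights, one per input feature
b = 0.0                          # 1 bias
n_params = w.size + 1            # 13, matching the model summary

# 5 made-up samples with 12 features each.
X = rng.normal(size=(5, n_features))

# Dense(1, activation='sigmoid') computes sigmoid(X @ w + b):
# one predicted probability per sample.
p = sigmoid(X @ w + b)
print(n_params, p.shape)
```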
# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy', # binary cross-entropy as the loss for binary classification
    metrics=['acc'] # report accuracy (fraction of correct predictions) during training
)
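For reference, the loss and metric chosen above can be written out by hand. This is a minimal NumPy sketch, not Keras's actual implementation; the clipping constant `eps` and the 0.5 threshold are assumptions made here for numerical safety and illustration.

```python
import numpy as np

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.mean(losses))

def accuracy(y_true, p, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    return float(np.mean((p > threshold).astype(int) == y_true))

# Made-up labels and predicted probabilities.
y_true = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.4, 0.6])

loss = binary_crossentropy(y_true, p)
acc = accuracy(y_true, p)
print(round(loss, 4), acc)
```

Confident correct predictions (0.9 for a positive, 0.2 for a negative) contribute little to the loss, while the two wrong predictions dominate it; accuracy only sees that 2 of 4 are on the right side of the threshold.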
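# Train the model; the return value holds information about the training process
history = model.fit(x, y, epochs=300, verbose=0) # verbose=0 prints nothing to stdout; otherwise this part gets far too long in the exported markdown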
# See which values were recorded (the object's history attribute is a dict; keys() returns its keys)
history.history.keys()
dict_keys(['loss', 'acc'])
Plotting the training process
The dict keeps the full history of loss and acc (not just the final values), so plot them to have a look.
import matplotlib.pyplot as plt
%matplotlib inline
Plot how the loss changes:
# Here 300 == len(history.history.get('loss')) == len(history.history.get('acc')) == epochs
plt.plot(range(300),history.history.get('loss'))
[<matplotlib.lines.Line2D at 0x151c0eb8>]
Plot how acc changes:
plt.plot(range(300),history.history.get('acc'))
[<matplotlib.lines.Line2D at 0x1621a5f8>]
Exporting predictions for submission to Kaggle
# Read the test set and apply the same preprocessing
df = pd.read_csv("./data/tt_test.csv")
df.head()
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
# Note that the test set has no target (Survived column) at all, so there is nothing to carry along
xt = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
xt = xt.copy()
xt.loc[:,'Embarked_S']=(xt.Embarked=='S').astype('int')
xt.loc[:,'Embarked_C']=(xt.Embarked=='C').astype('int')
xt.loc[:,'Embarked_Q']=(xt.Embarked=='Q').astype('int')
del xt['Embarked']
xt = pd.get_dummies(xt)
xt['Age'] = xt.Age.fillna(xt.Age.mean())
xt.loc[:,'P1'] = (xt.Pclass==1).astype('int')
xt.loc[:,'P2'] = (xt.Pclass==2).astype('int')
xt.loc[:,'P3'] = (xt.Pclass==3).astype('int')
del xt['Pclass']
xt.head()
| Age | SibSp | Parch | Fare | Embarked_S | Embarked_C | Embarked_Q | Sex_female | Sex_male | P1 | P2 | P3 |
0 | 34.5 | 0 | 0 | 7.8292 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
1 | 47.0 | 1 | 0 | 7.0000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
2 | 62.0 | 0 | 0 | 9.6875 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
3 | 27.0 | 0 | 0 | 8.6625 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
4 | 22.0 | 1 | 1 | 12.2875 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
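Since the test set must go through exactly the same steps as the training set, the preprocessing can be wrapped in a function instead of being copied. The sketch below is one possible way to do that, under assumptions: the function name `preprocess`, the toy data, and the use of `get_dummies` plus `reindex` are illustrative, and the resulting column names (for example `Pclass_1`) differ from the P1/P2/P3 used in this notebook.

```python
import numpy as np
import pandas as pd

FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

def preprocess(df, fill_values):
    """Select features, fill missing numeric values, one-hot encode categories."""
    out = df[FEATURES].copy()
    out = out.fillna(value=fill_values)  # dict: column name -> fill value
    out = pd.get_dummies(out, columns=["Embarked", "Sex", "Pclass"])
    return out.astype("float32")

# Made-up training and test rows.
train = pd.DataFrame({
    "Pclass": [1, 2, 3], "Sex": ["male", "female", "male"],
    "Age": [22.0, np.nan, 30.0], "SibSp": [1, 0, 0], "Parch": [0, 0, 1],
    "Fare": [71.3, 13.0, 7.25], "Embarked": ["C", "S", "Q"],
})
test = pd.DataFrame({
    "Pclass": [3, 3], "Sex": ["male", "male"],
    "Age": [np.nan, 40.0], "SibSp": [0, 1], "Parch": [0, 0],
    "Fare": [8.05, np.nan], "Embarked": ["S", "S"],
})

# Fill values are computed on the training data and reused for the test data.
fill = train[["Age", "Fare"]].mean().to_dict()
x_train = preprocess(train, fill)
x_test = preprocess(test, fill)

# The test rows contain fewer categories, so some one-hot columns are missing.
# reindex adds them as all-zero columns and enforces the training column order.
x_test = x_test.reindex(columns=x_train.columns, fill_value=0)
print(x_test.columns.tolist())
```

The reindex step matters because the model expects the 12 input columns in a fixed order; a missing or reordered column would silently feed the wrong feature to each weight.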
# Fare has a missing value in the test set. If it is left unfilled, the prediction for that row is NaN
# and the comparison below raises "RuntimeWarning: invalid value encountered in greater"
xt['Fare'] = xt.Fare.fillna(xt.Fare.mean())
# Compute the predictions
predictions = model.predict(xt)
type(predictions)
numpy.ndarray
# Generate the submission csv
submission = pd.DataFrame({"PassengerId": df["PassengerId"], "Survived": (predictions.flatten()>0.5).astype('int')})
submission.to_csv("./data/tt_upload.csv", index=False)
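A single NaN in the input features propagates to a NaN prediction, and NaN > 0.5 evaluates to False, so such a row would quietly be submitted as 0. It can help to check the predictions before thresholding. The sketch below uses a made-up array in place of the real model output.

```python
import numpy as np

# Made-up stand-in for model.predict(xt): shape (n_samples, 1).
predictions = np.array([[0.8], [np.nan], [0.3]])

flat = predictions.flatten()

# Positions of rows whose prediction is NaN; an empty result means all is well.
nan_rows = np.flatnonzero(np.isnan(flat))
print(nan_rows)

# Threshold only once the NaN rows are understood (here they are shown as-is).
labels = (np.nan_to_num(flat, nan=0.0) > 0.5).astype("int")
print(labels)
```

If `nan_rows` is not empty, the usual cause is an unfilled missing value in the corresponding input rows, which `xt.isnull().sum()` would reveal.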