Logistic回归预测Titanic

读取数据
import pandas as pd
import keras
from keras import layers
import numpy as np
Using TensorFlow backend.
data = pd.read_csv("./data/tt_train.csv")
data.head()



PassengerId

Survived

Pclass

Name

Sex

Age

SibSp

Parch

Ticket

Fare

Cabin

Embarked

0

1

0

3

Braund, Mr. Owen Harris

male

22.0

1

0

A/5 21171

7.2500

NaN

S

1

2

1

1

Cumings, Mrs. John Bradley (Florence Briggs Th...

female

38.0

1

0

PC 17599

71.2833

C85

C

2

3

1

3

Heikkinen, Miss. Laina

female

26.0

0

0

STON/O2. 3101282

7.9250

NaN

S

3

4

1

1

Futrelle, Mrs. Jacques Heath (Lily May Peel)

female

35.0

1

0

113803

53.1000

C123

S

4

5

0

3

Allen, Mr. William Henry

male

35.0

0

0

373450

8.0500

NaN

S

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
分析

PassengerId肯定不影响,去掉;Name没什么用,去掉;Ticket是票的id号也肯定不影响,取消;Cabin是船舱编码,虽然有影响,但是缺失值太多(从data.info就能看到),所以也去掉;Embarked是船舱位置,也会影响是否获救。

数据预处理

去掉不要的特征、把非数值特征数值化、处理缺失值。

# 查看所有的列索引名,方便去复制要用的列
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

现在要取有用的列。x这个df用来保存特征,但这里先带着预测值Survived这一列,因为一会要做数据预处理,万一某些行被删掉了呢?等到预处理完了再把y取出来,并在x中把预测值这列删掉!

x = data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked']]
x.head()



Survived

Pclass

Sex

Age

SibSp

Parch

Fare

Embarked

0

0

3

male

22.0

1

0

7.2500

S

1

1

1

female

38.0

1

0

71.2833

C

2

1

3

female

26.0

0

0

7.9250

S

3

1

1

female

35.0

1

0

53.1000

S

4

0

3

male

35.0

0

0

8.0500

S

# 现在要处理Embarked(船舱位置),先看一下它有多少种取值
x.Embarked.unique()
array(['S', 'C', 'Q', nan], dtype=object)

看出来了有3种取值(缺失值不算一种的话),这里用3个维度的one-hot编码。
这里要关注一下变换成one-hot编码的技巧。

# 原来的每一类,对所有样本都可以得到True/False
(x.Embarked=='S').head() # 太长了只看前5个
0     True
1    False
2     True
3     True
4     True
Name: Embarked, dtype: bool
# 转换成1/0
(x.Embarked=='S').astype('int').head()
0    1
1    0
2    1
3    1
4    1
Name: Embarked, dtype: int32
# 添加列之前copy一下,不然有警告
x=x.copy()
# 用上面的方式添加one-hot编码,因为是3维的所以要添加三列特征
x.loc[:,'Embarked_S']=(x.Embarked=='S').astype('int')
x.loc[:,'Embarked_C']=(x.Embarked=='C').astype('int')
x.loc[:,'Embarked_Q']=(x.Embarked=='Q').astype('int')
x.head()



Survived

Pclass

Sex

Age

SibSp

Parch

Fare

Embarked

Embarked_S

Embarked_C

Embarked_Q

0

0

3

male

22.0

1

0

7.2500

S

1

0

0

1

1

1

female

38.0

1

0

71.2833

C

0

1

0

2

1

3

female

26.0

0

0

7.9250

S

1

0

0

3

1

1

female

35.0

1

0

53.1000

S

1

0

0

4

0

3

male

35.0

0

0

8.0500

S

1

0

0

注意,在这种one-hot编码下,缺失值NaN显然是[0,0,0]

# 删除原始的Embarked列
del x['Embarked']
x.head()



Survived

Pclass

Sex

Age

SibSp

Parch

Fare

Embarked_S

Embarked_C

Embarked_Q

0

0

3

male

22.0

1

0

7.2500

1

0

0

1

1

1

female

38.0

1

0

71.2833

0

1

0

2

1

3

female

26.0

0

0

7.9250

1

0

0

3

1

1

female

35.0

1

0

53.1000

1

0

0

4

0

3

male

35.0

0

0

8.0500

1

0

0

还剩Sex这一列没有数值化,这一列也要进行one-hot编码,这里不妨学习另一种方式:直接用pd.get_dummies()这个函数,它会将传入的dataframe的非数值列都进行one-hot编码。

x=pd.get_dummies(x)
x.head()



Survived

Pclass

Age

SibSp

Parch

Fare

Embarked_S

Embarked_C

Embarked_Q

Sex_female

Sex_male

0

0

3

22.0

1

0

7.2500

1

0

0

0

1

1

1

1

38.0

1

0

71.2833

0

1

0

1

0

2

1

3

26.0

0

0

7.9250

1

0

0

1

0

3

1

1

35.0

1

0

53.1000

1

0

0

1

0

4

0

3

35.0

0

0

8.0500

1

0

0

0

1

思考一下,什么情况下两种取值的字段不用one-hot编码就能数值化?当不存在缺失值的时候,现在这个样本集x的性别确实就不存在缺失值,是可以直接转换成0/1,但是能保证验证集、测试集和以后的样本都不存在缺失数据吗?(Titanic数据集确实可以保证,因为这件事已经过去了,样本不会在增加了,但这是因为这个数据集"背景"比较特殊)

x.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Age           714 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_S    891 non-null int32
Embarked_C    891 non-null int32
Embarked_Q    891 non-null int32
Sex_female    891 non-null uint8
Sex_male      891 non-null uint8
dtypes: float64(2), int32(3), int64(4), uint8(2)
memory usage: 54.0 KB
# 处理Age的缺失值:用均值填充
x['Age'] = x.Age.fillna(x.Age.mean())
x.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_S    891 non-null int32
Embarked_C    891 non-null int32
Embarked_Q    891 non-null int32
Sex_female    891 non-null uint8
Sex_male      891 non-null uint8
dtypes: float64(2), int32(3), int64(4), uint8(2)
memory usage: 54.0 KB
x.head()



Survived

Pclass

Age

SibSp

Parch

Fare

Embarked_S

Embarked_C

Embarked_Q

Sex_female

Sex_male

0

0

3

22.0

1

0

7.2500

1

0

0

0

1

1

1

1

38.0

1

0

71.2833

0

1

0

1

0

2

1

3

26.0

0

0

7.9250

1

0

0

1

0

3

1

1

35.0

1

0

53.1000

1

0

0

1

0

4

0

3

35.0

0

0

8.0500

1

0

0

0

1

现在看到Pclass这一列,这一列表示票的等级,它虽然值是数值的,但实际上并没有数值间的那种倍比、加减关系(试想如何向计算机阐释"三等票不是一等票的三倍,虽然它数值上是三倍关系"),本质上它还是一个"类别特征",所以这里对它也进行one-hot编码。

x.loc[:,'P1'] = (x.Pclass==1).astype('int') 
x.loc[:,'P2'] = (x.Pclass==2).astype('int') 
x.loc[:,'P3'] = (x.Pclass==3).astype('int') 
del x['Pclass']
x.head()



Survived

Age

SibSp

Parch

Fare

Embarked_S

Embarked_C

Embarked_Q

Sex_female

Sex_male

P1

P2

P3

0

0

22.0

1

0

7.2500

1

0

0

0

1

0

0

1

1

1

38.0

1

0

71.2833

0

1

0

1

0

1

0

0

2

1

26.0

0

0

7.9250

1

0

0

1

0

0

0

1

3

1

35.0

1

0

53.1000

1

0

0

1

0

1

0

0

4

0

35.0

0

0

8.0500

1

0

0

0

1

0

0

1

# 现在预处理完成了,把预测值取出来,并在x中把它删掉
y = data.Survived
del x['Survived']
x.shape, y.shape
((891, 12), (891,))
建立和训练模型
# 顺序模型
model = keras.Sequential()
#  全连接层,输出1维,输入12维度,使用Sigmoid作为激活函数
model.add(layers.Dense(1, input_dim=12, activation='sigmoid'))
WARNING:tensorflow:From E:\MyProgram\Anaconda\envs\krs\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 1)                 13        
=================================================================
Total params: 13
Trainable params: 13
Non-trainable params: 0
_________________________________________________________________
# 编译模型
model.compile(
    optimizer='adam',
    loss='binary_crossentropy', # 这里用二元的交叉熵作为二分类的损失函数
    metrics=['acc'] # 在训练时输出accuracy(精度,即正确率)
)
# 训练模型,从返回值可以获得其训练过程中的一些信息
history = model.fit(x, y, epochs=300, verbose=0) # verbose=0不从std输出,不然导出markdown这块就太长了
WARNING:tensorflow:From E:\MyProgram\Anaconda\envs\krs\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
# 查看保留了哪些数据(该对象的history属性就是一个字典,keys()就是取字典的键)
history.history.keys()
dict_keys(['loss', 'acc'])
绘制训练过程

在字典中保留了loss和acc的变化情况(不是只保留最终值),现在把它们绘制出来看一下。

import matplotlib.pyplot as plt
%matplotlib inline

绘制loss的变化过程:

# 这里300==len(history.history.get('loss')==len(history.history.get('acc')==epochs
plt.plot(range(300),history.history.get('loss'))
[<matplotlib.lines.Line2D at 0x151c0eb8>]

logistic回归 临床预测模型 R logistic回归分析预测_Keras

绘制acc的变化过程:

plt.plot(range(300),history.history.get('acc'))
[<matplotlib.lines.Line2D at 0x1621a5f8>]

logistic回归 临床预测模型 R logistic回归分析预测_缺失值_02

导出预测值以提交到Kaggle
# 读取测试集,并做相同的预处理
df = pd.read_csv("./data/tt_test.csv")
df.head()



PassengerId

Pclass

Name

Sex

Age

SibSp

Parch

Ticket

Fare

Cabin

Embarked

0

892

3

Kelly, Mr. James

male

34.5

0

0

330911

7.8292

NaN

Q

1

893

3

Wilkes, Mrs. James (Ellen Needs)

female

47.0

1

0

363272

7.0000

NaN

S

2

894

2

Myles, Mr. Thomas Francis

male

62.0

0

0

240276

9.6875

NaN

Q

3

895

3

Wirz, Mr. Albert

male

27.0

0

0

315154

8.6625

NaN

S

4

896

3

Hirvonen, Mrs. Alexander (Helga E Lindqvist)

female

22.0

1

1

3101298

12.2875

NaN

S

# 注意这里根本没有预测值(Survived列),所以就不用考虑跟着取出来它了
xt = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
xt = xt.copy()
xt.loc[:,'Embarked_S']=(xt.Embarked=='S').astype('int')
xt.loc[:,'Embarked_C']=(xt.Embarked=='C').astype('int')
xt.loc[:,'Embarked_Q']=(xt.Embarked=='Q').astype('int')
del xt['Embarked']
xt = pd.get_dummies(xt)
xt['Age'] = xt.Age.fillna(xt.Age.mean())
xt.loc[:,'P1'] = (xt.Pclass==1).astype('int') 
xt.loc[:,'P2'] = (xt.Pclass==2).astype('int') 
xt.loc[:,'P3'] = (xt.Pclass==3).astype('int') 
del xt['Pclass']
xt.head()



Age

SibSp

Parch

Fare

Embarked_S

Embarked_C

Embarked_Q

Sex_female

Sex_male

P1

P2

P3

0

34.5

0

0

7.8292

0

0

1

0

1

0

0

1

1

47.0

1

0

7.0000

1

0

0

1

0

0

0

1

2

62.0

0

0

9.6875

0

0

1

0

1

0

1

0

3

27.0

0

0

8.6625

1

0

0

0

1

0

0

1

4

22.0

1

1

12.2875

1

0

0

1

0

0

0

1

# 计算预测值
predictions = model.predict(xt)
type(predictions)
numpy.ndarray
# 生成提交csv
submission = pd.DataFrame({"PassengerId": df["PassengerId"], "Survived": (predictions.flatten()>0.5).astype('int')})
submission.to_csv("./data/tt_upload.csv", index=False)
E:\MyProgram\Anaconda\envs\krs\lib\site-packages\ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in greater