Logistic regression clinical prediction model in R: logistic regression analysis and prediction

Reposted

风华正茂的AI 2024-08-04 11:37:12


Predicting Titanic survival with logistic regression

Loading the data

import pandas as pd
import keras
from keras import layers
import numpy as np

Using TensorFlow backend.

data = pd.read_csv("./data/tt_train.csv")
data.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

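Analysis

PassengerId is just a row identifier and carries no information about survival, so drop it. Name is of little use here, so drop it too. Ticket is the ticket number and is likewise dropped. Cabin is the cabin number; it probably matters, but it has too many missing values (data.info() shows only 204 non-null entries out of 891), so it is dropped as well. Embarked is the port of embarkation (S, C or Q), which may also be related to survival, so it is kept.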
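The decision to drop Cabin rests on how much of it is missing (687 of 891 rows, about 77%, according to data.info()). A minimal sketch of how that fraction could be computed per column is shown below. The toy frame, its values and the 50% threshold are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic training data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Fraction of missing values per column, largest first
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns whose missing fraction exceeds a hypothetical 50% threshold
too_sparse = missing_ratio[missing_ratio > 0.5].index.tolist()
print(too_sparse)
```

Running the same two lines on the real data frame would rank the columns by sparsity, which makes the choice between dropping a column and filling it more explicit.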

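Data preprocessing

Drop the unwanted features, convert non-numeric features to numbers, and handle missing values.

# List all column names, which makes it easy to copy the ones we need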
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

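Now select the useful columns. The DataFrame x holds the features, but for now it also carries the target column Survived: preprocessing comes next, and if some rows were dropped along the way, the labels would have to be dropped with them. Once preprocessing is done, y is extracted and the target column is removed from x.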

x = data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked']]
x.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	0	3	male	22.0	1	0	7.2500	S
1	1	1	female	38.0	1	0	71.2833	C
2	1	3	female	26.0	0	0	7.9250	S
3	1	1	female	35.0	1	0	53.1000	S
4	0	3	male	35.0	0	0	8.0500	S
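The reason for keeping Survived next to the features can be seen in a small sketch: if rows are dropped while the label is still a column, features and labels stay aligned automatically. The toy data below is made up, and dropna() is used only as an example of a row-removing step; the notebook itself fills missing values instead.

```python
import numpy as np
import pandas as pd

# Toy data: the label travels with the features (values are made up)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# Dropping rows while the label is still a column removes the label too
cleaned = toy.dropna()

# Split into label and features only after the row-level cleaning
y_toy = cleaned["Survived"]
x_toy = cleaned.drop(columns=["Survived"])
print(x_toy.shape, y_toy.shape)
```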

# Now handle Embarked (port of embarkation); first check which values it takes
x.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

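There are 3 distinct values (not counting the missing value), so a 3-dimensional one-hot encoding is used here.
Pay attention to the trick used to build the one-hot columns.

# Comparing against one category yields True/False for every sample
(x.Embarked=='S').head() # the full output is too long, so show only the first 5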

0     True
1    False
2     True
3     True
4     True
Name: Embarked, dtype: bool

# Convert to 1/0
(x.Embarked=='S').astype('int').head()

0    1
1    0
2    1
3    1
4    1
Name: Embarked, dtype: int32

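# Copy before adding columns; otherwise pandas raises SettingWithCopyWarning
x=x.copy()
# Add the one-hot columns as shown above; 3 dimensions means three new feature columns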
x.loc[:,'Embarked_S']=(x.Embarked=='S').astype('int')
x.loc[:,'Embarked_C']=(x.Embarked=='C').astype('int')
x.loc[:,'Embarked_Q']=(x.Embarked=='Q').astype('int')
x.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked	Embarked_S	Embarked_C	Embarked_Q
0	0	3	male	22.0	1	0	7.2500	S	1	0	0
1	1	1	female	38.0	1	0	71.2833	C	0	1	0
2	1	3	female	26.0	0	0	7.9250	S	1	0	0
3	1	1	female	35.0	1	0	53.1000	S	1	0	0
4	0	3	male	35.0	0	0	8.0500	S	1	0	0

Note that under this encoding a missing value (NaN) becomes [0,0,0], because NaN compares unequal to every category.
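The all-zero encoding of NaN can be checked with a short sketch. It uses pd.get_dummies on a made-up Series rather than the manual comparisons above, and also shows the dummy_na option, which is an alternative the original does not use.

```python
import numpy as np
import pandas as pd

# Made-up Series with the three ports and one missing value
embarked = pd.Series(["S", "C", "Q", np.nan], name="Embarked")

# By default the NaN row gets 0 in every indicator column
onehot = pd.get_dummies(embarked, prefix="Embarked", dtype=int)
print(onehot)

# dummy_na=True adds an explicit indicator column for missing values
onehot_na = pd.get_dummies(embarked, prefix="Embarked", dummy_na=True, dtype=int)
print(onehot_na.columns.tolist())
```

Whether a separate "missing" column is worth having depends on whether missingness itself is informative; with only two missing Embarked values in the training set, the all-zero encoding is a reasonable simplification.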

# Drop the original Embarked column
del x['Embarked']
x.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked_S	Embarked_C	Embarked_Q
0	0	3	male	22.0	1	0	7.2500	1	0	0
1	1	1	female	38.0	1	0	71.2833	0	1	0
2	1	3	female	26.0	0	0	7.9250	1	0	0
3	1	1	female	35.0	1	0	53.1000	1	0	0
4	0	3	male	35.0	0	0	8.0500	1	0	0

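Only the Sex column is still non-numeric. It also needs one-hot encoding, and this is a good place to try another approach: pd.get_dummies(), which one-hot encodes the non-numeric (object or category) columns of the DataFrame passed to it. (In pandas 2.0 and later the new columns default to bool rather than the 0/1 integers shown in the output here; passing dtype=int gives 0/1.)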

x=pd.get_dummies(x)
x.head()

	Survived	Pclass	Age	SibSp	Parch	Fare	Embarked_S	Embarked_C	Embarked_Q	Sex_female	Sex_male
0	0	3	22.0	1	0	7.2500	1	0	0	0	1
1	1	1	38.0	1	0	71.2833	0	1	0	1	0
2	1	3	26.0	0	0	7.9250	1	0	0	1	0
3	1	1	35.0	1	0	53.1000	1	0	0	1	0
4	0	3	35.0	0	0	8.0500	1	0	0	0	1

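Something to think about: when can a two-valued field be made numeric without one-hot encoding? When it has no missing values. Sex has none in the current sample set x, so it could be converted directly to 0/1. But can we guarantee that the validation set, the test set and all future samples have no missing values either? (For Titanic we can, because the event is in the past and no new samples will ever be added, but that is a peculiarity of this dataset's background.)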
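The difference between the two encodings when a value is missing can be sketched as follows. The Series is made up, and the 0/1 assignment (female=0, male=1) is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up Sex column with one missing entry
sex = pd.Series(["male", "female", np.nan], name="Sex")

# Single 0/1 column: the missing entry stays NaN and must be handled separately
sex_binary = sex.map({"female": 0, "male": 1})
print(sex_binary.tolist())

# One-hot columns: the missing entry is simply all zeros
sex_onehot = pd.get_dummies(sex, prefix="Sex", dtype=int)
print(sex_onehot)
```

With the single column, a NaN would propagate into the model input; with one-hot columns, the missing case is a valid (all-zero) input.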

x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Age           714 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_S    891 non-null int32
Embarked_C    891 non-null int32
Embarked_Q    891 non-null int32
Sex_female    891 non-null uint8
Sex_male      891 non-null uint8
dtypes: float64(2), int32(3), int64(4), uint8(2)
memory usage: 54.0 KB

# Handle missing Age values: fill with the mean
x['Age'] = x.Age.fillna(x.Age.mean())
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_S    891 non-null int32
Embarked_C    891 non-null int32
Embarked_Q    891 non-null int32
Sex_female    891 non-null uint8
Sex_male      891 non-null uint8
dtypes: float64(2), int32(3), int64(4), uint8(2)
memory usage: 54.0 KB
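Mean imputation can be sketched on toy data as below. One common variant, shown here as an assumption rather than as what this notebook does, is to compute the mean on the training set once and reuse it for the test set, so that both sets are filled with the same value.

```python
import numpy as np
import pandas as pd

# Made-up ages for a training and a test split
train_age = pd.Series([22.0, np.nan, 26.0, 36.0], name="Age")
test_age = pd.Series([np.nan, 47.0], name="Age")

# Series.mean() skips NaN by default, so this is the mean of the observed ages
age_mean = train_age.mean()

train_filled = train_age.fillna(age_mean)
# Reuse the training mean for the test set so both are filled with the same value
test_filled = test_age.fillna(age_mean)
print(age_mean, test_filled.tolist())
```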

x.head()

	Survived	Pclass	Age	SibSp	Parch	Fare	Embarked_S	Embarked_C	Embarked_Q	Sex_female	Sex_male
0	0	3	22.0	1	0	7.2500	1	0	0	0	1
1	1	1	38.0	1	0	71.2833	0	1	0	1	0
2	1	3	26.0	0	0	7.9250	1	0	0	1	0
3	1	1	35.0	1	0	53.1000	1	0	0	1	0
4	0	3	35.0	0	0	8.0500	1	0	0	0	1

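Now look at the Pclass column, the ticket class. Its values are numeric, but they do not carry the ratio or additive relationships that real numbers do (imagine explaining to a computer that "a third-class ticket is not three times a first-class ticket, even though 3 is three times 1"). It is essentially a categorical feature, so it is one-hot encoded here as well.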

x.loc[:,'P1'] = (x.Pclass==1).astype('int') 
x.loc[:,'P2'] = (x.Pclass==2).astype('int') 
x.loc[:,'P3'] = (x.Pclass==3).astype('int') 
del x['Pclass']
x.head()

	Survived	Age	SibSp	Parch	Fare	Embarked_S	Embarked_C	Embarked_Q	Sex_female	Sex_male	P1	P2	P3
0	0	22.0	1	0	7.2500	1	0	0	0	1	0	0	1
1	1	38.0	1	0	71.2833	0	1	0	1	0	1	0	0
2	1	26.0	0	0	7.9250	1	0	0	1	0	0	0	1
3	1	35.0	1	0	53.1000	1	0	0	1	0	1	0	0
4	0	35.0	0	0	8.0500	1	0	0	0	1	0	0	1
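The same encoding of Pclass can also be obtained with pd.get_dummies by naming the column explicitly, since numeric columns are not encoded by default. This is a sketch on made-up data; the prefix "P" with an empty separator is chosen only to reproduce the P1/P2/P3 names used above.

```python
import pandas as pd

# Made-up rows with a numeric class column
toy = pd.DataFrame({"Pclass": [3, 1, 3, 2], "Fare": [7.25, 71.28, 7.93, 13.0]})

# columns= forces one-hot encoding of a numeric column;
# prefix and prefix_sep control the names of the new columns
encoded = pd.get_dummies(toy, columns=["Pclass"], prefix="P", prefix_sep="", dtype=int)
print(encoded)
```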

# Preprocessing is done: extract the target from x (so it stays aligned with the rows of x) and remove it from x
y = x.Survived
del x['Survived']
x.shape, y.shape

((891, 12), (891,))

Building and training the model

# Sequential model
model = keras.Sequential()
# Fully connected layer: 1 output, 12 inputs, sigmoid activation
model.add(layers.Dense(1, input_dim=12, activation='sigmoid'))

WARNING:tensorflow:From E:\MyProgram\Anaconda\envs\krs\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 1)                 13        
=================================================================
Total params: 13
Trainable params: 13
Non-trainable params: 0
_________________________________________________________________
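The 13 in the Param # column is 12 weights plus 1 bias: a single Dense unit with a sigmoid activation computes p = sigmoid(x·w + b), which is exactly logistic regression. A minimal NumPy sketch of that forward pass follows; the random weights and inputs are placeholders, not the trained model.

```python
import numpy as np

def sigmoid(z):
    # Logistic function, maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features = 12

# One weight per input feature plus one bias: 12 + 1 = 13 parameters
w = rng.normal(size=(n_features, 1))
b = np.zeros(1)
n_params = w.size + b.size

# Forward pass for a batch of 5 hypothetical samples: p = sigmoid(x @ w + b)
x_batch = rng.normal(size=(5, n_features))
p = sigmoid(x_batch @ w + b)
print(n_params, p.shape)
```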

# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy', # binary cross-entropy as the loss for binary classification
    metrics=['acc'] # report accuracy (fraction of correct predictions) during training
)
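For reference, the loss and metric named in compile() can be written out in a few lines of NumPy. This is a sketch of the standard definitions, not Keras's internal implementation; the clipping constant eps and the 0.5 threshold are illustrative choices.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0); eps is an illustrative choice
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def accuracy(y_true, y_prob, threshold=0.5):
    # Fraction of samples whose thresholded prediction matches the label
    return float(np.mean((y_prob > threshold).astype(int) == y_true))

# Made-up labels and predicted probabilities
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.6])

bce = binary_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(bce, acc)
```

The two confident, correct predictions contribute little to the loss, while the two predictions on the wrong side of 0.5 dominate it and also halve the accuracy.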

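# Train the model; the return value records information about the training run
history = model.fit(x, y, epochs=300, verbose=0) # verbose=0 suppresses per-epoch output, which would make the exported markdown far too long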

WARNING:tensorflow:From E:\MyProgram\Anaconda\envs\krs\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.

# See what was recorded (the object's history attribute is a dict; keys() returns its keys)
history.history.keys()

dict_keys(['loss', 'acc'])

Plotting the training process

The dict stores how loss and acc changed over training (not just their final values). Let's plot them.

import matplotlib.pyplot as plt
%matplotlib inline

Plot the loss over training:

# Here 300 == len(history.history.get('loss')) == len(history.history.get('acc')) == epochs
plt.plot(range(300),history.history.get('loss'))

[<matplotlib.lines.Line2D at 0x151c0eb8>]

[Figure: loss over the 300 training epochs]

Plot the accuracy over training:

plt.plot(range(300),history.history.get('acc'))

[<matplotlib.lines.Line2D at 0x1621a5f8>]

[Figure: accuracy over the 300 training epochs]

Exporting predictions for submission to Kaggle

# Read the test set and apply the same preprocessing
df = pd.read_csv("./data/tt_test.csv")
df.head()

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S

# Note that the test set has no target column (Survived), so there is nothing to carry along
xt = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
xt = xt.copy()
xt.loc[:,'Embarked_S']=(xt.Embarked=='S').astype('int')
xt.loc[:,'Embarked_C']=(xt.Embarked=='C').astype('int')
xt.loc[:,'Embarked_Q']=(xt.Embarked=='Q').astype('int')
del xt['Embarked']
xt = pd.get_dummies(xt)
xt['Age'] = xt.Age.fillna(xt.Age.mean())
xt.loc[:,'P1'] = (xt.Pclass==1).astype('int') 
xt.loc[:,'P2'] = (xt.Pclass==2).astype('int') 
xt.loc[:,'P3'] = (xt.Pclass==3).astype('int') 
del xt['Pclass']
xt.head()

	Age	SibSp	Parch	Fare	Embarked_S	Embarked_C	Embarked_Q	Sex_female	Sex_male	P1	P2	P3
0	34.5	0	0	7.8292	0	0	1	0	1	0	0	1
1	47.0	1	0	7.0000	1	0	0	1	0	0	0	1
2	62.0	0	0	9.6875	0	0	1	0	1	0	1	0
3	27.0	0	0	8.6625	1	0	0	0	1	0	0	1
4	22.0	1	1	12.2875	1	0	0	1	0	0	0	1
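Since the training and test sets go through the same steps, the preprocessing could be collected into one function and applied to both. The sketch below is one way to do that; the helper name preprocess, the age_fill parameter and the toy frames are hypothetical. It uses explicit comparisons for every category instead of pd.get_dummies, so the output columns do not depend on which categories happen to appear in a given file.

```python
import numpy as np
import pandas as pd

FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

def preprocess(df, age_fill):
    """Hypothetical helper mirroring the manual steps above."""
    out = df[FEATURES].copy()
    # One-hot encode Embarked, Sex and Pclass with explicit comparisons
    for port in ["S", "C", "Q"]:
        out["Embarked_" + port] = (out["Embarked"] == port).astype(int)
    out["Sex_female"] = (out["Sex"] == "female").astype(int)
    out["Sex_male"] = (out["Sex"] == "male").astype(int)
    for cls in [1, 2, 3]:
        out["P" + str(cls)] = (out["Pclass"] == cls).astype(int)
    # Fill missing ages with the value supplied by the caller
    out["Age"] = out["Age"].fillna(age_fill)
    return out.drop(columns=["Embarked", "Sex", "Pclass"])

# Made-up rows in the shape of the training and test files
toy_train = pd.DataFrame({
    "Pclass": [3, 1], "Sex": ["male", "female"], "Age": [22.0, np.nan],
    "SibSp": [1, 1], "Parch": [0, 0], "Fare": [7.25, 71.2833], "Embarked": ["S", "C"],
})
toy_test = pd.DataFrame({
    "Pclass": [3], "Sex": ["male"], "Age": [np.nan],
    "SibSp": [0], "Parch": [0], "Fare": [7.8292], "Embarked": ["Q"],
})

age_fill = toy_train["Age"].mean()
x_tr = preprocess(toy_train, age_fill)
x_te = preprocess(toy_test, age_fill)
print(x_tr.columns.tolist() == x_te.columns.tolist())
```

Having a single function makes it harder for the two pipelines to drift apart, which matters because the model expects its 12 inputs in the same order at prediction time.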

# Compute predictions
predictions = model.predict(xt)

type(predictions)

numpy.ndarray

# Build the submission CSV
submission = pd.DataFrame({"PassengerId": df["PassengerId"], "Survived": (predictions.flatten()>0.5).astype('int')})
submission.to_csv("./data/tt_upload.csv", index=False)

E:\MyProgram\Anaconda\envs\krs\lib\site-packages\ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in greater
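The RuntimeWarning above ("invalid value encountered in greater") indicates that predictions contains at least one NaN, which usually means a feature passed to the model was still missing. If tt_test.csv is the standard Kaggle Titanic test file (an assumption), the likely cause is Fare, which has a missing entry there and is not filled by the preprocessing above. A sketch of how this could be checked and handled on toy data:

```python
import numpy as np
import pandas as pd

# Toy feature frame with one missing Fare (values are made up)
xt_toy = pd.DataFrame({"Age": [34.5, 47.0, 60.5], "Fare": [7.8292, 7.0, np.nan]})

# Count missing values per column before calling model.predict
print(xt_toy.isnull().sum())

# Fill the remaining gap, here with the column mean as was done for Age
xt_toy["Fare"] = xt_toy["Fare"].fillna(xt_toy["Fare"].mean())

# Stand-in for model.predict output, to show how NaN predictions can be detected
fake_pred = np.array([0.8, np.nan, 0.3])
print(np.isnan(fake_pred).any())
```

A NaN compared with 0.5 evaluates to False, so the affected passenger is silently labeled 0 in the submission; filling the feature beforehand lets the model produce an actual prediction for that row.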
