Titanic Data Mining Analysis Report: Titanic Data Analysis with Python

Reposted

mob6454cc63f2dd 2024-06-11 22:20:05



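Titanic

Reading the data

1. Reading the data
2. Reading csv/excel/txt files

Data visualization analysis

Data analysis

1. Data processing: feature engineering
2. Linear regression
3. Logistic regression
4. Random forest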

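Titanic

A hands-on walkthrough of the Kaggle Titanic project, with a summary of the related concepts

Reading the data

1. Reading the data

pandas is a widely used Python data-processing package; it can read a csv file into a DataFrame.
Detailed documentation for the pandas library: https://www.pypandas.cn/docs/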

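import pandas
titanic = pandas.read_csv("train.csv")
# The argument to head() is the number of rows to return; the default is 5
head3 = titanic.head(3)
print(head3)
# Descriptive statistics for the numeric columns: count, mean, min, max, etc.
print(titanic.describe())
# Column names, dtypes and non-null counts.
# info() prints its report directly and returns None, so it is not wrapped in print()
titanic.info()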

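The data has 12 columns. The Survived field indicates whether the passenger survived; the remaining columns are passenger details:

PassengerId => Passenger ID

Pclass => Passenger class (1st/2nd/3rd class)

Name => Passenger name

Sex => Sex

Age => Age

SibSp => Number of siblings/spouses aboard

Parch => Number of parents/children aboard

Ticket => Ticket information

Fare => Ticket fare

Cabin => Cabin

Embarked => Port of embarkation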

2. Reading csv/excel/txt files

Excel and csv
Other ways to read data with pandas: https://www.jianshu.com/p/0fd5551bac37
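
As a minimal sketch of the other input formats: `read_csv` also handles delimited .txt files through its `sep` parameter, and `pandas.read_excel` reads spreadsheets. The in-memory sample text below is made up for illustration and stands in for real files; the Excel call is shown only as a comment because it needs an Excel engine such as openpyxl installed, and its file name is hypothetical.

```python
import io
import pandas as pd

# Made-up sample contents standing in for real files on disk
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
txt_text = "PassengerId\tSurvived\tPclass\n1\t0\t3\n2\t1\t1\n"

# Comma-separated values: the default separator is ","
df_csv = pd.read_csv(io.StringIO(csv_text))

# Tab-delimited text: same function, with an explicit separator
df_txt = pd.read_csv(io.StringIO(txt_text), sep="\t")

# Excel workbooks use a separate function (hypothetical file name):
# df_xlsx = pd.read_excel("train.xlsx", sheet_name=0)

# Both readers should produce the same table
print(df_csv.equals(df_txt))
```

With real files, pass the path instead of the `StringIO` object, for example `pd.read_csv("train.csv")`.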

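Data visualization analysis

Use plots to get a first impression of the data and of how each feature relates to survival

Plots

Relationship between a single feature and the survival rate
1. Passenger class (Pclass) vs. Survived: the survival rate within each class
2. Male-to-female ratio among survivors (pie chart)
3. Histogram of age over all passengers, plus separate age distributions for survivors and non-survivors (with Survived on the x-axis)
4. Number of siblings/spouses and parents/children (SibSp/Parch): same as above, or with the count on the x-axis
5. Fare
6. Port of embarkation

Relationships within the data
Age distribution per passenger class (three curves, one per class, with age on the x-axis)
Port of embarkation vs. fare / passenger class
Family size vs. survival rate
Combined effect of passenger class and sex on the survival rate

matplotlib tutorial: https://www.ctolib.com/docs/sfile/matplotlib-intro/index.html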
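
The numbers behind several of these plots can be computed with a group-by before any drawing happens. The sketch below is one possible approach, not the original author's code: the toy rows are made up for illustration, and `save_bar_chart` is a hypothetical helper that imports matplotlib only when it is called.

```python
import pandas as pd

# Made-up toy rows for illustration; replace with the real `titanic` DataFrame
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survived is 0/1, so its mean within a group is that group's survival rate
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Two features at once: rows are classes, columns are sexes
rate_by_class_sex = toy.pivot_table(
    index="Pclass", columns="Sex", values="Survived", aggfunc="mean"
)


def save_bar_chart(rates, path):
    """Draw survival rates as a bar chart and save it to `path` (hypothetical helper)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    rates.plot(kind="bar", ax=ax)
    ax.set_ylabel("Survival rate")
    fig.savefig(path)
    plt.close(fig)


print(rate_by_class)
print(rate_by_class_sex)
```

Calling `save_bar_chart(rate_by_class, "pclass.png")` would write the first chart to disk; the two-feature table can be passed the same way to get grouped bars.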

Data analysis

1. Data processing: feature engineering

Filling missing values

mage = titanic["Age"].median()
titanic["Age"] = titanic["Age"].fillna(mage)
# Replace missing Age values with the median
print(titanic.describe())
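
Before filling, it helps to count how many values are actually missing, and to see why the median can be a safer fill value than the mean. The ages below are made up for illustration, with one deliberate outlier.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries and one large outlier
ages = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 80.0]})

# Count missing values per column before filling
print(ages.isnull().sum())

median_age = ages["Age"].median()  # middle value, less sensitive to the outlier
mean_age = ages["Age"].mean()      # pulled upward by the outlier

# Fill with the median and confirm nothing is left missing
filled = ages["Age"].fillna(median_age)
print(median_age, mean_age, filled.isnull().sum())
```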

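Converting strings to integers

print(titanic["Sex"].unique())
# unique() drops duplicate values from a Series and returns the distinct
# values as a NumPy array, in order of first appearance (not sorted)

# print(titanic["Sex"])
# Selecting a single column returns a Series

print(type(titanic["Sex"]))
# .unique() is a method, so it needs parentheses; it returns only the distinct values
# .values is an attribute, not a method, so it is written without parentheses
# print(titanic["Sex"].values)
# Writing .values() raises a TypeError, because the returned array is not callable
# This is where a Series differs from a dict, whose values() is a method

titanic.loc[titanic["Sex"] == "male","Sex"] = 0
titanic.loc[titanic["Sex"] == "female","Sex"] = 1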
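
An alternative to the two `loc` assignments is a single `map` call with a dict. This is a sketch on a made-up Series, not the original author's method; one difference is that the result gets an integer dtype directly.

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])  # made-up sample

# map() looks each value up in the dict; values missing from the dict become NaN
sex_codes = sex.map({"male": 0, "female": 1})

print(sex_codes.tolist())
print(sex_codes.dtype)
```

On the real data this would read `titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})`.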

Filling missing values and converting to integers

print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna("S")
# A categorical column has no mean, so fill with the most frequent value instead
titanic.loc[titanic["Embarked"] == "S","Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C","Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q","Embarked"] = 2
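
Rather than hard-coding "S", the most frequent value can be looked up from the data. The sketch below uses a made-up Series to show the idea.

```python
import numpy as np
import pandas as pd

# Made-up sample with two missing entries
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# value_counts() ignores NaN by default and shows how often each port appears
print(embarked.value_counts())

# mode() returns the most frequent value(s); take the first one
most_common = embarked.mode()[0]

# Fill the gaps, then encode the ports as integers
filled = embarked.fillna(most_common)
codes = filled.map({"S": 0, "C": 1, "Q": 2})
print(most_common, codes.tolist())
```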

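2. Linear regression

# Binary classification with linear regression

# scikit-learn: a Python machine learning library
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]

# Instantiate the model
alg = LinearRegression()

# Split the samples into 3 equal parts for 3-fold cross-validation.
# random_state only has an effect when shuffle=True, and recent scikit-learn
# versions raise a ValueError if it is set with shuffle=False, so it is omitted.
kf = KFold(n_splits = 3,shuffle = False)

predictions = []
# Each iteration yields the row indices of the training fold and the test fold
for train,test in kf.split(titanic):
    # Feature values for the training fold
    train_predictors = (titanic[predictors].iloc[train,:])
    # Labels for the training fold
    # A single column is a Series, so iloc takes only one indexer
    train_target = titanic["Survived"].iloc[train]
    # Train the model
    alg.fit(train_predictors,train_target)
    # Predict on the held-out fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    # Collect the predictions for this fold
    predictions.append(test_predictions)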
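
To make the loop above easier to follow, here is a small sketch of what `kf.split` yields. It uses six dummy samples rather than the Titanic data; with `shuffle=False` the folds are consecutive blocks of row positions.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(6, 1)  # six dummy samples

kf = KFold(n_splits=3, shuffle=False)

# Each split is a pair of integer position arrays: (train indices, test indices)
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row appears in exactly one test fold, which is why concatenating the per-fold predictions in order gives one prediction per row of the original data.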

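Computing the accuracy

import numpy as np
# Concatenate the per-fold arrays into a single 1-D array
predictions = np.concatenate(predictions,axis=0)

# Map the regression outputs to class labels, then compute the accuracy
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0

# Fraction of predictions that match the true labels
accuracy = sum(predictions == titanic["Survived"])/len(predictions)
# predictions == titanic["Survived"] is boolean; each True counts as 1

print(accuracy)
# For a binary problem, random guessing already gives about 50% accuracy

Output: 0.7833894500561167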
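
The 50% floor assumes balanced classes; when one class is more common, always predicting that class sets a higher baseline. The sketch below shows the threshold-and-compare step on made-up numbers, together with that majority-class baseline.

```python
import numpy as np

# Made-up labels and raw regression outputs, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
raw = np.array([0.2, 0.6, 0.7, 0.4, 0.1, 0.9, 0.3, 0.5])

# Threshold at 0.5: values above it become 1, the rest 0
y_pred = (raw > 0.5).astype(int)

# The mean of a boolean array is the fraction of True entries
accuracy = np.mean(y_pred == y_true)

# Baseline: always predict the most common class
majority_baseline = max(np.mean(y_true == 0), np.mean(y_true == 1))

print(accuracy, majority_baseline)
```

A model is only useful to the extent that its accuracy exceeds this baseline.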

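3. Logistic regression

# Logistic regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

alg = LogisticRegression(random_state = 1)

scores = cross_val_score(alg,titanic[predictors],titanic["Survived"],cv = 3)
print(scores.mean())

Output: 0.7957351290684623

# The score above comes from the validation folds of cross-validation on the
# training data; the final predictions should be made on the test set
titanic_test = pandas.read_csv("test.csv")
# Apply the same preprocessing steps as above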
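
One way to keep the test-set preprocessing identical to the training steps is to wrap them in a function. The sketch below is an assumption about how that could look, not the original author's code: `preprocess` is a hypothetical helper, the rows are made up, only a subset of the predictors is used, and `max_iter=1000` is added to give the solver room to converge.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked"]


def preprocess(df, age_median, fare_median):
    """Hypothetical helper: apply the same cleaning steps to any split."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_median)
    out["Fare"] = out["Fare"].fillna(fare_median)
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out


# Made-up rows standing in for train.csv and test.csv
train = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
    "Age":      [38.0, 22.0, np.nan, 35.0, 58.0, 20.0],
    "Fare":     [71.3, 7.25, 13.0, 8.05, 26.55, 8.05],
    "Embarked": ["C", "S", "S", np.nan, "S", "Q"],
    "Survived": [1, 0, 1, 0, 1, 0],
})
test = pd.DataFrame({
    "Pclass":   [3, 1],
    "Sex":      ["male", "female"],
    "Age":      [np.nan, 47.0],
    "Fare":     [np.nan, 52.0],
    "Embarked": ["Q", "S"],
})

# Fill values are computed from the training data only, then reused for the test set
age_median = train["Age"].median()
fare_median = train["Fare"].median()
train_clean = preprocess(train, age_median, fare_median)
test_clean = preprocess(test, age_median, fare_median)

alg = LogisticRegression(random_state=1, max_iter=1000)
alg.fit(train_clean[predictors], train_clean["Survived"])
test_predictions = alg.predict(test_clean[predictors])
print(test_predictions)
```

Reusing the training medians avoids letting information from the test set influence the preprocessing.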

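4. Random forest

# Random forest
# Samples are drawn with replacement, and features are chosen at random (the number can be specified)
# It builds many decision trees; this shows which factors influence the result most,
# helps guard against overfitting, and helps identify features that hurt the model
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]

alg = RandomForestClassifier(random_state=1,
                          n_estimators=10,  # number of decision trees
                          min_samples_split=2,
                          min_samples_leaf=1)


# random_state is omitted: it has no effect with shuffle=False, and recent
# scikit-learn versions raise a ValueError for that combination
kf = KFold(n_splits=3,shuffle=False)
scores = cross_val_score(alg,titanic[predictors],titanic["Survived"],cv = kf)
print(scores.mean())

Output: 0.7856341189674523. The result is not ideal, so the parameters need tuning.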

alg = RandomForestClassifier(random_state=1,
                             n_estimators=100,
                             min_samples_split=4,
                             min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
 
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

Output: 0.8148148148148148
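
Instead of changing the parameters by hand, the search can be automated. The sketch below uses scikit-learn's `GridSearchCV` on synthetic stand-in data, so its scores say nothing about the Titanic data; the parameter grid is an illustrative assumption, not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data; with the real data, use
# titanic[predictors] and titanic["Survived"] instead
X, y = make_classification(n_samples=200, n_features=7, random_state=1)

# Candidate values to try for each parameter (illustrative choices)
param_grid = {
    "n_estimators": [10, 50],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}

# Every combination is scored with the same 3-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=KFold(n_splits=3, shuffle=False),
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the best score is chosen from many cross-validated candidates, it tends to be slightly optimistic; a held-out test set gives a fairer final estimate.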
