python clickhouse read pandas clickhouse

Reposted

mob6454cc6a8ab0 2023-12-14 22:22:44

Tags: python, data analysis, ETL, data warehouse data. Category: Python, backend development

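With the data scraped, cleaning can begin. The first step is to decide which fields need processing. Since I plan to run a regression later, I set up my variables as follows: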

[Figure: variable definitions for the regression]

The data before cleaning looks like this:

[Figure: two screenshots of the raw data before cleaning]

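Combining the model's variables with the fields in the data, the cleaning stage needs to accomplish the following tasks:

house_address joins the district, street, and residential complex with hyphens, so it needs to be split into three columns.
house_rental_area stores the area as a string; strip the ㎡ symbol and convert it to a number.
house_layout packs three variables into one string, so it needs to be sliced.
house_floor takes the values basement (地下室), low floor (低楼层), middle floor (中楼层), and high floor (高楼层); convert it to an ordinal variable.
house_rental_price stores the price as a string; strip the unit and convert it to a number.
house_tag only needs to yield two qualitative variables: whether the unit is fully furnished (精装) and whether it is near a subway station (近地铁).
house_elevator, house_heating, and house_electricity are all qualitative variables and need to be coded as 0 and 1. house_water and house_electricity both indicate whether the property is commercial, so only house_electricity is kept.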
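
As a rough illustration of the numeric and ordinal conversions in the list above, the sketch below runs them on a two-row toy table. The sample values, the "低楼层/18层" floor format, and the output column names area_m2, price_yuan, and floor_level are assumptions made for this example, not taken from the scraped data.

```python
import pandas as pd

# Hypothetical sample rows; the values only mimic the field formats described above.
sample = pd.DataFrame({
    "house_rental_area": ["58㎡", "102㎡"],
    "house_rental_price": ["4500元/月", "12000元/月"],
    "house_floor": ["低楼层/18层", "高楼层/6层"],
})

# Pull the leading number out of each string, then convert it to a numeric dtype.
number_pattern = r"(\d+(?:\.\d+)?)"
sample["area_m2"] = pd.to_numeric(sample["house_rental_area"].str.extract(number_pattern)[0])
sample["price_yuan"] = pd.to_numeric(sample["house_rental_price"].str.extract(number_pattern)[0])

# Map the first character of the floor description to an ordinal code.
floor_order = {"地": 0, "低": 1, "中": 2, "高": 3}
sample["floor_level"] = sample["house_floor"].str[0].map(floor_order)

print(sample[["area_m2", "price_yuan", "floor_level"]])
```

Using a dictionary with map keeps the ordering of the floor categories in one visible place; any value outside the dictionary becomes NaN, which makes unexpected categories easy to spot.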

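A few things to note:

The house_layout field uses two formats, “x室x厅x卫” (x bedrooms, x living rooms, x bathrooms) and “x房间x卫” (x rooms, x bathrooms). Inspection shows that “x房间x卫” means there is no living room, so “房间” is first replaced with “室0厅” to make the later slicing uniform.
Some fields contain the placeholder “暂无数据” (no data yet); the rows holding it need to be deleted.
The scraped data contains Chinese characters, so pay attention to the encoding. If one encoding fails, try utf_8_sig or gbk.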
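
The encoding advice above can be sketched as a small helper that tries each encoding in turn. The function name read_csv_with_fallback, the demo file, and the sample value "民水" are hypothetical and exist only for this illustration; it assumes the scraped data was saved as CSV.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf_8_sig", "gbk")):
    """Return the first DataFrame that decodes cleanly with one of the encodings."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    if last_error is None:
        raise ValueError("no encodings were given")
    raise last_error


# Demo: write a GBK-encoded file, then read it back without knowing its encoding.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_gbk.csv")
    pd.DataFrame({"house_water": ["民水", "暂无数据"]}).to_csv(
        demo_path, index=False, encoding="gbk"
    )
    demo = read_csv_with_fallback(demo_path)

print(demo)
```

A fallback like this only catches files that fail to decode; a file that decodes without error but shows garbled characters still needs a manual check.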

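The main functions needed are:

df1['house_tag'].str.contains('精装'): note that this returns boolean values.
df1['room_num'] = df1['house_layout'].str[0:1]: takes the first character as the number of rooms.
df1['house_heating'] = df1['house_heating'].replace(['自采暖','集中供暖'],[0,1]): replaces '自采暖' (self-heating) with the integer 0 and '集中供暖' (central heating) with 1.
df1 = pd.concat([df, df['house_address'].str.split('-', expand=True)], axis=1).drop('house_address', axis=1): splits on '-' into three columns, then drops the house_address column.
df1.rename(columns={0: 'house_district', 1: 'house_street', 2: 'house_apartment_complexes'}, inplace=True): renames the three new columns produced by the split.
df1.drop(df1[df1["house_water"]=='暂无数据'].index, inplace=True): deletes the rows that hold invalid data.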
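
For comparison, here is a minimal sketch of two of the calls above on a toy table, using a boolean mask to drop the placeholder rows and astype(int) to turn the booleans from str.contains into 0/1 directly. The table contents are invented for the example, and the boolean-mask form is an alternative spelling rather than the approach the article's code uses.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the article.
toy = pd.DataFrame({
    "house_tag": ["精装 近地铁", "随时看房", "近地铁"],
    "house_water": ["民水", "暂无数据", "商水"],
})

# Keep only rows whose house_water is known (same effect as dropping the matching index).
toy = toy[toy["house_water"] != "暂无数据"].copy()

# str.contains returns booleans; astype(int) converts them to 0/1 dummy variables.
toy["subway"] = toy["house_tag"].str.contains("近地铁", regex=False).astype(int)
toy["refine"] = toy["house_tag"].str.contains("精装", regex=False).astype(int)

print(toy)
```

Note that astype(int) only works here because house_tag has no missing values; with missing tags, str.contains returns NaN for those rows unless its na argument is set.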

The code is as follows:

import pandas as pd
import numpy as np
# Raw string: otherwise "\f" in the Windows path is read as a form-feed escape
df = pd.read_excel(r"D:\filename.xlsx")
df.shape
df.info()
df.describe()

df1 = pd.concat([df,df['house_address'].str.split('-',expand=True)],axis=1).drop('house_address',axis=1) # split on '-'
df1.rename(columns={0: 'house_district', 1: 'house_street',2: 'house_apartment_complexes'}, inplace=True) # rename the new columns
df1['house_layout']=df1['house_layout'].str.replace('房间', '室0厅') # "x房间x卫" becomes "x室0厅x卫"
print('--------')
#print(df1['house_layout'].str[2:4])
df1.drop(df1[df1["house_water"]=='暂无数据'].index,inplace = True) # delete rows with invalid data
df1.drop(df1[df1["house_electricity"]=='暂无数据'].index,inplace = True)
df1.drop(df1[df1["house_heating"]=='暂无数据'].index,inplace = True)
df1['room_num'] = df1['house_layout'].str[0:1]    # only the bedroom, living room, and bathroom counts are needed
df1['living_room_num'] = df1['house_layout'].str[2:3]
df1['bath_room_num'] = df1['house_layout'].str[4:5]
df1['subway'] = df1['house_tag'].str.contains('近地铁') # True in the new subway column if house_tag mentions '近地铁'
df1['refine'] = df1['house_tag'].str.contains('精装')
df1['house_heating']=df1['house_heating'].replace(['自采暖','集中供暖'],[0,1])
df1['house_gas']=df1['house_gas'].replace(['无','有'],[0,1])
df1['house_electricity'] = df1['house_electricity'].replace(['商电','民电'],[0,1])
# rstrip removes any trailing characters from the given set, which is enough to strip the units here
df1['house_rental_area']=df1['house_rental_area'].str.rstrip('㎡')
df1['house_rental_price']=df1['house_rental_price'].str.rstrip('元/月')
df1['house_rental_area']= pd.to_numeric(df1['house_rental_area']) # convert strings to numbers
df1['house_rental_price']= pd.to_numeric(df1['house_rental_price'])
df1['house_floor1'] = df1['house_floor'].str[0:1] # the first character alone is enough to classify the floor
df1['house_floor1'] = df1['house_floor1'].replace(['地','低','中','高'],[0,1,2,3])
#print("Datatype of Cost column after type conversion:")
#print(df1['house_rental_area'].dtypes)
df1=df1.replace([True,False],[1,0])  # turn the booleans above into 0/1 for the regression
df1.to_csv(r"D:\filename.csv",encoding="utf_8_sig")
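
The positional slices str[0:1], str[2:3], and str[4:5] in the code above assume every count is a single digit and leave the results as strings. A regular-expression variant, sketched below, avoids both limits. The sample layouts are made up, and using str.extract is a swapped-in technique, not what the article's code does.

```python
import pandas as pd

# Hypothetical layouts, including the "x房间x卫" form and a two-digit room count.
layouts = pd.Series(["2室1厅1卫", "3房间2卫", "10室2厅3卫"], name="house_layout")

# Normalise the "x房间x卫" form first, as the article does.
normalised = layouts.str.replace("房间", "室0厅", regex=False)

# Named groups become column names; each group captures one or more digits.
pattern = r"(?P<room_num>\d+)室(?P<living_room_num>\d+)厅(?P<bath_room_num>\d+)卫"
counts = normalised.str.extract(pattern).astype(int)

print(counts)
```

The resulting three integer columns could be attached to the main table with pd.concat along axis=1, in the same way the split address columns are attached.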

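Some fields of the cleaned data are shown below:

In short, data cleaning means analyzing the specific fields and data types of the data you actually have, then consulting the pandas documentation whenever you need to.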

