Introduction
If Hive is the Dragon-Slaying Saber of big data, then pandas is the Heaven-Reliant Sword, helping us with data mining, data analysis, and data cleaning.
This post covers basic Pandas syntax and some usage tips. Worth bookmarking!
Contents
1. Data preparation
2. DataFrame basics
2.1 Viewing
2.2 Modifying
2.3 Filtering
2.4 Sorting
2.5 Distinct counts
2.6 Aggregation
2.7 Joining
2.8 Custom functions
2.9 Index operations
2.10 Handling nulls
2.11 to_csv: writing to a CSV file
3. Series basics
1. Data preparation
Run the following script in a Python IDE:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [['A10', 'Sone', 2, '20200801'], ['A10', 'welsh', 3, '20200801'], ['A10', 'Sone', 16, '20200801'],
     ['A10', 'Albert', 20, '20200802'], ['A10', 'GG', 32, '20200801'], ['A20', 'Albert', 42, '20200801'],
     ['A20', 'welsh', 10, '20200801'], ['A20', 'welsh', 15, '20200802'], ['A10', 'Albert', 20, '20200801'],
     ['A20', 'Sone', np.nan, '20200802'], ['A20', 'welsh', 15, '20200802'], ['A20', 'Albert', 10, '20200802'],
     ['A10', 'Jojo', 16, '20200802'], ['A20', 'welsh', 35, '20200803'], ['A10', 'welsh', 33, '20200803'],
     ['A20', 'Sone', 66, '20200803'], ['A20', 'Jojo', 15, '20200802'], ['A10', 'Albert', 53, '20200803'],
     ['A10', 'Jojo', 12, '20200803'], ['A20', 'GG', 35, '20200803'], ['A20', 'J.K', 30, '20200803']],
    index=[[x for x in range(21)]], columns=['site_id', 'user_name', 'pv', 'dt'])

site = pd.DataFrame(
    [['A02', '北京东直门'], ['A10', '北京朝阳门店'], ['A06', '北京六里桥店'], ['A20', '北京西黄村店']],
    index=[[x for x in range(4)]], columns=['site_id', 'site_name'])
Data preview:
site_id | user_name | pv | dt |
---|---|---|---|
A10 | Sone | 2 | 20200801 |
A10 | welsh | 3 | 20200801 |
A10 | Sone | 16 | 20200801 |
A10 | Albert | 20 | 20200802 |
A10 | GG | 32 | 20200801 |
A20 | Albert | 42 | 20200801 |
A20 | welsh | 10 | 20200801 |
A20 | welsh | 15 | 20200802 |
A10 | Albert | 20 | 20200801 |
A20 | Sone | NaN | 20200802 |
A20 | welsh | 15 | 20200802 |
A20 | Albert | 10 | 20200802 |
A10 | Jojo | 16 | 20200802 |
A20 | welsh | 35 | 20200803 |
A10 | welsh | 33 | 20200803 |
A20 | Sone | 66 | 20200803 |
A20 | Jojo | 15 | 20200802 |
A10 | Albert | 53 | 20200803 |
A10 | Jojo | 12 | 20200803 |
A20 | GG | 35 | 20200803 |
A20 | J.K | 30 | 20200803 |
Store preview:
site_id | site_name |
---|---|
A02 | 北京东直门 |
A10 | 北京朝阳门店 |
A06 | 北京六里桥店 |
A20 | 北京西黄村店 |
2. DataFrame basics
A Pandas DataFrame is essentially a two-dimensional array made up of columns, rows and an index, similar in structure to a MySQL table.
This section covers the basic syntax for viewing, modifying, filtering, sorting, aggregating, joining and handling nulls.
2.1 Viewing
- columns: get the column names
df.columns
# Output:
Index(['site_id', 'user_name', 'pv', 'dt'], dtype='object')
- index: get the index
df.index
# Output:
MultiIndex([( 0,),
( 1,),
( 2,),
( 3,),
...
(19,),
(20,)],
)
- values: get the underlying data
df.values
# Output:
array([['A10', 'Sone', 2, '20200801'],
['A10', 'welsh', 3, '20200801'],
['A10', 'Sone', 16, '20200801'],
['A10', 'Albert', 20, '20200802'],
...
['A10', 'Jojo', 12, '20200803'],
['A20', 'GG', 35, '20200803'],
['A20', 'J.K', 30, '20200803']], dtype=object)
- dtypes: check column types
df.dtypes
# Output:
site_id object
user_name object
pv object
dt object
dtype: object
Note: when joining two tables you usually need to check that the two join columns have the same dtype; if not, convert with astype, e.g. df["dt"] = df["dt"].astype("int64")
- head: view the first rows
df.head(2) # show the first 2 rows
# Output:
site_id user_name pv dt
0 A10 Sone 2 20200801
1 A10 welsh 3 20200801
- df.xx / loc: view columns
df.user_name # show a single column
# Output:
0 Sone
1 welsh
2 Sone
...
18 Jojo
19 GG
20 J.K
Name: user_name, dtype: object
df.loc[:,['user_name','pv']] # show multiple columns
# Output:
user_name pv
0 Sone 2
1 welsh 3
2 Sone 16
...
19 GG 35
20 J.K 30
- iloc: view rows
df.iloc[[0,1,8],] # show the rows with index 0, 1 and 8
# Output:
site_id user_name pv dt
0 A10 Sone 2 20200801
1 A10 welsh 3 20200801
8 A10 Albert 20 20200801
- shape: row and column counts
df.shape # 21 rows, 4 columns
# Output:
(21, 4)
- count: count values in a column
df.pv.count()
# Output:
20
Note: the total returned by count() does not include NaN values.
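To make the NaN behavior concrete, here is a minimal sketch (using only the df built above) contrasting count() with size, which does include NaN:
df.pv.count()         # 20: NaN excluded
df.pv.size            # 21: NaN included
df.pv.isnull().sum()  # 1: the missing value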
2.2 Modifying
- rename: rename a column
df.rename(columns={'pv': 'page_view'})
# Output:
site_id user_name page_view dt
0 A10 Sone 2.0 20200801
1 A10 welsh 3.0 20200801
2 A10 Sone 16.0 20200801
...
19 A20 GG 35.0 20200803
20 A20 J.K 30.0 20200803
Note: you need to assign the result back for the change to take effect on the original table: df = df.rename(columns={'pv': 'page_view'})
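Alternatively, rename accepts an inplace flag; a minimal sketch that modifies df directly (equivalent to reassigning):
df.rename(columns={'pv': 'page_view'}, inplace=True)  # modifies df in place, returns None
df.rename(columns={'page_view': 'pv'}, inplace=True)  # rename back so the later examples still use 'pv'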
- drop: remove a column
df.drop(['dt'], axis=1)
# Output:
site_id user_name pv
0 A10 Sone 2.0
1 A10 welsh 3.0
2 A10 Sone 16.0
3 A10 Albert 20.0
...
19 A20 GG 35.0
20 A20 J.K 30.0
Note: you need to assign the result back for the change to take effect on the original table: df = df.drop(['dt'], axis=1)
- df['xx']: add a column
df['copy_dt']=df['dt'] # add a new column copy_dt, copied from the dt column
df
# Output:
site_id user_name pv dt copy_dt
0 A10 Sone 2.0 20200801 20200801
1 A10 welsh 3.0 20200801 20200801
2 A10 Sone 16.0 20200801 20200801
...
19 A20 GG 35.0 20200803 20200803
20 A20 J.K 30.0 20200803 20200803
2.3 Filtering
- df[xx>x]: filter on a single condition
df[df.pv>30] # rows where pv is greater than 30
# Output:
site_id user_name pv dt
4 A10 GG 32.0 20200801
5 A20 Albert 42.0 20200801
13 A20 welsh 35.0 20200803
14 A10 welsh 33.0 20200803
15 A20 Sone 66.0 20200803
17 A10 Albert 53.0 20200803
19 A20 GG 35.0 20200803
- df[(xx>x)&(yy==y)]: filter on multiple conditions
df["dt"] = df["dt"].astype("int64") # first convert dt to int64
df[(df.pv>30) & (df.dt==20200801)] # rows where pv>30 and dt is 20200801
# Output:
site_id user_name pv dt
4 A10 GG 32.0 20200801
5 A20 Albert 42.0 20200801
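The same multi-condition filter can also be written with query, which is often easier to read; a minimal sketch (making pv numeric first so the comparison is safe):
df['pv'] = df['pv'].astype('float')         # ensure pv is numeric
df.query("pv > 30 and dt == 20200801")      # same result as the boolean indexing above
df[df.user_name.isin(['Sone', 'GG'])]       # isin: keep rows whose user_name is in a given set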
2.4 Sorting
- sort_values: sort by values
df.sort_values(by=["pv"],ascending=False) # pv descending
# Output:
site_id user_name pv dt
15 A20 Sone 66.0 20200803
17 A10 Albert 53.0 20200803
5 A20 Albert 42.0 20200801
19 A20 GG 35.0 20200803
...
1 A10 welsh 3.0 20200801
0 A10 Sone 2.0 20200801
9 A20 Sone NaN 20200802
df.sort_values(by=["pv"],ascending=True) # pv ascending
# Output:
site_id user_name pv dt
0 A10 Sone 2.0 20200801
1 A10 welsh 3.0 20200801
11 A20 Albert 10.0 20200802
6 A20 welsh 10.0 20200801
...
17 A10 Albert 53.0 20200803
15 A20 Sone 66.0 20200803
9 A20 Sone NaN 20200802
Note: rows where pv is NaN end up last whether you sort ascending or descending, so handle nulls before sorting (or place them explicitly, as shown below).
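sort_values also takes an na_position argument to control where the NaN rows land; a minimal sketch:
df.sort_values(by=["pv"], ascending=False, na_position='first')  # NaN rows come first
df.sort_values(by=["pv"], ascending=False, na_position='last')   # default: NaN rows come last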
- sort_index: sort by index
df=df.sort_index(axis=0)
# Output:
site_id user_name pv dt
0 A10 Sone 2.0 20200801
1 A10 welsh 3.0 20200801
2 A10 Sone 16.0 20200801
...
19 A20 GG 35.0 20200803
20 A20 J.K 30.0 20200803
Note: after operations such as sorting by values or grouping, the index ends up out of order, which is when sorting by index is needed.
2.5 Distinct counts
- nunique: count distinct values per group
df.groupby('site_id').agg({'user_name': pd.Series.nunique}) # 5 distinct users under A10, 6 under A20
# Output:
user_name
site_id
A10 5
A20 6
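The same distinct count can be written more directly with the nunique shortcut on the grouped column; a minimal sketch:
df.groupby('site_id')['user_name'].nunique()  # distinct user count per site, as a Series
df['user_name'].nunique()                     # distinct user count over the whole table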
2.6 Aggregation
- groupby('xx'): aggregate by a single column
df.groupby('site_id').count()
# Output:
user_name pv dt
site_id
A10 10 10 10
A20 11 10 11
df.groupby('site_id').min()
# Output:
user_name pv dt
site_id
A10 Albert 2.0 20200801
A20 Albert 10.0 20200801
df.groupby('site_id').max()
# Output:
user_name pv dt
site_id
A10 welsh 53.0 20200803
A20 welsh 66.0 20200803
Note: supported aggregation functions include count(), min(), max(), sum(), mean(), std() and var(); they all ignore NaN values.
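Several statistics can also be computed in one pass by passing a list of function names to agg; a minimal sketch on the columns above:
df.groupby('site_id').agg({'pv': ['sum', 'mean', 'max']})  # total, average and max pv per site
df.groupby('site_id')['pv'].mean()                         # just the average pv per site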
- groupby(['xx','yy']).agg: aggregate by multiple columns
df.groupby(['site_id','user_name']).agg({'pv': 'sum','dt':'count'})
# Output:
pv dt
site_id user_name
A10 Albert 93.0 3
GG 32.0 1
Jojo 28.0 2
Sone 18.0 2
welsh 36.0 2
A20 Albert 52.0 2
GG 35.0 1
J.K 30.0 1
Jojo 15.0 1
Sone 66.0 2
welsh 75.0 4
2.7 Joining
- merge: join on a column
df= pd.merge(df,site,how='inner',on='site_id')
# Output:
site_id user_name pv dt site_name
0 A10 Sone 2.0 20200801 北京朝阳门店
1 A10 welsh 3.0 20200801 北京朝阳门店
...
19 A20 GG 35.0 20200803 北京西黄村店
20 A20 J.K 30.0 20200803 北京西黄村店
- left_index: join on the index
df = df.groupby("site_id").count()
df= pd.merge(df,site,how='inner',left_index=True,right_on="site_id")
# Output:
user_name pv dt site_id site_name
1 10 10 10 A10 北京朝阳门店
3 11 10 11 A20 北京西黄村店
Note: after table A is grouped by site_id, site_id becomes A's index; A's index is then joined against the site_id column of table B, which brings in site_name.
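An equivalent approach is to turn the index back into an ordinary column first and join on it; a minimal sketch, assuming df is the original table from section 1:
agg = df.groupby("site_id").count().reset_index()  # site_id becomes a regular column again
pd.merge(agg, site, how='inner', on='site_id')     # plain column-to-column join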
2.8 Custom functions
- For example, to build a new column by concatenating the pv and dt fields, we can use apply with a lambda:
df['pv']=df['pv'].astype("str") # convert pv to str
df['dt']=df['dt'].astype("str") # convert dt to str
df['pv_dt'] = df.apply(lambda r:(r['pv'] +"_"+ r['dt']),axis=1) # concatenate pv and dt
# Output:
site_id user_name pv dt pv_dt
0 A10 Sone 2.0 20200801 2.0_20200801
1 A10 welsh 3.0 20200801 3.0_20200801
2 A10 Sone 16.0 20200801 16.0_20200801
...
18 A10 Jojo 12.0 20200803 12.0_20200803
19 A20 GG 35.0 20200803 35.0_20200803
20 A20 J.K 30.0 20200803 30.0_20200803
- Method 2: a named custom function
def str_split(row: pd.Series):
    # with axis=1, each row is passed in as a Series; return the concatenated value
    return row['pv'] + "_" + row['dt']
df['pv_dt'] = df.apply(str_split, axis=1)
# Output:
site_id user_name pv dt pv_dt
0 A10 Sone 2.0 20200801 2.0_20200801
1 A10 welsh 3.0 20200801 3.0_20200801
2 A10 Sone 16.0 20200801 16.0_20200801
...
18 A10 Jojo 12.0 20200803 12.0_20200803
19 A20 GG 35.0 20200803 35.0_20200803
20 A20 J.K 30.0 20200803 30.0_20200803
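For simple string concatenation like this, a vectorized expression avoids apply altogether and is usually faster; a minimal sketch:
df['pv_dt'] = df['pv'].astype(str) + "_" + df['dt'].astype(str)  # element-wise string concatenation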
2.9 Index operations
- reset_index: rebuild the index, typically on aggregated data whose index needs to be reset
df = df.groupby("user_name").count() # the index is now user_name
# Output:
site_id pv dt
user_name
Albert 5 5 5
GG 2 2 2
J.K 1 1 1
Jojo 3 3 3
Sone 4 3 4
welsh 6 6 6
df.reset_index('user_name')
# Output:
user_name site_id pv dt # the index after the reset
0 Albert 5 5 5
1 GG 2 2 2
2 J.K 1 1 1
3 Jojo 3 3 3
4 Sone 4 3 4
5 welsh 6 6 6
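If the old index should simply be discarded rather than turned back into a column, reset_index takes a drop flag; a minimal sketch:
df.reset_index(drop=True)  # rebuild a 0..n-1 RangeIndex and throw the old index away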
- set_index: use a column as the index
df.set_index("site_id")
# Output:
user_name pv dt
site_id
A10 Sone 2.0 20200801
A10 welsh 3.0 20200801
A10 Sone 16.0 20200801
...
A20 Jojo 15.0 20200802
A10 Albert 53.0 20200803
A10 Jojo 12.0 20200803
2.10 Handling nulls
- isnull(): null checks; combined with any() it shows which columns contain nulls (True means the column has at least one null), and with sum() it shows how many nulls each column has
df.isnull().any() # which columns contain nulls
# Output:
site_id False
user_name False
pv True
dt False
dtype: bool
df.isnull().sum()
# Output:
site_id 0
user_name 0
pv 1
dt 0
dtype: int64
- notnull(): non-null checks; with any(), True means the column contains at least one non-null value, False means the column is entirely null
df.notnull().any()
# Output:
site_id True
user_name True
pv True
dt True
dtype: bool
- fillna: fill nulls; here Sone's missing pv value is filled with 0
df['pv'] = df.pv.fillna(0)
df
# Output:
site_id user_name pv dt
0 A10 Sone 2.0 20200801
1 A10 welsh 3.0 20200801
..
9 A20 Sone 0.0 20200802
...
20 A20 J.K 30.0 20200803
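If dropping the incomplete rows is preferable to filling them, dropna does that; a minimal sketch (it returns a new DataFrame unless reassigned):
df.dropna(subset=['pv'])  # drop only the rows where pv is NaN
df.dropna()               # drop rows containing any NaN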
2.11 to_csv: writing to a CSV file
df.to_csv("pv.csv")
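A couple of commonly useful options, plus reading the file back, as a minimal sketch (the file name pv.csv is just the example above):
df.to_csv("pv.csv", index=False, encoding="utf-8-sig")  # skip the index column; utf-8-sig keeps Chinese text readable in Excel
df2 = pd.read_csv("pv.csv")                             # load the file back into a DataFrame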
3. Series basics
A Pandas Series is essentially a one-dimensional array made up of values and an index, similar to a single-column MySQL table. Below are the basics of viewing, counting, filtering, sorting, aggregating and handling nulls.
3.1 Viewing
- head: view the first rows
user_name = df['user_name']
user_name.head(2)
# Output:
0 Sone
1 welsh
Name: user_name, dtype: object
3.2 Counting
- shape: row count
user_name = df['user_name']
user_name.shape
# Output:
(21,)
3.3 Filtering
- df[xx=='x']
user_name = df['user_name']
user_name[user_name=='Sone']
# Output:
0 Sone
2 Sone
9 Sone
15 Sone
Name: user_name, dtype: object
3.4 Sorting
- sort_values
user_name = df['user_name']
user_name.sort_values()
# Output:
17 Albert
3 Albert
5 Albert
8 Albert
...
13 welsh
14 welsh
7 welsh
6 welsh
1 welsh
10 welsh
Name: user_name, dtype: object
3.5 Aggregation
user_name = df['user_name']
user_name.count()
# Output:
21
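Beyond a plain count, value_counts gives per-value frequencies, which is often what you want on a Series; a minimal sketch:
user_name.value_counts()  # how many rows each user appears in, sorted descending
user_name.nunique()       # number of distinct users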
3.6 Handling nulls
- isnull(): count nulls
pv = df['pv']
pv.isnull().sum()
# Output:
1
- fillna(0): fill nulls
pv = df['pv']
pv = pv.fillna(0)
# Output:
0 2.0
...
9 0.0
...
20 30.0
Name: pv, dtype: float64
Follow my WeChat official account 【数据猿温大大】
to get the Chinese translation of the official pandas documentation.