Pandas有两个最主要也是最重要的数据结构: Series 和 DataFrame
1. 导包
In [1]: import numpy as np
In [2]: import pandas as pd
2. 创建DataFrame对象
2.1 通过numpy数组(array)创建, 以datetime为索引,并设置列标签
In [4]: dates = pd.date_range('20190601',periods = 4)
In [5]: dates
Out[5]:
DatetimeIndex(['2019-06-01', '2019-06-02', '2019-06-03', '2019-06-04'], dtype='datetime64[ns]', freq='D')
In [10]: np.random.seed(1) # 设置随机数种子,确保随机得到的df1是相同的(参数 1 可以是任意一个整数)
In [11]: df1 = pd.DataFrame(np.random.randint(-10,20,size=(4,5)), index=dates, columns=list('ABCDE'))
In [12]: df1
Out[12]:
A B C D E
2019-06-01 -5 1 2 -2 -1
2019-06-02 1 -5 5 -10 6
2019-06-03 -9 2 -3 3 18
2019-06-04 -4 15 8 10 -5
2.2 通过可以转换为Series的字典创建
In [13]: df2 = pd.DataFrame({ 'A' : 3.,
...: 'B' : pd.Timestamp('20190601'),
...: 'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
...: 'D' : np.array([3] * 4,dtype='int32'),
...: 'E' : pd.Categorical(["tiger","pig","duck","cat"]),
...: 'F' : 'str' })
In [14]: df2
Out[14]:
A B C D E F
0 3.0 2019-06-01 1.0 3 tiger str
1 3.0 2019-06-01 1.0 3 pig str
2 3.0 2019-06-01 1.0 3 duck str
3 3.0 2019-06-01 1.0 3 cat str
In [96]: df3 = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
...: 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
...: 'C' : np.random.randn(8),
...: 'D' : np.random.randn(8)})
In [97]: df3
Out[97]:
A B C D
0 foo one -0.528172 1.462108
1 bar one -1.072969 -2.060141
2 foo two 0.865408 -0.322417
3 bar three -2.301539 -0.384054
4 foo two 1.744812 1.133769
5 bar two -0.761207 -1.099891
6 foo one 0.319039 -0.172428
7 foo three -0.249370 -0.877858
3. 数据转置
In [53]: df1.T
Out[53]:
2019-06-01 2019-06-02 2019-06-03 2019-06-04
A -2 7 17 4
B 3 14 7 4
C 8 2 8 18
D 2 -9 -1 4
E 19 17 1 4
4. 排序
- 按
axis=0
轴的 索引 降序排序 (升序去掉ascending=False
)
In [57]: df1.sort_index(axis=0, ascending=False)
Out[57]:
A B C D E
2019-06-04 4 4 18 4 4
2019-06-03 17 7 8 -1 1
2019-06-02 7 14 2 -9 17
2019-06-01 -2 3 8 2 19
- 按
axis=1
轴的 列标签 降序排序 (升序去掉ascending=False
)
In [54]: df1.sort_index(axis=1, ascending=False)
Out[54]:
E D C B A
2019-06-01 19 2 8 3 -2
2019-06-02 17 -9 2 14 7
2019-06-03 1 -1 8 7 17
2019-06-04 4 4 18 4 4
- 按 列标签
'B'
升序排序 (降序加上ascending=False
)
In [59]: df1.sort_values(by='B')
Out[59]:
A B C D E
2019-06-01 -2 3 8 2 19
2019-06-04 4 4 18 4 4
2019-06-03 17 7 8 -1 1
2019-06-02 7 14 2 -9 17
5. 数据查询操作
5.1 查询前n行:DataFrame.head();查看后n行:DataFrame.tail()
In [15]: df1.head(2)
Out[15]:
A B C D E
2019-06-01 -5 1 2 -2 -1
2019-06-02 1 -5 5 -10 6
In [16]: df1.tail(3)
Out[16]:
A B C D E
2019-06-02 1 -5 5 -10 6
2019-06-03 -9 2 -3 3 18
2019-06-04 -4 15 8 10 -5
5.2 查询数据索引index、列标签columns和数据values
In [17]: df1.index
Out[17]: DatetimeIndex(['2019-06-01', '2019-06-02', '2019-06-03', '2019-06-04'], dtype='datetime64[ns]', freq='D')
In [18]: df1.columns
Out[18]: Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
In [19]: df1.values
Out[19]:
array([[ -5, 1, 2, -2, -1],
[ 1, -5, 5, -10, 6],
[ -9, 2, -3, 3, 18],
[ -4, 15, 8, 10, -5]])
5.3 查询数据基本信息描述
说明:count
: 行数,mean
: 列平均值,std
: 列标准差,min
: 列最小值,max
: 列最大值
In [52]: df1.describe()
Out[52]:
A B C D E
count 4.000000 4.000000 4.000000 4.000000 4.000000
mean -4.250000 3.250000 3.000000 0.250000 4.500000
std 4.112988 8.421203 4.690416 8.421203 10.082989
min -9.000000 -5.000000 -3.000000 -10.000000 -5.000000
25% -6.000000 -0.500000 0.750000 -4.000000 -2.000000
50% -4.500000 1.500000 3.500000 0.500000 2.500000
75% -2.750000 5.250000 5.750000 4.750000 9.000000
max 1.000000 15.000000 8.000000 10.000000 18.000000
5.4 查询数据
a. 获取 单独一列(发生降维), df['A']
<=>df.A
In [61]: df1['A'] # 返回的对象,发生降维
Out[61]:
2019-06-01 -5
2019-06-02 1
2019-06-03 -9
2019-06-04 -4
Freq: D, Name: A, dtype: int64
In [62]: type(df1['A'])
Out[62]: pandas.core.series.Series
b. 切片(行)
In [31]: df1[0:3] # 索引:包括起始,不包括结束
Out[31]:
A B C D E
2019-06-01 -5 1 2 -2 -1
2019-06-02 1 -5 5 -10 6
2019-06-03 -9 2 -3 3 18
In [33]: df1['2019-06-01':'2019-06-03'] # 行标签:包括起始,也包括结束
Out[33]:
A B C D E
2019-06-01 -5 1 2 -2 -1
2019-06-02 1 -5 5 -10 6
2019-06-03 -9 2 -3 3 18
5.5 查询数据
a. 获取 单独一行(发生降维)
In [35]: df1.loc[dates[0]] # 返回的对象,发生降维
Out[35]:
A -5
B 1
C 2
D -2
E -1
Name: 2019-06-01 00:00:00, dtype: int64
In [36]: type(df1.loc[dates[0]])
Out[36]: pandas.core.series.Series
b. 通过 行切片 + 列位置 获取对应数据
In [37]: df1.loc[:,['A','B']]
Out[37]:
A B
2019-06-01 -5 1
2019-06-02 1 -5
2019-06-03 -9 2
2019-06-04 -4 15
In [40]: df1.loc['2019-06-02':'2019-06-03',['A','B']]
Out[40]:
A B
2019-06-02 1 -5
2019-06-03 -9 2
In [41]: df1.loc['2019-06-02':'2019-06-03']
Out[41]:
A B C D E
2019-06-02 1 -5 5 -10 6
2019-06-03 -9 2 -3 3 18
c. 通过 行标签 + 列位置 获取对应数据(发生降维)
In [42]: df1.loc['2019-06-02',['A','B']]
Out[42]:
A 1
B -5
Name: 2019-06-02 00:00:00, dtype: int64
d. 获取指定行和列数据(标量)
In [44]: df1.loc[dates[1],'A']
Out[44]: 1
In [45]: df1.at[dates[1],'A'] # 快速获取一个标量值
Out[45]: 1
5.6 查询数据
a. 获取 单独一行(发生降维)
In [46]: df1.iloc[3]
Out[46]:
A -4
B 15
C 8
D 10
E -5
Name: 2019-06-04 00:00:00, dtype: int64
b. 切片
In [51]: df1.iloc[2:,0:3]
Out[51]:
A B C
2019-06-03 -9 2 -3
2019-06-04 -4 15 8
In [54]: df1.iloc[:,1:3] # 查询第二列-->第三列的所有数据
Out[54]:
B C
2019-06-01 1 2
2019-06-02 -5 5
2019-06-03 2 -3
2019-06-04 15 8
In [55]: df1.iloc[2:4,:] # 查询第三行-->第四行的所有数据
Out[55]:
A B C D E
2019-06-03 -9 2 -3 3 18
2019-06-04 -4 15 8 10 -5
c. 获取对应 位置 数据
In [52]: df1.iloc[[0,2],[1,2,4]]
Out[52]:
B C E
2019-06-01 1 2 -1
2019-06-03 2 -3 18
d. 获取指定行和列数据(标量)DataFrame.iat[1,1]
In [57]: df1.iat[1,1] # 查询第二行,第二列的数据
Out[57]: -5
5.7 查询数据(boolean条件)
In [39]: df1[df1.B > 0]
Out[39]:
A B C D E
2019-06-01 -5 1 2 -2 -1
2019-06-03 -9 2 -3 3 18
2019-06-04 -4 15 8 10 -5
In [40]: df1[df1 > 0]
Out[40]:
A B C D E
2019-06-01 NaN 1.0 2.0 NaN NaN # pandas用np.nan表示缺失数据
2019-06-02 1.0 NaN 5.0 NaN 6.0
2019-06-03 NaN 2.0 NaN 3.0 18.0
2019-06-04 NaN 15.0 8.0 10.0 NaN
5.8 查询数据(isin()方法)
In [46]: df2[df2['E'].isin(['cat','tiger'])] # 查询'E'列中指定数据('cat','tiger')“所在行的所有数据”
Out[46]:
A B C D E F
0 3.0 2019-06-01 1.0 3 tiger str
3 3.0 2019-06-01 1.0 3 cat str
6. 修改数据
In [48]: df1.loc[dates[0]:dates[1],'B'] = 999 # 修改'B'列,第一、二行的两个数据
In [49]: df1
Out[49]:
A B C D E
2019-06-01 -5 999 2 -2 -1
2019-06-02 1 999 5 -10 6
2019-06-03 -9 2 -3 3 18
2019-06-04 -4 15 8 10 -5
In [50]: df1.loc[dates[2]:dates[3],'G'] = 888 # 增加'B'列
In [51]: df1
Out[51]:
A B C D E G
2019-06-01 -5 999 2 -2 -1 NaN
2019-06-02 1 999 5 -10 6 NaN
2019-06-03 -9 2 -3 3 18 888.0
2019-06-04 -4 15 8 10 -5 888.0
7. 缺失数据
7.1 删除所有包含缺失数据的行
In [52]: df1.dropna(how='any')
Out[52]:
A B C D E G
2019-06-03 -9 2 -3 3 18 888.0
2019-06-04 -4 15 8 10 -5 888.0
7.2 填充缺失数据
In [54]: df1.fillna(value=666)
Out[54]:
A B C D E G
2019-06-01 -5 999 2 -2 -1 666.0
2019-06-02 1 999 5 -10 6 666.0
2019-06-03 -9 2 -3 3 18 888.0
2019-06-04 -4 15 8 10 -5 888.0
7.3 获取缺失数据掩码(缺失数据被填充后,依然能被查询到)
In [55]: pd.isnull(df1)
Out[55]:
A B C D E G
2019-06-01 False False False False False True
2019-06-02 False False False False False True
2019-06-03 False False False False False False
2019-06-04 False False False False False False
8. 数据运算df1.apply()
In [60]: df1.apply(np.cumsum) # cumsum:矩阵元素累计和
Out[60]:
A B C D E G
2019-06-01 -5 999 2 -2 -1 NaN
2019-06-02 -4 1998 7 -12 5 NaN
2019-06-03 -13 2000 4 -9 23 888.0
2019-06-04 -17 2015 12 1 18 1776.0
In [61]: df1.apply(lambda x: x.max() - x.min()) # 自定义lambda函数,对df1各列进行同样的遍历计算
Out[61]:
A 10.0
B 997.0
C 11.0
D 20.0
E 23.0
G 0.0
dtype: float64
9. 数据合并
9.1 pd.concat()实现pandas对象的拼接
In [72]: pieces = [df1[0:2], df1[2:], df1[:1]] # pieces:3个pandas对象组成的列表
In [73]: pd.concat(pieces)
Out[73]:
A B C D E G
2019-06-01 -5 999 2 -2 -1 NaN
2019-06-02 1 999 5 -10 6 NaN
2019-06-03 -9 2 -3 3 18 888.0
2019-06-04 -4 15 8 10 -5 888.0
2019-06-01 -5 999 2 -2 -1 NaN
9.2 pd.merge()实现pandas对象的合并(注意区别)
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [79]: left
Out[79]:
key lval
0 foo 1
1 foo 2
In [80]: right
Out[80]:
key rval
0 foo 4
1 foo 5
In [81]: pd.merge(left, right, on='key')
Out[81]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
In [84]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [85]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [86]: left
Out[86]:
key lval
0 foo 1
1 bar 2
In [87]: right
Out[87]:
key rval
0 foo 4
1 bar 5
In [88]: pd.merge(left, right, on='key')
Out[88]:
key lval rval
0 foo 1 4
1 bar 2 5
10. 追加数据df1.append()(行尾追加一行数据)
In [91]: s = df1.iloc[3]
In [92]: s
Out[92]:
A -4.0
B 15.0
C 8.0
D 10.0
E -5.0
G 888.0
Name: 2019-06-04 00:00:00, dtype: float64
In [93]: df1.append(s, ignore_index=True) # 返回一个新的DataFrame对象,注意数据的变化
Out[93]:
A B C D E G
0 -5.0 999.0 2.0 -2.0 -1.0 NaN
1 1.0 999.0 5.0 -10.0 6.0 NaN
2 -9.0 2.0 -3.0 3.0 18.0 888.0
3 -4.0 15.0 8.0 10.0 -5.0 888.0
4 -4.0 15.0 8.0 10.0 -5.0 888.0
In [94]: df1
Out[94]:
A B C D E G
2019-06-01 -5 999 2 -2 -1 NaN
2019-06-02 1 999 5 -10 6 NaN
2019-06-03 -9 2 -3 3 18 888.0
2019-06-04 -4 15 8 10 -5 888.0
In [95]: df1.append(s)
Out[95]:
A B C D E G
2019-06-01 -5.0 999.0 2.0 -2.0 -1.0 NaN
2019-06-02 1.0 999.0 5.0 -10.0 6.0 NaN
2019-06-03 -9.0 2.0 -3.0 3.0 18.0 888.0
2019-06-04 -4.0 15.0 8.0 10.0 -5.0 888.0
2019-06-04 -4.0 15.0 8.0 10.0 -5.0 888.0
11. 分组操作
In [98]: df1
Out[98]:
A B C D E G
2019-06-01 -5 999 2 -2 -1 NaN
2019-06-02 1 999 5 -10 6 NaN
2019-06-03 -9 2 -3 3 18 888.0
2019-06-04 -4 15 8 10 -5 888.0
In [101]: df1.groupby('B').sum() # 对一列分组,然后对结果组应用和函数
Out[101]:
A C D E G
B
2 -9 -3 3 18 888.0
15 -4 8 10 -5 888.0
999 -4 7 -12 5 0.0
In [97]: df3
Out[97]:
A B C D
0 foo one -0.528172 1.462108
1 bar one -1.072969 -2.060141
2 foo two 0.865408 -0.322417
3 bar three -2.301539 -0.384054
4 foo two 1.744812 1.133769
5 bar two -0.761207 -1.099891
6 foo one 0.319039 -0.172428
7 foo three -0.249370 -0.877858
In [103]: df3.groupby('A').sum() # 对一列分组,然后对结果组应用和函数
Out[103]:
C D
A
bar -4.135714 -3.544086
foo 2.151716 1.223174
In [104]: df3.groupby(['A','B']).sum() # 对两列分组,然后对结果组应用和函数
Out[104]:
C D
A B
bar one -1.072969 -2.060141
three -2.301539 -0.384054
two -0.761207 -1.099891
foo one -0.209133 1.289680
three -0.249370 -0.877858
two 2.610219 0.811352
12. 数据压缩“compress”
12.1 准备工作
In [105]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
In [106]: tuples
Out[106]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [108]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [109]: index
Out[109]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])
In [118]: type(index)
Out[118]: pandas.core.indexes.multi.MultiIndex
In [110]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [111]: df
Out[111]:
A B
first second
bar one 0.042214 0.582815
two -1.100619 1.144724
baz one 0.901591 0.502494
two 0.900856 -0.683728
foo one -0.122890 -0.935769
two -0.267888 0.530355
qux one -0.691661 -0.396754
two -0.687173 -0.845206
In [119]: df4 = df[:4]
In [120]: df4
Out[120]:
A B
first second
bar one 0.042214 0.582815
two -1.100619 1.144724
baz one 0.901591 0.502494
two 0.900856 -0.683728
12.2 DataFrame数据stack()
压缩到同一列
In [121]: stacked = df4.stack()
In [122]: stacked
Out[122]:
first second
bar one A 0.042214
B 0.582815
two A -1.100619
B 1.144724
baz one A 0.901591
B 0.502494
two A 0.900856
B -0.683728
dtype: float64
12.3 DataFrame数据unstack()
解压缩, stack的逆操作
In [123]: stacked.unstack()
Out[123]:
A B
first second
bar one 0.042214 0.582815
two -1.100619 1.144724
baz one 0.901591 0.502494
two 0.900856 -0.683728