2.2 Basic pandas functionality

2.2.1 Reindexing

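  reindex is an important method on pandas objects: it creates a new object whose data conforms to a new index. Calling reindex on a Series rearranges the data according to the new index, and any index value that was not already present gets a missing value (NaN):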

In [1]: import pandas as pd

In [2]: obj = pd.Series([4.5, 5.3, -8.2, 4.9], index=['a', 's', 'q', 'f'])

In [3]: obj
Out[3]:
a 4.5
s 5.3
q -8.2
f 4.9
dtype: float64

In [6]: obj2 = obj.reindex(['a', 's', 'f', 'q', 'e'])

In [7]: obj2
Out[7]:
a 4.5
s 5.3
f 4.9
q -8.2
e NaN
dtype: float64

  The optional method argument lets us fill in values while reindexing. For example, ffill forward-fills, carrying the last valid value forward:

In [15]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [16]: obj3
Out[16]:
0 blue
2 purple
4 yellow
dtype: object

In [17]: obj3.reindex(range(6), method='ffill')
Out[17]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
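
  To make the fill direction concrete, here is a minimal sketch comparing ffill with its counterpart bfill on the same data as obj3. The variable names colors, forward and backward are only illustrative:

```python
import pandas as pd

# Same data as obj3 above: values exist only at labels 0, 2 and 4.
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill carries the last known value forward into each new label.
forward = colors.reindex(range(6), method='ffill')

# bfill pulls the next known value backward; label 5 has nothing
# after it, so it stays missing (NaN).
backward = colors.reindex(range(6), method='bfill')

print(forward)
print(backward)
```

  Note that filling with method requires the original index to be sorted (monotonic), which holds for 0, 2, 4.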
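
  With a DataFrame, reindex can alter the row index, the columns, or both. When passed only a sequence, it reindexes the rows:

import numpy as np

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print(frame)
'''
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
'''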

frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)
'''
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
'''
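
  The row example above has a column counterpart. The sketch below reindexes the columns with the columns keyword and uses fill_value to replace the default NaN. The names states, by_columns and filled are made up for this illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex the columns: 'Ohio' is left out, 'Utah' is new and all NaN.
states = ['Texas', 'Utah', 'California']
by_columns = frame.reindex(columns=states)

# fill_value replaces the default NaN for labels that did not exist.
filled = frame.reindex(['a', 'b', 'c', 'd'], fill_value=0)

print(by_columns)
print(filled)
```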

2.2.2 Dropping entries from an axis

  The drop method returns a new object with the indicated value or values removed from an axis:

In [25]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [26]: obj
Out[26]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64

In [27]: new_obj = obj.drop('c')

In [28]: new_obj
Out[28]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64

In [29]: obj.drop(['d', 'c'])
Out[29]:
a 0.0
b 1.0
e 4.0
dtype: float64

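  With a DataFrame, index values can be deleted from either axis. By default drop removes row labels (axis 0); pass axis=1 or axis='columns' to remove columns: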

In [32]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
...: index=['Ohio', 'Colorado', 'Utah', 'New York'],
...: columns=['one', 'two', 'three', 'four'])

In [33]: data
Out[33]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [34]: data.drop(['Colorado', 'Ohio'])
Out[34]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In [35]: data.drop('two', axis=1)
Out[35]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In [36]: data.drop(['two', 'four'], axis='columns')
Out[36]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

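  Many functions that change the size or shape of a Series or DataFrame, such as drop, return a new object by default. Passing inplace=True makes them modify the original object instead of returning a new one: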

In [39]: obj.drop('c', inplace=True)

In [40]: obj
Out[40]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64

  Be careful with inplace (an argument, not an attribute): it destroys the dropped data in the original object.
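
  The difference between the two modes is easy to check side by side. This is a small sketch, with trimmed as an illustrative name:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default behaviour: drop returns a new Series and leaves obj untouched.
trimmed = obj.drop('c')
print(len(obj), len(trimmed))  # 5 4

# With inplace=True, obj itself loses the entry.
obj.drop('c', inplace=True)
print(len(obj))  # 4
```

  Unless memory is a real concern, the default (non-inplace) form is usually easier to reason about, because the original data stays available.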

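2.2.3 Indexing, selection, and filtering

  Series indexing (obj[...]) works much like NumPy array indexing, except that you can use the Series's index labels instead of only integers: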

In [41]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [42]: obj
Out[42]:
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64

In [43]: obj['b']
Out[43]: 1.0

In [44]: obj[2]  # integer treated as a position here; recent pandas versions deprecate this, prefer obj.iloc[2]
Out[44]: 2.0

In [45]: obj[2 : 4]
Out[45]:
c 2.0
d 3.0
dtype: float64

In [46]: obj[obj < 2]
Out[46]:
a 0.0
b 1.0
dtype: float64

In [48]: obj[[1, 3]]  # same positional fallback; prefer obj.iloc[[1, 3]]
Out[48]:
b 1.0
d 3.0
dtype: float64
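
  One detail the examples above do not show is that slicing by label behaves differently from slicing by position: the end label is included. A minimal sketch using explicit iloc and loc (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Position slice: positions 1 and 2, the end position 3 is excluded.
by_position = obj.iloc[1:3]

# Label slice: labels 'b' through 'd', the end label is included.
by_label = obj.loc['b':'d']

# Explicit positional selection with a list of integers.
explicit = obj.iloc[[1, 3]]

print(by_position.index.tolist())
print(by_label.index.tolist())
print(explicit.tolist())
```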
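
  Indexing into a DataFrame with a single value or a sequence retrieves one or more columns. Slicing or indexing with a boolean array selects rows instead:

In [54]: data  # defined above
Out[54]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15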

In [55]: data['three']
Out[55]:
Ohio 2
Colorado 6
Utah 10
New York 14
Name: three, dtype: int32

In [56]: data[['three', 'four']]
Out[56]:
          three  four
Ohio          2     3
Colorado      6     7
Utah         10    11
New York     14    15

In [57]: data[ : 2]
Out[57]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

In [59]: data[data['three'] > 5]
Out[59]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [60]: data < 5  # returns a boolean DataFrame
Out[60]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

In [61]: data[data < 5] = 0  # set every value less than 5 to 0

In [62]: data
Out[62]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

2.2.3.1 Selecting data with loc and iloc

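  For label-based indexing on the rows of a DataFrame, pandas provides the special indexing operators loc and iloc. They let you select a subset of rows and columns from a DataFrame with NumPy-like notation, using either axis labels (loc) or integer positions (iloc).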

In [68]: data  # original values again; the outputs below assume data was re-created after In [61]
Out[68]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [69]: data.loc['Colorado', ['two', 'three']]
Out[69]:
two 5
three 6
Name: Colorado, dtype: int32

In [70]: data.iloc[2, [3, 0, 1]]
Out[70]:
four 11
one 8
two 9
Name: Utah, dtype: int32

In [71]: data.iloc[2]
Out[71]:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32

In [72]: data.iloc[[1,2], [3,0,1]]
Out[72]:
          four  one  two
Colorado     7    4    5
Utah        11    8    9

In [74]: data.loc[:'Utah', 'two']  # label slices include the end label
Out[74]:
Ohio 1
Colorado 5
Utah 9
Name: two, dtype: int32

In [76]: data.iloc[:, :3][data.three > 5]
Out[76]:
          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14
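
  As a rough cross-check of the examples above, the sketch below selects the same block once by label and once by position, and rewrites the chained expression from In [76] as a single loc call. The names by_label, by_position, chained and single are illustrative only:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# The same rows and columns, selected by label and by position.
by_label = data.loc[['Colorado', 'Utah'], ['four', 'one', 'two']]
by_position = data.iloc[[1, 2], [3, 0, 1]]
print(by_label.equals(by_position))

# Chained form from the text: first three columns, then a row filter.
chained = data.iloc[:, :3][data.three > 5]

# One loc call: boolean mask for the rows, label slice for the columns.
single = data.loc[data['three'] > 5, :'three']
print(single.equals(chained))
```

  The single loc call is often the safer habit, since chained indexing can behave unexpectedly when it is used for assignment.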

2.2.4 Integer indexes

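  On a Series with an integer index, an integer key is ambiguous between label-based and position-based lookup. pandas treats it as a label, so ser[-1] fails: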

In [77]: ser = pd.Series(np.arange(3.))

In [78]: ser
Out[78]:
0 0.0
1 1.0
2 2.0
dtype: float64

In [79]: ser[-1]  # integer index: -1 is looked up as a label, which does not exist
---------------------------------------------------------------------------
KeyError


In [81]: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

In [82]: ser2
Out[82]:
a 0.0
b 1.0
c 2.0
dtype: float64

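In [83]: ser2[-1]  # non-integer index: no ambiguity, so -1 is a position (recent pandas versions deprecate this; prefer ser2.iloc[-1])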
Out[83]: 2.0

  On ser, an integer slice with [] is position-based. For more precise control, use loc (labels) or iloc (integer positions):

In [104]: ser[:1]
Out[104]:
0 0.0
dtype: float64

In [105]: ser.loc[:1]
Out[105]:
0 0.0
1 1.0
dtype: float64

In [106]: ser.iloc[:1]
Out[106]:
0 0.0
dtype: float64
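
  A short sketch of how to avoid the ambiguity altogether by always saying which kind of lookup is meant. The names last, first_two, first_one and missing_label are illustrative:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))  # default integer index: 0, 1, 2

last = ser.iloc[-1]        # position -1, i.e. the last element
first_two = ser.loc[:1]    # labels 0 through 1, end label included
first_one = ser.iloc[:1]   # positions before 1, end position excluded

# loc is strictly label-based, so a label that is absent raises KeyError.
missing_label = False
try:
    ser.loc[-1]
except KeyError:
    missing_label = True

print(last, first_two.tolist(), first_one.tolist(), missing_label)
```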

2.2.5 Arithmetic and data alignment

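  When you add objects together, if any index pairs differ, the index of the result is the union of the index pairs. Internal data alignment introduces missing values at label locations that do not overlap, and those missing values then propagate through further arithmetic.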

In [108]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5],
...: index=['a', 'b', 'd', 'e'])

In [109]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e','f', 'g'])

In [110]: s1
Out[110]:
a 7.3
b -2.5
d 3.4
e 1.5
dtype: float64

In [111]: s2
Out[111]:
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64

In [112]: s1 + s2
Out[112]:
a 5.2
b NaN
c NaN
d NaN
e 0.0
f NaN
g NaN
dtype: float64

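  With DataFrames, alignment is performed on both the rows and the columns. In the df1 + df2 example that follows, columns 'c' and 'e' are not found in both DataFrames, so they are entirely missing in the result; the same holds for rows whose labels are not common to both objects.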

In [113]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
...: index=['Ohio', 'Texas', 'Colorado'])

In [116]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [117]: df1
Out[117]:
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

In [118]: df2
Out[118]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [119]: df1 + df2
Out[119]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
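
  If the NaN values produced by alignment are not wanted, one option is the add method with fill_value, which treats a value missing on one side as the given number. A minimal sketch, with plain and filled as illustrative names:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# The + operator: any cell missing from either side becomes NaN.
plain = df1 + df2

# add with fill_value=0: a cell missing from only one side counts as 0.
# Cells missing from BOTH sides (e.g. Colorado/'e') are still NaN.
filled = df1.add(df2, fill_value=0)

print(plain.isna().sum().sum())
print(filled)
```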

2.2.6 Function application and mapping

  NumPy ufuncs (element-wise array methods) also work with pandas objects:

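In [120]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])  # assumed definition; random values, so your numbers will differ

In [121]: frame
Out[121]:
               b         d         e
Utah   -1.349812 -0.962359 -0.947875
Ohio   -0.226425  0.601588  0.045817
Texas   0.594069  0.205601  1.024613
Oregon  0.566535  0.249397  1.449775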

In [122]: np.abs(frame)
Out[122]:
               b         d         e
Utah    1.349812  0.962359  0.947875
Ohio    0.226425  0.601588  0.045817
Texas   0.594069  0.205601  1.024613
Oregon  0.566535  0.249397  1.449775

  Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame's apply method does exactly this:

In [123]: f = lambda x : x.max() - x.min()

In [124]: frame.apply(f)
Out[124]:
b 1.943881
d 1.563947
e 2.397650
dtype: float64

In [125]: frame.apply(f, axis='columns')
Out[125]:
Utah 0.401937
Ohio 0.828013
Texas 0.819013
Oregon 1.200378
dtype: float64

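  Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column of frame in the first call, so the result is a Series indexed by frame's columns. Passing axis='columns' invokes it once per row instead, giving a Series indexed by frame's row labels.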
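
  The function passed to apply does not have to return a scalar. As a sketch, the hypothetical helper min_max below returns a Series with two values, so apply produces a DataFrame. Small fixed numbers are used instead of random ones so the result is easy to follow:

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 3.0],
                      [4.0, 0.5, -6.0]],
                     index=['Utah', 'Ohio'],
                     columns=list('bde'))

def min_max(x):
    # x is one column (default) or one row (axis='columns') as a Series.
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

# Called once per column: rows are 'min'/'max', columns are b, d, e.
per_column = frame.apply(min_max)

# Called once per row: rows are Utah/Ohio, columns are 'min'/'max'.
per_row = frame.apply(min_max, axis='columns')

print(per_column)
print(per_row)
```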

2.2.7 Sorting and ranking

  To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

In [126]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [127]: obj.sort_index()
Out[127]:
a 1
b 2
c 3
d 0
dtype: int32


  With a DataFrame, you can sort by index on either axis:

In [129]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
...: index=['three', 'one'],
...: columns=['d', 'a', 'b', 'c'])

In [130]: frame
Out[130]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [131]: frame.sort_index()
Out[131]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [132]: frame.sort_index(axis=1)
Out[132]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

In [133]: frame.sort_index(axis=1, ascending=False)  # the default is ascending order
Out[133]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5