Python for Data Analysis: Getting Started with pandas (5)

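wx63a03d571f3d9 2022-12-19 18:45:21 Category: Python

© Copyright belongs to the author. This is an original work by 51CTO blog author wx63a03d571f3d9; please contact the author for permission to repost, otherwise legal action will be pursued.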

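2.2 Essential pandas Functionality

2.2.1 Reindexing

reindex is an important method on pandas objects: it creates a new object that conforms to a new index. When called on a Series, reindex rearranges the data according to the new index and introduces missing values for any labels that were not already present: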

In [1]: import pandas as pd

In [2]: obj = pd.Series([4.5, 5.3, -8.2, 4.9], index=['a', 's', 'q', 'f'])

In [3]: obj
Out[3]:
a    4.5
s    5.3
q   -8.2
f    4.9
dtype: float64

In [6]: obj2 = obj.reindex(['a', 's', 'f', 'q', 'e'])

In [7]: obj2
Out[7]:
a    4.5
s    5.3
f    4.9
q   -8.2
e    NaN
dtype: float64

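The optional method argument lets us fill in values while reindexing. For example, ffill forward-fills, so each new label takes the value of the nearest preceding label:

In [15]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [16]: obj3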
Out[16]:
0      blue
2    purple
4    yellow
dtype: object

In [17]: obj3.reindex(range(6), method='ffill')
Out[17]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
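
To make the fill direction concrete, here is a minimal sketch that contrasts method='ffill' with method='bfill' on the same data as obj3. The names forward and backward are only illustrative and do not appear in the original session.

```python
import pandas as pd

# Same data as obj3 above: only the labels 0, 2 and 4 exist.
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the value of the previous existing label.
forward = obj3.reindex(range(6), method='ffill')

# Backward fill: each new label takes the value of the next existing label.
# Label 5 has nothing after it, so it stays missing.
backward = obj3.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.tolist())
```

Both fill methods require the original index to be sorted (monotonic), which holds for 0, 2, 4.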

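reindex works on a DataFrame as well: it can alter the row index, the columns, or both. When passed only a sequence, it reindexes the rows, and a label that did not exist before ('b' here) yields a row of missing values:

import numpy as np

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])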
print(frame)
'''
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
'''

frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)
'''
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
'''
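
The example above reindexes rows only. As a sketch of the other two options, the block below reindexes the columns with the columns keyword and uses fill_value to replace the default NaN. The variable names by_columns and filled are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex the columns instead of the rows: 'Utah' is new, 'Ohio' is dropped.
by_columns = frame.reindex(columns=['Texas', 'Utah', 'California'])

# fill_value replaces the default NaN for labels that did not exist before.
filled = frame.reindex(['a', 'b', 'c', 'd'], fill_value=0)

print(by_columns)
print(filled)
```

Because no NaN is introduced when fill_value=0 is used, the values in filled do not need to be converted to floating point as they were in frame2.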

2.2.2 Dropping Entries from an Axis

The drop method returns a new object with the indicated value or values deleted from an axis:

In [25]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [26]: obj
Out[26]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [27]: new_obj = obj.drop('c')

In [28]: new_obj
Out[28]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [29]: obj.drop(['d', 'c'])
Out[29]:
a    0.0
b    1.0
e    4.0
dtype: float64

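With a DataFrame, index values can be deleted from either axis. Calling drop with a sequence of labels drops rows (axis 0); passing axis=1 or axis='columns' drops columns instead: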

In [32]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
    ...: index=['Ohio', 'Colorado', 'Utah', 'New York'],
    ...: columns=['one', 'two', 'three', 'four'])

In [33]: data
Out[33]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [34]: data.drop(['Colorado', 'Ohio'])
Out[34]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In [35]: data.drop('two', axis=1)
Out[35]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In [36]: data.drop(['two', 'four'], axis='columns')
Out[36]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

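Many functions that change the size or shape of a Series or DataFrame, such as drop, can also operate on the object in place, without returning a new object, when inplace=True is passed: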

In [39]: obj.drop('c', inplace=True)

In [40]: obj
Out[40]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Be careful with the inplace option: it destroys any data that is dropped.
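
The difference between the two calling styles can be sketched as follows. The helper names without_c, original_still_has_c and result are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default behaviour: drop returns a new Series and leaves obj untouched.
without_c = obj.drop('c')
original_still_has_c = 'c' in obj.index

# inplace=True: obj itself is modified and the call returns None.
result = obj.drop('c', inplace=True)

print(original_still_has_c, result, obj.index.tolist())
```

Since the in-place call returns None, writing obj = obj.drop('c', inplace=True) would lose the Series altogether.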

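2.2.3 Indexing, Selection, and Filtering

Series indexing (obj[...]) works much like NumPy array indexing, except that you can use the index labels of the Series instead of only integers: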

In [41]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [42]: obj
Out[42]:
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [43]: obj['b']
Out[43]: 1.0

In [44]: obj[2]  # positional fallback on a label index; deprecated since pandas 2.1, prefer obj.iloc[2]
Out[44]: 2.0

In [45]: obj[2 : 4]
Out[45]:
c    2.0
d    3.0
dtype: float64

In [46]: obj[obj < 2]
Out[46]:
a    0.0
b    1.0
dtype: float64

In [48]: obj[[1, 3]]
Out[48]:
b    1.0
d    3.0
dtype: float64

In [54]: data  # defined above
Out[54]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [55]: data['three']
Out[55]:
Ohio         2
Colorado     6
Utah        10
New York    14
Name: three, dtype: int32

In [56]: data[['three', 'four']]
Out[56]:
          three  four
Ohio          2     3
Colorado      6     7
Utah         10    11
New York     14    15

In [57]: data[ : 2]
Out[57]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

In [59]: data[data['three'] > 5]
Out[59]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [60]: data < 5  # returns a boolean DataFrame
Out[60]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

In [61]: data[data < 5] = 0  # set every value less than 5 to 0

In [62]: data
Out[62]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
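
As a sketch of how the two kinds of boolean selection combine, the block below builds a row mask from two conditions and then uses where as a non-mutating alternative to the assignment above. The names mask, selected and clipped are assumptions for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# A boolean Series selects rows. Combine conditions with & or |
# and wrap each condition in parentheses.
mask = (data['three'] > 5) & (data['one'] < 12)
selected = data[mask]

# A boolean DataFrame selects individual cells. where() keeps values
# where the condition is True and substitutes 0 elsewhere, without
# modifying data itself.
clipped = data.where(data >= 5, 0)

print(selected)
print(clipped)
```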

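2.2.3.1 Selection with loc and iloc

For label-based indexing on the rows of a DataFrame, pandas provides the special indexing operators loc and iloc. They let you select a subset of rows and columns with NumPy-like notation, using either axis labels (loc) or integer positions (iloc).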

In [68]: data  # rebuilt as in In [32], before the assignment in In [61]
Out[68]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [69]: data.loc['Colorado', ['two', 'three']]
Out[69]:
two      5
three    6
Name: Colorado, dtype: int32

In [70]: data.iloc[2, [3, 0, 1]]
Out[70]:
four    11
one      8
two      9
Name: Utah, dtype: int32

In [71]: data.iloc[2]
Out[71]:
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [72]: data.iloc[[1,2], [3,0,1]]
Out[72]:
          four  one  two
Colorado     7    4    5
Utah        11    8    9

In [74]: data.loc[:'Utah', 'two']  # label slices include the endpoint
Out[74]:
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

In [76]: data.iloc[:, :3][data.three > 5]
Out[76]:
          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14
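
One detail worth seeing side by side is how the two operators treat slice endpoints. The following minimal sketch uses the same data; the names by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# loc slices by label and includes both endpoints:
# rows Ohio through Utah, columns one through two.
by_label = data.loc['Ohio':'Utah', 'one':'two']

# iloc slices by position and excludes the stop position,
# exactly like a Python list or NumPy array.
by_position = data.iloc[0:2, 0:1]

print(by_label.shape, by_position.shape)
```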

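2.2.4 Integer Indexes

pandas objects indexed by integers often trip up new users, because an integer key is treated as a label rather than a position. With the default integer index below, ser[-1] looks for the label -1, which does not exist, and raises an error: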

In [77]: ser = pd.Series(np.arange(3.))

In [78]: ser
Out[78]:
0    0.0
1    1.0
2    2.0
dtype: float64

In [79]: ser[-1]  # integer index: -1 is looked up as a label and fails
---------------------------------------------------------------------------
KeyError


In [80]: ser
Out[80]:
0    0.0
1    1.0
2    2.0
dtype: float64

In [81]: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

In [82]: ser2
Out[82]:
a    0.0
b    1.0
c    2.0
dtype: float64

In [83]: ser2[-1]  # non-integer index: no error, -1 is treated as a position (deprecated since pandas 2.1; prefer ser2.iloc[-1])
Out[83]: 2.0

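In [102]: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])  # a non-integer index is unambiguous

In [103]: ser2[-1]
Out[103]: 2.0

For consistency, prefer loc for labels and iloc for integer positions. Slicing with plain [] and with iloc is positional and excludes the endpoint, while slicing with loc is label-based and includes it:

In [104]: ser[:1]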
Out[104]:
0    0.0
dtype: float64

In [105]: ser.loc[:1]
Out[105]:
0    0.0
1    1.0
dtype: float64

In [106]: ser.iloc[:1]
Out[106]:
0    0.0
dtype: float64
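
The ambiguity disappears once the intent is spelled out with loc or iloc. Below is a minimal sketch; the Series shifted and its labels are made up for illustration and are not part of the original session.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))  # default integer index 0, 1, 2

# iloc is always positional, so -1 means "the last element".
last_value = ser.iloc[-1]

# loc is always label-based, so -1 is looked up as a label and is not found.
try:
    ser.loc[-1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# With integer labels that differ from the positions, the distinction matters.
shifted = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])
by_label = shifted.loc[0]      # the value stored under label 0
by_position = shifted.iloc[0]  # the first value in the Series

print(last_value, label_lookup_failed, by_label, by_position)
```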

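2.2.5 Arithmetic and Data Alignment

When you add objects together, if any index pairs are not the same, the index of the result is the union of the index pairs. The internal data alignment introduces missing values at label locations that do not overlap, and those missing values then propagate through further arithmetic.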

In [108]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5],
     ...: index=['a', 'b', 'd', 'e'])

In [109]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e','f', 'g'])

In [110]: s1
Out[110]:
a    7.3
b   -2.5
d    3.4
e    1.5
dtype: float64

In [111]: s2
Out[111]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [112]: s1 + s2
Out[112]:
a    5.2
b    NaN
c    NaN
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

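For a DataFrame, alignment is performed on both the rows and the columns. In the example below, the 'c' and 'e' columns are not found in both DataFrame objects, so they are entirely missing in the result. The same holds for rows whose labels are not common to both objects.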

In [113]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
     ...: index=['Ohio', 'Texas', 'Colorado'])

In [116]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [117]: df1
Out[117]:
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

In [118]: df2
Out[118]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [119]: df1 + df2
Out[119]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
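
If the missing values produced by alignment are not wanted, one option is the add method with its fill_value argument, which is not used in the original session. The sketch below assumes a fill value of 0 and uses the illustrative names plain and filled.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# The + operator yields NaN wherever a label is missing on either side.
plain = df1 + df2

# add() with fill_value treats a value missing on ONE side as 0.
# Cells missing on BOTH sides (for example Utah/'c') are still NaN.
filled = df1.add(df2, fill_value=0)

print(plain.isna().sum().sum(), filled.isna().sum().sum())
```

Whether 0 is a sensible substitute depends on the data, so this is a design choice rather than a default to apply blindly.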

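2.2.6 Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [120]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])  # random values; your numbers will differ

In [121]: frame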
Out[121]:
               b         d         e
Utah   -1.349812 -0.962359 -0.947875
Ohio   -0.226425  0.601588  0.045817
Texas   0.594069  0.205601  1.024613
Oregon  0.566535  0.249397  1.449775

In [122]: np.abs(frame)
Out[122]:
               b         d         e
Utah    1.349812  0.962359  0.947875
Ohio    0.226425  0.601588  0.045817
Texas   0.594069  0.205601  1.024613
Oregon  0.566535  0.249397  1.449775

Another frequent operation is applying a function on one-dimensional arrays to each column or row. The apply method of DataFrame does exactly this:

In [123]: f = lambda x : x.max() - x.min()

In [124]: frame.apply(f)
Out[124]:
b    1.943881
d    1.563947
e    2.397650
dtype: float64

In [125]: frame.apply(f, axis='columns')
Out[125]:
Utah      0.401937
Ohio      0.828013
Texas     0.819013
Oregon    1.200378
dtype: float64

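Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column of frame. The result is a Series with the columns of frame as its index. Passing axis='columns' instead invokes f once per row, so the result is indexed by the row labels.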
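
The function passed to apply does not have to return a scalar; it may also return a Series with several values. The sketch below uses a small deterministic frame instead of random data so the results are easy to check by hand. The function names value_range and min_max are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame used above.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def value_range(x):
    """Difference between the largest and smallest value of a Series."""
    return x.max() - x.min()

def min_max(x):
    """Return two summary values, labelled 'min' and 'max'."""
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

per_column = frame.apply(value_range)              # one result per column
per_row = frame.apply(value_range, axis='columns')  # one result per row
summary = frame.apply(min_max)                      # a DataFrame with rows 'min' and 'max'

print(per_column)
print(per_row)
print(summary)
```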

2.2.7 Sorting and Ranking

To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

In [126]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [127]: obj.sort_index()
Out[127]:
a    1
b    2
c    3
d    0
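dtype: int32

With a DataFrame, you can sort by the index on either axis. Data is sorted in ascending order by default; pass ascending=False for descending order:

In [129]: frame = pd.DataFrame(np.arange(8).reshape((2,4)),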
     ...: index=['three', 'one'],
     ...: columns=['d', 'a', 'b', 'c'])

In [130]: frame
Out[130]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [131]: frame.sort_index()
Out[131]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [132]: frame.sort_index(axis=1)
Out[132]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

In [133]: frame.sort_index(axis=1, ascending=False)
Out[133]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
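
Because sort_index returns a new object, calls can be chained to sort both axes while the original stays untouched. This is a minimal sketch, with sorted_both as an illustrative name.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

# Sort the rows first, then the columns; each call returns a new DataFrame.
sorted_both = frame.sort_index().sort_index(axis=1)

print(sorted_both)
print(frame.index.tolist())  # the original order is unchanged
```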