Table of Contents

  • 1. Handling missing values
  • 1.1 dropna(how, axis=0, thresh)
  • 1.2 fillna(value, method, axis, inplace, limit)
  • 2. Data transformation
  • 2.1 Removing duplicates
  • 2.2 Transforming data with a function or mapping
  • 2.3 Replacing values
  • 2.4 Renaming axis indexes
  • 2.5 Discretization and binning
  • 2.6 Detecting and filtering outliers
  • 2.7 Random sampling
  • 2.8 Dummy variables
  • 3 String manipulation
  • 3.1 Basic string operations
  • 3.2 Regular expressions
  • 3.3 Vectorized string functions in pandas
  • 3.3.1 List of pandas vectorized string methods


1. Handling missing values

  • pandas displays missing values as NaN by default
  • NumPy displays them as nan
  • NaN and nan are the same thing; it is simply shown as NaN in pandas and nan in NumPy
  • NaN and nan are both floating-point values
  • None and NaN are both treated as missing values (NA) by pandas
  • NA stands for "not available"; in statistics, NA can mean data that does not exist, or data that exists but could not be observed
In [4]: string_data = pd.DataFrame([None,pd.NA,np.nan,'','zhangfan'])

In [5]: string_data.isnull()
Out[5]:
       0
0   True
1   True
2   True
3  False
4  False
1.1 dropna(how, axis=0, thresh)

series.dropna() or df.dropna()

  • With no arguments, dropna() drops every row that contains a missing value
  • dropna(how='all') drops only the rows in which every value is NA
  • dropna(thresh=x) keeps the rows that have at least x non-NA values
In [19]: df1
Out[19]:
          0   1         2         3
0 -0.935335 NaN       NaN  0.402039
1  0.265103 NaN       NaN -0.229446
2  0.084044 NaN -0.385596 -0.460048
3  0.043280 NaN  1.498160 -0.292045
4  0.209144 NaN -1.310729 -0.376214

In [20]: df1.dropna()  # every row contains NaN, so all rows are dropped
Out[20]:
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
    
In [21]: df1.dropna(how='all',axis='columns')  # drop the columns whose values are all NA
Out[21]:
          0         2         3
0 -0.935335       NaN  0.402039
1  0.265103       NaN -0.229446
2  0.084044 -0.385596 -0.460048
3  0.043280  1.498160 -0.292045
4  0.209144 -1.310729 -0.376214

In [23]: df1.dropna(thresh=3)  # keep rows with at least 3 non-NA values
Out[23]:
          0   1         2         3
2  0.084044 NaN -0.385596 -0.460048
3  0.043280 NaN  1.498160 -0.292045
4  0.209144 NaN -1.310729 -0.376214
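dropna also accepts a `subset` parameter that restricts the NA check to certain columns. A small sketch, using made-up data for illustration:

```python
import numpy as np
import pandas as pd

# column 1 is all-NA, column 2 is partly NA
df = pd.DataFrame({0: [1.0, 2.0, 3.0],
                   1: [np.nan] * 3,
                   2: [np.nan, 5.0, 6.0]})

# subset restricts the NA check to the given columns: only rows
# with a missing value in column 2 are dropped here.
result = df.dropna(subset=[2])
print(result)
```

The all-NA column 1 no longer matters; only the first row (NaN in column 2) is removed.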
1.2 fillna(value, method, axis, inplace, limit)

df.fillna() or series.fillna()

  • value: a scalar or dict-like object used to fill in the missing values
  • method: 'ffill' for forward fill or 'bfill' for backward fill (the default is None; recent pandas versions deprecate this parameter in favor of df.ffill()/df.bfill())
  • axis: the axis along which to fill
  • inplace: bool, whether to modify the original object
  • limit: the maximum number of consecutive missing values to fill when forward or backward filling
In [27]: df1
Out[27]:
          0   1         2         3
0 -0.935335 NaN       NaN  0.402039
1  0.265103 NaN       NaN -0.229446
2  0.084044 NaN -0.385596 -0.460048
3  0.043280 NaN  1.498160 -0.292045
4  0.209144 NaN -1.310729 -0.376214

In [28]: df1.fillna(0)
Out[28]:
          0    1         2         3
0 -0.935335  0.0  0.000000  0.402039
1  0.265103  0.0  0.000000 -0.229446
2  0.084044  0.0 -0.385596 -0.460048
3  0.043280  0.0  1.498160 -0.292045
4  0.209144  0.0 -1.310729 -0.376214

In [29]: df1.fillna({1:1,2:2})  # pass a dict to fill each column with a different value
Out[29]:
          0    1         2         3
0 -0.935335  1.0  2.000000  0.402039
1  0.265103  1.0  2.000000 -0.229446
2  0.084044  1.0 -0.385596 -0.460048
3  0.043280  1.0  1.498160 -0.292045
4  0.209144  1.0 -1.310729 -0.376214


In [38]: df2  # a different DataFrame
Out[38]:
          0     1          2         3
0 -0.935335  10.0  10.000000  0.402039
1  0.265103   NaN        NaN -0.229446
2  0.084044   NaN  -0.385596 -0.460048
3  0.043280   NaN   1.498160 -0.292045
4  0.209144   NaN  -1.310729 -0.376214

In [39]: df2.fillna(method='ffill')
Out[39]:
          0     1          2         3
0 -0.935335  10.0  10.000000  0.402039
1  0.265103  10.0  10.000000 -0.229446
2  0.084044  10.0  -0.385596 -0.460048
3  0.043280  10.0   1.498160 -0.292045
4  0.209144  10.0  -1.310729 -0.376214

In [40]: df2.fillna(axis=0,method='ffill')  # axis=0 is the default, so this matches the previous result
Out[40]:
          0     1          2         3
0 -0.935335  10.0  10.000000  0.402039
1  0.265103  10.0  10.000000 -0.229446
2  0.084044  10.0  -0.385596 -0.460048
3  0.043280  10.0   1.498160 -0.292045
4  0.209144  10.0  -1.310729 -0.376214

In [41]: df2.fillna(axis=1,method='ffill')  # fill along each row, left to right
Out[41]:
          0          1          2         3
0 -0.935335  10.000000  10.000000  0.402039
1  0.265103   0.265103   0.265103 -0.229446
2  0.084044   0.084044  -0.385596 -0.460048
3  0.043280   0.043280   1.498160 -0.292045
4  0.209144   0.209144  -1.310729 -0.376214

In [43]: df2.fillna(method='ffill',limit=2)  # forward-fill at most two consecutive missing values per column
Out[43]:
          0     1          2         3
0 -0.935335  10.0  10.000000  0.402039
1  0.265103  10.0  10.000000 -0.229446
2  0.084044  10.0  -0.385596 -0.460048
3  0.043280   NaN   1.498160 -0.292045
4  0.209144   NaN  -1.310729 -0.376214
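A common fill pattern worth noting: passing a Series to fillna aligns it on the column labels, so each column can be filled with its own mean. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 2.0, 4.0]})

# df.mean() returns a Series indexed by column name; fillna aligns
# it on the columns, so each column is filled with its own mean.
filled = df.fillna(df.mean())
print(filled)
```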

2. Data transformation

2.1 Removing duplicates
  • df.duplicated() returns, for each row, whether it duplicates a row that appeared earlier
  • df.drop_duplicates() can be called with no arguments, or with a list of column names, in which case rows are considered duplicates based on those columns only
  • df.drop_duplicates() keeps the first occurrence by default; pass keep='last' to keep the last occurrence instead
In [46]: data
Out[46]:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6

In [47]: data.duplicated()
Out[47]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool

In [48]: data.drop_duplicates()
Out[48]:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6

In [49]: data.drop_duplicates(['k1','k2'])
Out[49]:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
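The keep='last' behavior mentioned above can be sketched on a small made-up frame:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two', 'two'],
                     'k2': [1, 4, 4],
                     'v1': [0, 5, 6]})

# keep='last' retains the last occurrence within each duplicate group,
# so the row with v1 == 6 survives instead of the one with v1 == 5.
last = data.drop_duplicates(['k1', 'k2'], keep='last')
print(last)
```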
2.2 Transforming data with a function or mapping

Example:

In [50]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
    ...:                               'Pastrami', 'corned beef', 'Bacon',
    ...:                               'pastrami', 'honey ham', 'nova lox'],
    ...:                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
    ...:

In [51]: data
Out[51]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

In [52]: meat_to_animal = {
    ...:   'bacon': 'pig',
    ...:   'pulled pork': 'pig',
    ...:   'pastrami': 'cow',
    ...:   'corned beef': 'cow',
    ...:   'honey ham': 'pig',
    ...:   'nova lox': 'salmon'
    ...: }

In [53]: lowercased = data['food'].str.lower()
    ...:
    ...: # Series.map also accepts a dict: each element of the Series
    ...: # is used as a key to look up the corresponding value
    ...: data['animal'] = lowercased.map(meat_to_animal)
    ...: data
    ...:
Out[53]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
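map also accepts a function, so the lowercasing and the dictionary lookup above can be combined into one step. A sketch on a shortened version of the same data:

```python
import pandas as pd

s = pd.Series(['bacon', 'Pastrami', 'honey ham'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'honey ham': 'pig'}

# The function is applied element-wise; each value is lowercased
# and then looked up in the dict.
animals = s.map(lambda x: meat_to_animal[x.lower()])
print(animals)
```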
2.3 Replacing values

df.replace()

In [54]: data = pd.Series([1,-999,2,-999,-1000,3])

In [55]: data.replace(-999,np.nan)  # replace -999 with np.nan
Out[55]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [56]: data.replace([-999,-1000],np.nan)  # replace both -999 and -1000 with np.nan
Out[56]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [57]: data.replace([-999,-1000],[np.nan,0])  # replace -999 with np.nan and -1000 with 0
Out[57]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [59]: data
Out[59]:
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: int64

In [60]: data.replace({-999:np.nan,-1000:0})  # replace -999 with np.nan and -1000 with 0
Out[60]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
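replace works the same way on a DataFrame; the dict maps old values to their replacements across all columns. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, -999, 3],
                   'b': [-999, 2, -1000]})

# One replace call cleans both sentinel values everywhere in the frame.
cleaned = df.replace({-999: np.nan, -1000: 0})
print(cleaned)
```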
2.4 Renaming axis indexes

df.rename(index, columns); the inplace parameter is supported

df.reindex() rebuilds the index, i.e. it selects and rearranges labels, whereas df.rename() only renames the labels without changing their order

In [61]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
    ...:                     index=['Ohio', 'Colorado', 'New York'],
    ...:                     columns=['one', 'two', 'three', 'four'])
    ...:

In [62]: data.rename(index=str.upper,columns=str.title)  # functions can be passed too; a less common usage
Out[62]:
          One  Two  Three  Four
OHIO        0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11

In [63]: data.rename(index = {'Ohio':'Indiana'},columns={'three':'peekaboo'})  # renaming via dicts is the more common usage
Out[63]:
          one  two  peekaboo  four
Indiana     0    1         2     3
Colorado    4    5         6     7
New York    8    9        10    11
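The contrast with reindex described above can be sketched directly (a small made-up frame):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(4).reshape((2, 2)),
                    index=['a', 'b'], columns=['x', 'y'])

# rename relabels the existing rows; the data stay where they are.
renamed = data.rename(index={'a': 'A'})

# reindex selects and rearranges by label, inserting NaN rows
# for labels that do not exist in the original index.
reindexed = data.reindex(['b', 'a', 'c'])
print(renamed.index.tolist(), reindexed.index.tolist())
```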
2.5 Discretization and binning
  • pd.cut()
  • pd.qcut()

Example:

In [64]: ages = [20,22,25,27,21,23,37,31,61,45,41,32]

In [65]: bins = [18,25,35,60,100]

In [66]: pd.cut(ages,bins)
Out[66]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [67]: pd.cut(ages,bins,right=False)  # the default (right=True) gives left-open, right-closed intervals; right=False makes them left-closed, right-open
Out[67]:
[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

In [68]: labels = ['年轻','中年','老年','晚年']

In [69]: pd.cut(ages,bins,right=False,labels=labels)  # labels gives each interval a name
Out[69]:
['年轻', '年轻', '中年', '中年', '年轻', ..., '中年', '晚年', '老年', '老年', '中年']
Length: 12
Categories (4, object): ['年轻' < '中年' < '老年' < '晚年']
                                                         
In [70]: pd.cut(ages,4)  # passing an integer bins the data into that many equal-width intervals
Out[70]:
[(19.959, 30.25], (19.959, 30.25], (19.959, 30.25], (19.959, 30.25], (19.959, 30.25], ..., (30.25, 40.5], (50.75, 61.0], (40.5, 50.75], (40.5, 50.75], (30.25, 40.5]]
Length: 12
Categories (4, interval[float64]): [(19.959, 30.25] < (30.25, 40.5] < (40.5, 50.75] < (50.75, 61.0]]

In [71]: pd.cut(ages,4,precision=2)  # precision=2 limits the bin edges to two decimal places
Out[71]:
[(19.96, 30.25], (19.96, 30.25], (19.96, 30.25], (19.96, 30.25], (19.96, 30.25], ..., (30.25, 40.5], (50.75, 61.0], (40.5, 50.75], (40.5, 50.75], (30.25, 40.5]]
Length: 12
Categories (4, interval[float64]): [(19.96, 30.25] < (30.25, 40.5] < (40.5, 50.75] < (50.75, 61.0]]

Binning by quantiles:

In [64]: ages = [20,22,25,27,21,23,37,31,61,45,41,32]

In [73]: pd.qcut(ages,4)  # cut the data into quartiles, so each bin holds the same number of points
Out[73]:
[(19.999, 22.75], (19.999, 22.75], (22.75, 29.0], (22.75, 29.0], (19.999, 22.75], ..., (29.0, 38.0], (38.0, 61.0], (38.0, 61.0], (38.0, 61.0], (29.0, 38.0]]
Length: 12
Categories (4, interval[float64]): [(19.999, 22.75] < (22.75, 29.0] < (29.0, 38.0] < (38.0, 61.0]]
                                                                                      
In [74]: cats = pd.qcut(ages,4)

In [75]: pd.value_counts(cats)
Out[75]:
(38.0, 61.0]       3
(29.0, 38.0]       3
(22.75, 29.0]      3
(19.999, 22.75]    3
dtype: int64

In [77]: x = pd.qcut(ages,[0.1,0.2,0.5,0.7,1])  # a list of quantiles can be passed instead; values below the first quantile become NaN

In [78]: x
Out[78]:
[NaN, (21.099, 22.2], (22.2, 29.0], (22.2, 29.0], NaN, ..., (29.0, 35.5], (35.5, 61.0], (35.5, 61.0], (35.5, 61.0], (29.0, 35.5]]
Length: 12
Categories (4, interval[float64]): [(21.099, 22.2] < (22.2, 29.0] < (29.0, 35.5] < (35.5, 61.0]]

In [79]: pd.value_counts(x)
Out[79]:
(35.5, 61.0]      4
(22.2, 29.0]      3
(29.0, 35.5]      2
(21.099, 22.2]    1
dtype: int64
2.6 Detecting and filtering outliers

np.sign(df) returns 1 or -1 depending on the sign of each value (and 0 for zeros).

In [80]: data  = np.random.randn(4,3)

In [81]: np.sign(data)
Out[81]:
array([[ 1.,  1.,  1.],
       [ 1., -1.,  1.],
       [ 1., -1., -1.],
       [-1.,  1.,  1.]])
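In practice np.sign is combined with boolean indexing to detect and cap outliers. A typical sketch, using the conventional (but arbitrary) +/-3 threshold on random data:

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))

# Select the rows containing at least one value beyond +/-3 ...
outliers = data[(np.abs(data) > 3).any(axis=1)]

# ... then cap every such value at +/-3, preserving its sign.
data[np.abs(data) > 3] = np.sign(data) * 3
print(data.abs().max())
```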
2.7 Random sampling

df.sample() or series.sample()

In [90]: data.sample(3)  # sampling is without replacement by default, so the sample size cannot exceed the number of rows
Out[90]:
          0         1         2
0  0.637001  1.817168  0.179284
3 -0.308768  0.386770  0.247441
1  1.188007 -0.032808  1.389581

In [91]: data.sample(n =10,replace=True)  # replace=True samples with replacement, so the sample size may exceed the number of rows
Out[91]:
          0         1         2
0  0.637001  1.817168  0.179284
0  0.637001  1.817168  0.179284
3 -0.308768  0.386770  0.247441
3 -0.308768  0.386770  0.247441
0  0.637001  1.817168  0.179284
3 -0.308768  0.386770  0.247441
1  1.188007 -0.032808  1.389581
3 -0.308768  0.386770  0.247441
2  0.957482 -1.103823 -0.124816
3 -0.308768  0.386770  0.247441

In [92]: data.sample(5,axis=1,replace=True)  # columns can be sampled too, via axis=1
Out[92]:
          1         0         2         1         0
0  1.817168  0.637001  0.179284  1.817168  0.637001
1 -0.032808  1.188007  1.389581 -0.032808  1.188007
2 -1.103823  0.957482 -0.124816 -1.103823  0.957482
3  0.386770 -0.308768  0.247441  0.386770 -0.308768
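sample also accepts a `random_state` parameter for reproducible draws, which is useful when the sampled subset needs to be recreated later. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': range(5)})

# random_state seeds the sampler, so repeated calls with the
# same seed draw the same rows.
s1 = df.sample(3, random_state=42)
s2 = df.sample(3, random_state=42)
print(s1.index.tolist() == s2.index.tolist())
```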
2.8 Dummy variables

pd.get_dummies()

In [94]: df = pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})

In [95]: df
Out[95]:
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5

In [96]: pd.get_dummies(df['key'],prefix='key')  # prefix adds a prefix to the dummy column names
Out[96]:
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0
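A common follow-up is to join the dummy columns back onto the other columns of the original frame; a sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Build the dummies, then join them to the remaining column(s).
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy.columns.tolist())
```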

3 String manipulation

3.1 Basic string operations
  • split() splits a string
In [3]: val = 'zhang#fan'

In [4]: val.split('#')
Out[4]: ['zhang', 'fan']
  • strip() removes leading and trailing whitespace
In [5]: val = '  zhangfan    '

In [6]: val.strip()
Out[6]: 'zhangfan'

Correspondingly, rstrip() and lstrip() remove whitespace from only the right or the left side of the string

In [7]: val.rstrip()
Out[7]: '  zhangfan'

In [8]: val.lstrip()
Out[8]: 'zhangfan'
  • separator.join(tuple or list)
    the tuple or list passed to join() must contain only strings
In [9]: ls = ['zhang','fan']

In [10]: '##'.join(ls)
Out[10]: 'zhang##fan'
  • the in operator
In [11]: 'zhang' in 'zhangfan'
Out[11]: True
  • string.index(substring)
    searches for the substring within string; if found, returns the index of its first character, otherwise raises an error
In [12]: 'zhangfan'.index('fan')
Out[12]: 5

In [13]: 'zhangfan'.index('ff')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-9106a25b3729> in <module>()
----> 1 'zhangfan'.index('ff')

ValueError: substring not found
  • string.find(substring)
    like index(), returns the index of the substring's first character if found, but returns -1 instead of raising when it is not found; note the difference from index()
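The find behavior can be shown on the same string used above:

```python
val = 'zhangfan'

# find behaves like index but returns -1 instead of raising
# when the substring is absent.
print(val.find('fan'))
print(val.find('ff'))
```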
  • string.count(substring)
In [3]: string_data = 'abcdadbvdfascxavgfasvdcx'

In [4]: string_data.count('a')
Out[4]: 5
  • string.replace(old, new)
    replace substitutes every occurrence of old with new
In [6]: string_data = "zhangsanzhangsanzhangsan"

In [7]: string_data.replace('san','fan')
Out[7]: 'zhangfanzhangfanzhangfan'
  • string.endswith(substring)
    returns True if string ends with the substring, otherwise False
  • string.startswith(substring)
    returns True if string starts with the substring, otherwise False
  • string.rfind(substring)
    returns the index of the first character of the last occurrence of the substring in string, or -1 if the substring is not found
  • string.lower()
    converts string to all lowercase
  • string.upper()
    converts string to all uppercase
  • string.ljust(width) and string.rjust(width)
  • the integer argument is the desired total length of the result
  • if the integer is less than the length of the original string, the original string is returned unchanged; if greater, spaces are added before or after until the length equals the integer
  • ljust() left-justifies the string, padding with spaces on the right; rjust() does the opposite
In [8]: string_data
Out[8]: 'zhangsanzhangsanzhangsan'

In [9]: string_data.ljust(20)
Out[9]: 'zhangsanzhangsanzhangsan'

In [10]: string_data.ljust(30)
Out[10]: 'zhangsanzhangsanzhangsan      '
    
In [11]: string_data.rjust(30)
Out[11]: '      zhangsanzhangsanzhangsan'
3.2 Regular expressions

Usage of regular expressions

Usage of the methods in the re library
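As a quick reminder of the re basics that the pandas vectorized methods build on, a minimal sketch:

```python
import re

text = 'foo    bar\t baz  \tqux'

# Split on runs of whitespace of any kind.
parts = re.split(r'\s+', text)

# Compiling a pattern lets it be reused without re-parsing.
regex = re.compile(r'\s+')
same_parts = regex.split(text)
print(parts)
```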

3.3 Vectorized string functions in pandas

Vectorization: series.str

Notes:

  • only Series objects have the pandas vectorized string accessor; DataFrame objects do not
  • the series.str accessor works only when the Series holds strings; if the Series holds other data types, the vectorized string operations cannot be used
    Counterexample:
In [22]: data = pd.DataFrame(np.random.randn(4,3))

In [23]: data
Out[23]:
          0         1         2
0  1.737271 -0.653427  0.472769
1  1.651421 -0.592218 -0.427529
2 -1.115946  1.528793 -1.038506
3 -1.257965 -2.352095  0.376387

In [24]: data.str[0]  # see the last line of the traceback: a DataFrame has no str attribute
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-6cd9728cfd04> in <module>()
----> 1 data.str[0]

~\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5137             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5138                 return self[name]
-> 5139             return object.__getattribute__(self, name)
   5140
   5141     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'str'

In [25]: type(data[1])  # a single column is a Series
Out[25]: pandas.core.series.Series
    
In [26]: data.dtypes  # all the columns hold floats
Out[26]:
0    float64
1    float64
2    float64
dtype: object

In [27]: data[1].str[0]  # see the last line of the traceback: .str works only with string values
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-27-50a2efbe44ba> in <module>()
----> 1 data[1].str[0]

### traceback truncated

AttributeError: Can only use .str accessor with string values!

Once the two requirements above are met, series.str behaves much like a string: it can be sliced, indexed by position, used with string methods, and even used directly with methods from the re regular-expression library.

In [35]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
    ...:         'Rob': 'rob@gmail.com', 'Wes': np.nan}
    ...:

In [37]: data = pd.Series(data)

In [38]: data
Out[38]:
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [39]: data.str.contains('gmail')
Out[39]:
Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
    
In [40]: pattern = r'[A-Z0-9._%+-]+@[A-Z0-9._]+\.[A-Z]{2,4}'

In [41]: import re

In [42]: data.str.findall(pattern,flags=re.IGNORECASE)  # calls the re library's findall directly
Out[42]:
Dave     [dave@google.com]
Steve    [steve@gmail.com]
Rob        [rob@gmail.com]
Wes                    NaN
dtype: object
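The slicing and positional indexing mentioned above can be sketched on the same email data:

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# .str supports positional indexing and slicing; NaN propagates through.
first_chars = data.str[0]
first_five = data.str[:5]
print(first_five)
```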
3.3.1 List of pandas vectorized string methods