Table of Contents
- 1. Handling Missing Values
- 1.1 dropna(how, axis=0, thresh)
- 1.2 fillna(value, method, axis, inplace, limit)
- 2. Data Transformation
- 2.1 Removing Duplicates
- 2.2 Transforming Data with a Function or Mapping
- 2.3 Replacing Values
- 2.4 Renaming Axis Indexes
- 2.5 Discretization and Binning
- 2.6 Detecting and Filtering Outliers
- 2.7 Random Sampling
- 2.8 Dummy Variables
- 3 String Manipulation
- 3.1 Basic String Operations
- 3.2 Regular Expressions
- 3.3 Vectorized String Functions in pandas
- 3.3.1 List of pandas Vectorized String Methods
1. Handling Missing Values
- The default missing value in pandas is displayed as NaN
- The default missing value in NumPy is displayed as nan
- NaN and nan are the same thing; it is simply displayed as NaN in pandas and as nan in NumPy
- NaN and nan are both floating-point values
- Both None and NaN are treated as missing (NA) values by pandas
- NA stands for "not available"; in statistics, NA can mean data that does not exist, or data that exists but is not observable
In [4]: string_data = pd.DataFrame([None,pd.NA,np.nan,'','zhangfan'])
In [5]: string_data.isnull()
Out[5]:
0
0 True
1 True
2 True
3 False
4 False
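The detection behavior above can be verified with a short, self-contained sketch (the sample values are just illustrative):

```python
import numpy as np
import pandas as pd

# None, pd.NA and np.nan are all detected as missing by isnull();
# the empty string '' is NOT treated as missing
s = pd.Series([None, pd.NA, np.nan, '', 'zhangfan'])
mask = s.isnull()
print(mask.tolist())
```

Note that `isna()` is an alias of `isnull()`, and `notnull()`/`notna()` give the inverted mask.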
1.1 dropna(how, axis=0, thresh)
series.dropna() or df.dropna()
- By default, with no arguments, dropna() drops every row that contains a missing value
- dropna(how='all') drops only the rows in which all values are NA
- dropna(thresh=x) keeps the rows that have at least x non-NA values
In [19]: df1
Out[19]:
0 1 2 3
0 -0.935335 NaN NaN 0.402039
1 0.265103 NaN NaN -0.229446
2 0.084044 NaN -0.385596 -0.460048
3 0.043280 NaN 1.498160 -0.292045
4 0.209144 NaN -1.310729 -0.376214
In [20]: df1.dropna()#全部行被删除
Out[20]:
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
In [21]: df1.dropna(how='all',axis='columns')#drop columns whose values are all NA
Out[21]:
0 2 3
0 -0.935335 NaN 0.402039
1 0.265103 NaN -0.229446
2 0.084044 -0.385596 -0.460048
3 0.043280 1.498160 -0.292045
4 0.209144 -1.310729 -0.376214
In [23]: df1.dropna(thresh=3)
Out[23]:
0 1 2 3
2 0.084044 NaN -0.385596 -0.460048
3 0.043280 NaN 1.498160 -0.292045
4 0.209144 NaN -1.310729 -0.376214
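The three variants above can be condensed into a small runnable sketch (the 2x3 frame below is made up for illustration):

```python
import numpy as np
import pandas as pd

# column 1 is entirely NaN; both rows contain at least one NaN
df = pd.DataFrame({0: [1.0, 2.0], 1: [np.nan, np.nan], 2: [np.nan, 3.0]})

# default dropna(): every row containing a NaN is removed
assert df.dropna().empty

# how='all' with axis='columns': only all-NaN columns are removed
assert list(df.dropna(how='all', axis='columns').columns) == [0, 2]

# thresh=2: keep only rows with at least 2 non-NaN values
assert list(df.dropna(thresh=2).index) == [1]
```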
1.2 fillna(value, method, axis, inplace, limit)
df.fillna() or series.fillna()
- value: a scalar or dict-like object used to fill in the missing values
- method: the interpolation method, 'ffill' (forward fill) or 'bfill' (backward fill)
- axis: the axis to fill along
- inplace: bool, whether to modify the original object in place
- limit: for forward/backward filling, the maximum number of consecutive values to fill
In [27]: df1
Out[27]:
0 1 2 3
0 -0.935335 NaN NaN 0.402039
1 0.265103 NaN NaN -0.229446
2 0.084044 NaN -0.385596 -0.460048
3 0.043280 NaN 1.498160 -0.292045
4 0.209144 NaN -1.310729 -0.376214
In [28]: df1.fillna(0)
Out[28]:
0 1 2 3
0 -0.935335 0.0 0.000000 0.402039
1 0.265103 0.0 0.000000 -0.229446
2 0.084044 0.0 -0.385596 -0.460048
3 0.043280 0.0 1.498160 -0.292045
4 0.209144 0.0 -1.310729 -0.376214
In [29]: df1.fillna({1:1,2:2})#pass a dict to fill each column with a different value
Out[29]:
0 1 2 3
0 -0.935335 1.0 2.000000 0.402039
1 0.265103 1.0 2.000000 -0.229446
2 0.084044 1.0 -0.385596 -0.460048
3 0.043280 1.0 1.498160 -0.292045
4 0.209144 1.0 -1.310729 -0.376214
In [38]: df2 ##a different DataFrame
Out[38]:
0 1 2 3
0 -0.935335 10.0 10.000000 0.402039
1 0.265103 NaN NaN -0.229446
2 0.084044 NaN -0.385596 -0.460048
3 0.043280 NaN 1.498160 -0.292045
4 0.209144 NaN -1.310729 -0.376214
In [39]: df2.fillna(method='ffill')
Out[39]:
0 1 2 3
0 -0.935335 10.0 10.000000 0.402039
1 0.265103 10.0 10.000000 -0.229446
2 0.084044 10.0 -0.385596 -0.460048
3 0.043280 10.0 1.498160 -0.292045
4 0.209144 10.0 -1.310729 -0.376214
In [40]: df2.fillna(axis=0,method='ffill')
Out[40]:
0 1 2 3
0 -0.935335 10.0 10.000000 0.402039
1 0.265103 10.0 10.000000 -0.229446
2 0.084044 10.0 -0.385596 -0.460048
3 0.043280 10.0 1.498160 -0.292045
4 0.209144 10.0 -1.310729 -0.376214
In [41]: df2.fillna(axis=1,method='ffill')
Out[41]:
0 1 2 3
0 -0.935335 10.000000 10.000000 0.402039
1 0.265103 0.265103 0.265103 -0.229446
2 0.084044 0.084044 -0.385596 -0.460048
3 0.043280 0.043280 1.498160 -0.292045
4 0.209144 0.209144 -1.310729 -0.376214
In [43]: df2.fillna(method='ffill',limit=2)#forward-fill at most two consecutive missing values per column
Out[43]:
0 1 2 3
0 -0.935335 10.0 10.000000 0.402039
1 0.265103 10.0 10.000000 -0.229446
2 0.084044 10.0 -0.385596 -0.460048
3 0.043280 NaN 1.498160 -0.292045
4 0.209144 NaN -1.310729 -0.376214
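The `limit` behavior is easy to check on a tiny Series. Note that on recent pandas versions `fillna(method=...)` is deprecated; `s.ffill(limit=...)` is the equivalent modern spelling used in this sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, np.nan])

# ffill(limit=2) == fillna(method='ffill', limit=2):
# only the first two consecutive NaNs are filled forward
filled = s.ffill(limit=2)
print(filled.tolist())
```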
2. Data Transformation
2.1 Removing Duplicates
- df.duplicated() returns a boolean Series indicating whether each row is a duplicate (i.e., identical to a row that appeared earlier)
- df.drop_duplicates() can be called with no arguments, or with a list of column names, in which case rows are considered duplicates based only on those columns
- df.drop_duplicates() keeps the first occurrence by default; pass keep='last' to keep the last occurrence instead
In [46]: data
Out[46]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
In [47]: data.duplicated()
Out[47]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
dtype: bool
In [48]: data.drop_duplicates()
Out[48]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
In [49]: data.drop_duplicates(['k1','k2'])
Out[49]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
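The subset and `keep='last'` options can be exercised on a smaller, illustrative frame:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two', 'one'],
                     'k2': [1, 1, 1],
                     'v1': [0, 1, 2]})

# no full row repeats, but 'one' repeats in column k1
assert data.duplicated().tolist() == [False, False, False]
assert data.duplicated(['k1']).tolist() == [False, False, True]

# deduplicate on k1 only, keeping the LAST occurrence of each key
kept = data.drop_duplicates(['k1'], keep='last')
assert list(kept['v1']) == [1, 2]
```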
2.2 Transforming Data with a Function or Mapping
Example:
In [50]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
...: 'Pastrami', 'corned beef', 'Bacon',
...: 'pastrami', 'honey ham', 'nova lox'],
...: 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
...:
In [51]: data
Out[51]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
In [52]: meat_to_animal = {
...: 'bacon': 'pig',
...: 'pulled pork': 'pig',
...: 'pastrami': 'cow',
...: 'corned beef': 'cow',
...: 'honey ham': 'pig',
...: 'nova lox': 'salmon'
...: }
In [53]: lowercased = data['food'].str.lower()
    ...: data['animal'] = lowercased.map(meat_to_animal)#Series.map can also take a dict: each element of the Series is looked up as a key and replaced by the corresponding value
    ...: data
Out[53]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
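The lowercasing step and the dict lookup can also be fused into one chained expression, sketched here on a trimmed-down version of the data:

```python
import pandas as pd

food = pd.Series(['bacon', 'Bacon', 'nova lox'])
meat_to_animal = {'bacon': 'pig', 'nova lox': 'salmon'}

# normalize case, then map each value through the dict in one chain
animal = food.str.lower().map(meat_to_animal)
print(animal.tolist())
```

`map()` also accepts a function instead of a dict, e.g. `food.map(lambda x: meat_to_animal[x.lower()])` gives the same result.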
2.3 Replacing Values
df.replace()
In [54]: data = pd.Series([1,-999,2,-999,-1000,3])
In [55]: data.replace(-999,np.nan)#replace -999 with np.nan
Out[55]:
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
In [56]: data.replace([-999,-1000],np.nan)#replace both -999 and -1000 with np.nan
Out[56]:
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
In [57]: data.replace([-999,-1000],[np.nan,0])#replace -999 with np.nan and -1000 with 0
Out[57]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
In [59]: data
Out[59]:
0 1
1 -999
2 2
3 -999
4 -1000
5 3
dtype: int64
In [60]: data.replace({-999:np.nan,-1000:0})#replace -999 with np.nan and -1000 with 0
Out[60]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
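`replace()` works the same way on a DataFrame; this small sketch (data made up for illustration) applies the dict form across two columns at once:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, -999], 'b': [-1000, 2]})

# a dict maps each sentinel value to its own replacement in one call
out = df.replace({-999: np.nan, -1000: 0})
print(out)
```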
2.4 Renaming Axis Indexes
df.rename(index, columns), which also supports an inplace parameter
df.reindex() rebuilds the index (selecting and reordering index entries), whereas df.rename() only relabels the index without changing its order
In [61]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
    ...: index=['Ohio', 'Colorado', 'New York'],
    ...: columns=['one', 'two', 'three', 'four'])
...:
In [62]: data.rename(index=str.upper,columns=str.title)#functions relabel every entry; a special usage that is less common
Out[62]:
One Two Three Four
OHIO 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11
In [63]: data.rename(index = {'Ohio':'Indiana'},columns={'three':'peekaboo'})#renaming via dicts is the more common usage
Out[63]:
one two peekaboo four
Indiana 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
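A minimal, self-contained version of the dict-based rename (reusing the same state/column names on a smaller frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(2, 2),
                  index=['Ohio', 'Colorado'],
                  columns=['one', 'two'])

# only the labels listed in the dicts change; everything else is kept.
# rename() returns a new object unless inplace=True is passed.
out = df.rename(index={'Ohio': 'Indiana'}, columns={'two': 'peekaboo'})
print(out)
```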
2.5 Discretization and Binning
- pd.cut()
- pd.qcut()
Example:
In [64]: ages = [20,22,25,27,21,23,37,31,61,45,41,32]
In [65]: bins = [18,25,35,60,100]
In [66]: pd.cut(ages,bins)
Out[66]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
In [67]: pd.cut(ages,bins,right=False)#by default right=True, giving intervals open on the left and closed on the right, e.g. (18, 25]; right=False flips this to [18, 25)
Out[67]:
[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
In [68]: labels = ['年轻','中年','老年','晚年']
In [69]: pd.cut(ages,bins,right=False,labels=labels)#labels assigns an alias to each interval
Out[69]:
['年轻', '年轻', '中年', '中年', '年轻', ..., '中年', '晚年', '老年', '老年', '中年']
Length: 12
Categories (4, object): ['年轻' < '中年' < '老年' < '晚年']
In [70]: pd.cut(ages,4)#an integer can be passed directly to bin the data into that many equal-width intervals
Out[70]:
[(19.959, 30.25], (19.959, 30.25], (19.959, 30.25], (19.959, 30.25], (19.959, 30.25], ..., (30.25, 40.5], (50.75, 61.0], (40.5, 50.75], (40.5, 50.75], (30.25, 40.5]]
Length: 12
Categories (4, interval[float64]): [(19.959, 30.25] < (30.25, 40.5] < (40.5, 50.75] < (50.75, 61.0]]
In [71]: pd.cut(ages,4,precision=2)#precision=2 keeps two decimal places for the bin edges
Out[71]:
[(19.96, 30.25], (19.96, 30.25], (19.96, 30.25], (19.96, 30.25], (19.96, 30.25], ..., (30.25, 40.5], (50.75, 61.0], (40.5, 50.75], (40.5, 50.75], (30.25, 40.5]]
Length: 12
Categories (4, interval[float64]): [(19.96, 30.25] < (30.25, 40.5] < (40.5, 50.75] < (50.75, 61.0]]
Binning by quantiles:
In [64]: ages = [20,22,25,27,21,23,37,31,61,45,41,32]
In [73]: pd.qcut(ages,4)#split the data into four bins at the quartiles
Out[73]:
[(19.999, 22.75], (19.999, 22.75], (22.75, 29.0], (22.75, 29.0], (19.999, 22.75], ..., (29.0, 38.0], (38.0, 61.0], (38.0, 61.0], (38.0, 61.0], (29.0, 38.0]]
Length: 12
Categories (4, interval[float64]): [(19.999, 22.75] < (22.75, 29.0] < (29.0, 38.0] < (38.0, 61.0]]
In [74]: cats = pd.qcut(ages,4)
In [75]: pd.value_counts(cats)
Out[75]:
(38.0, 61.0] 3
(29.0, 38.0] 3
(22.75, 29.0] 3
(19.999, 22.75] 3
In [77]: x = pd.qcut(ages,[0.1,0.2,0.5,0.7,1])#a list of quantiles can also be passed; values outside the outermost quantiles become NaN
In [78]: x
Out[78]:
[NaN, (21.099, 22.2], (22.2, 29.0], (22.2, 29.0], NaN, ..., (29.0, 35.5], (35.5, 61.0], (35.5, 61.0], (35.5, 61.0], (29.0, 35.5]]
Length: 12
Categories (4, interval[float64]): [(21.099, 22.2] < (22.2, 29.0] < (29.0, 35.5] < (35.5, 61.0]]
In [79]: pd.value_counts(x)
Out[79]:
(35.5, 61.0] 4
(22.2, 29.0] 3
(29.0, 35.5] 2
(21.099, 22.2] 1
dtype: int64
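The defining property of qcut, equal-sized bins, can be asserted directly on the ages list above:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# qcut splits at the sample quantiles, so with 12 values and 4 bins
# every bin receives exactly 3 observations
cats = pd.qcut(ages, 4)
counts = pd.Series(cats).value_counts()
print(sorted(counts.tolist()))
```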
2.6 Detecting and Filtering Outliers
np.sign(data) returns 1, -1, or 0 according to the sign of each value.
In [80]: data = np.random.randn(4,3)
In [81]: np.sign(data)
Out[81]:
array([[ 1., 1., 1.],
[ 1., -1., 1.],
[ 1., -1., -1.],
[-1., 1., 1.]])
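A common pattern that puts np.sign to work (sketched here under the assumption of standard-normal data; the threshold 3 is just a conventional choice) is to cap outliers at a fixed magnitude while preserving their sign:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# any value whose magnitude exceeds 3 is clipped to +3 or -3,
# with np.sign supplying the correct sign for each cell
data[data.abs() > 3] = np.sign(data) * 3
print(float(data.abs().max().max()))
```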
2.7 Random Sampling
df.sample() or series.sample()
In [90]: data.sample(3)#sampling is without replacement by default, so the sample size cannot exceed the number of rows
Out[90]:
0 1 2
0 0.637001 1.817168 0.179284
3 -0.308768 0.386770 0.247441
1 1.188007 -0.032808 1.389581
In [91]: data.sample(n=10,replace=True)#replace=True allows sampling with replacement, so the sample size may exceed the number of rows
Out[91]:
0 1 2
0 0.637001 1.817168 0.179284
0 0.637001 1.817168 0.179284
3 -0.308768 0.386770 0.247441
3 -0.308768 0.386770 0.247441
0 0.637001 1.817168 0.179284
3 -0.308768 0.386770 0.247441
1 1.188007 -0.032808 1.389581
3 -0.308768 0.386770 0.247441
2 0.957482 -1.103823 -0.124816
3 -0.308768 0.386770 0.247441
In [92]: data.sample(5,axis=1,replace=True)
Out[92]:
1 0 2 1 0
0 1.817168 0.637001 0.179284 1.817168 0.637001
1 -0.032808 1.188007 1.389581 -0.032808 1.188007
2 -1.103823 0.957482 -0.124816 -1.103823 0.957482
3 0.386770 -0.308768 0.247441 0.386770 -0.308768
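For reproducible experiments, `sample()` also accepts a `random_state` seed; a minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': range(4)})

# with replacement, n may exceed the 4 available rows;
# random_state pins the draw so reruns give identical samples
s = df.sample(n=10, replace=True, random_state=0)
print(len(s), sorted(set(s['a'])))
```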
2.8 Dummy Variables
pd.get_dummies()
In [94]: df = pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
In [95]: df
Out[95]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
In [96]: pd.get_dummies(df['key'],prefix='key')#the prefix parameter prepends a prefix to the dummy column names
Out[96]:
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
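A self-contained check of the same idea (note that pandas 2.0 and later return boolean dummy columns rather than 0/1 integers, so the assertions below avoid depending on the dtype):

```python
import pandas as pd

key = pd.Series(['b', 'b', 'a', 'c'])
d = pd.get_dummies(key, prefix='key')

# one indicator column per distinct value, in sorted order
print(list(d.columns))
# each row has exactly one "hot" entry
print(int(d['key_b'].sum()))
```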
3 String Manipulation
3.1 Basic String Operations
- split(): split a string
In [3]: val = 'zhang#fan'
In [4]: val.split('#')
Out[4]: ['zhang', 'fan']
- strip(): remove leading and trailing whitespace from a string
In [5]: val = ' zhangfan '
In [6]: val.strip()
Out[6]: 'zhangfan'
Correspondingly, rstrip() and lstrip() remove whitespace from the right and left side of a string, respectively.
In [7]: val.rstrip()
Out[7]: ' zhangfan'
In [8]: val.lstrip()
Out[8]: 'zhangfan '
- str.join(tuple or list)
The tuple or list passed to join() must contain only strings.
In [9]: ls = ['zhang','fan']
In [10]: '##'.join(ls)
Out[10]: 'zhang##fan'
- the in operator
In [11]: 'zhang' in 'zhangfan'
Out[11]: True
- string.index(substring)
Searches for substring within string; if found, returns the index of its first character, otherwise raises a ValueError.
In [12]: 'zhangfan'.index('fan')
Out[12]: 5
In [13]: 'zhangfan'.index('ff')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-9106a25b3729> in <module>()
----> 1 'zhangfan'.index('ff')
ValueError: substring not found
- string.find(substring)
Searches for substring within string; if found, returns the index of its first character, but returns -1 instead of raising an error when substring is not found (note the difference from index()).
- string.count(substring)
Counts the occurrences of substring in string.
In [3]: string_data = 'abcdadbvdfascxavgfasvdcx'
In [4]: string_data.count('a')
Out[4]: 5
- string.replace(old, new)
replace() substitutes every occurrence of old with new.
In [6]: string_data = "zhangsanzhangsanzhangsan"
In [7]: string_data.replace('san','fan')
Out[7]: 'zhangfanzhangfanzhangfan'
- string.endswith(substring)
Returns True if string ends with substring, otherwise False.
- string.startswith(substring)
Returns True if string starts with substring, otherwise False.
- string.rfind(substring)
Searches for the last occurrence of substring in string; returns the index of the first character of that occurrence, or -1 if substring is not found.
- string.lower()
Converts string to all lowercase.
- string.upper()
Converts string to all uppercase.
- string.ljust(width) and string.rjust(width)
- The integer argument width is the desired length of the result.
- If width is less than the length of string, the original string is returned unchanged; if width is greater, spaces are added before or after string until its length equals width.
- ljust() left-justifies, adding the padding spaces on the right; rjust() does the opposite.
In [8]: string_data
Out[8]: 'zhangsanzhangsanzhangsan'
In [9]: string_data.ljust(20)
Out[9]: 'zhangsanzhangsanzhangsan'
In [10]: string_data.ljust(30)
Out[10]: 'zhangsanzhangsanzhangsan      '
In [11]: string_data.rjust(30)
Out[11]: '      zhangsanzhangsanzhangsan'
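The basic operations above condense into a handful of assertable one-liners (sample strings reused from this section):

```python
val = ' zhang#fan '

assert val.strip() == 'zhang#fan'                        # trim both sides
assert val.strip().split('#') == ['zhang', 'fan']        # split on a delimiter
assert '##'.join(['zhang', 'fan']) == 'zhang##fan'       # join with a separator
assert 'zhang' in 'zhangfan'                             # substring membership
assert 'zhangfan'.find('ff') == -1                       # find() returns -1 instead of raising
assert 'zhangsan'.replace('san', 'fan') == 'zhangfan'    # replace all occurrences
assert 'abc'.ljust(5) == 'abc  '                         # pad with spaces on the right
assert 'abc'.rjust(5) == '  abc'                         # pad with spaces on the left
```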
3.2 Regular Expressions
Regular expression syntax
Usage of the methods in the re module
3.3 Vectorized String Functions in pandas
Vectorization: series.str
Things to note:
- Only Series objects have the pandas .str vectorized accessor; DataFrame objects do not
- The series.str accessor only works when the Series holds string data; if the Series holds any other data type, the vectorized string operations cannot be used
Counter-examples:
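A few representative re-module calls, sketched as a quick reference (the patterns and strings are illustrative only):

```python
import re

# split on runs of whitespace of any kind
assert re.split(r'\s+', 'foo bar\t baz') == ['foo', 'bar', 'baz']

# find every non-overlapping match of a pattern
assert re.findall(r'\d+', 'a1b22c333') == ['1', '22', '333']

# compiling a pattern once is cheaper when it is reused many times
pattern = re.compile(r'\d+')
assert pattern.sub('#', 'a1b22') == 'a#b#'
```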
In [22]: data = pd.DataFrame(np.random.randn(4,3))
In [23]: data
Out[23]:
0 1 2
0 1.737271 -0.653427 0.472769
1 1.651421 -0.592218 -0.427529
2 -1.115946 1.528793 -1.038506
3 -1.257965 -2.352095 0.376387
In [24]: data.str[0]#see the last line of the traceback: a DataFrame object has no str attribute
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-25-6cd9728cfd04> in <module>()
----> 1 data.str[0]
~\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py in __getattr__(self, name)
5137 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5138 return self[name]
-> 5139 return object.__getattribute__(self, name)
5140
5141 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'str'
In [25]: type(data[1])#a single column is a Series
Out[25]: pandas.core.series.Series
In [26]: data.dtypes#all columns are floating-point
Out[26]:
0 float64
1 float64
2 float64
dtype: object
In [27]: data[1].str[0]#see the last line of the traceback: .str only works on string values
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-27-50a2efbe44ba> in <module>()
----> 1 data[1].str[0]
###traceback truncated
AttributeError: Can only use .str accessor with string values!
Once both conditions above are satisfied, series.str behaves much like a plain string: it supports slicing and positional indexing, the usual string methods, and it can even call methods from the regular-expression module re directly.
In [35]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
...: 'Rob': 'rob@gmail.com', 'Wes': np.nan}
...:
In [37]: data = pd.Series(data)
In [38]: data
Out[38]:
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object
In [39]: data.str.contains('gmail')
Out[39]:
Dave False
Steve True
Rob True
Wes NaN
dtype: object
In [40]: pattern = r'[A-Z0-9._%+-]+@[A-Z0-9._]+\.[A-Z]{2,4}'
In [41]: import re
In [42]: data.str.findall(pattern,flags=re.IGNORECASE)#calls the re module's findall directly, passing re flags through
Out[42]:
Dave [dave@google.com]
Steve [steve@gmail.com]
Rob [rob@gmail.com]
Wes NaN
dtype: object
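A compact sketch of the NaN-safe behavior that makes the .str accessor useful (a two-entry subset of the email data above):

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Wes': np.nan})

# .str methods propagate missing values instead of raising
contains = data.str.contains('google')

# chained .str calls: split on '@', then index element 1 of each list
domains = data.str.split('@').str[1]
print(domains.tolist())
```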
3.3.1 List of pandas Vectorized String Methods