文章目录




1. pandas介绍

Pandas 的数据结构:Pandas 主要有 Series(一维数组),DataFrame(二维数组),Panel(三维数组),Panel4D(四维数组),PanelND(更多维数组)等数据结构。其中 Series 和 DataFrame 应用的最为广泛。
Pandas 常用的数据结构有两种:​​​Series​​​ 和 ​​DataFrame​​。这些数据结构构建在 Numpy 数组之上,这意味着它们效率很高。

python版本:3.5>=python

导入相关库

import numpy as np
import pandas as pd

2. 方法

2.1 Series

​Series​​ 是一个带有 名称 和索引的一维数组,既然是数组,肯定要说到的就是数组中的元素类型,在 Series 中包含的数据类型可以是整数、浮点、字符串、Python对象等。

2.1.1 传递list对象创建一个 Series

pandas会默认创建整型索引

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
s= pd.Series([1,3,5,np.nan,6,8])
print(s)

//输出

0    1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64

2.1.2 传递numpy array时间索引以及列标签来创建DataFrame

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
s= pd.Series([1,3,5,np.nan,6,8])
# print(s)
dates = pd.date_range('20170801',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
print(df)

//

A         B         C         D
2017-08-01 -0.352464 0.207259 -0.379499 -0.040017
2017-08-02 -1.400304 -1.609741 1.149181 0.353892
2017-08-03 -0.244790 0.343552 -2.269876 -0.832983
2017-08-04 0.817258 -0.256027 0.917888 -0.240781
2017-08-05 0.312557 -1.103468 1.140370 -0.097515
2017-08-06 -0.945031 0.178114 0.099680 -0.993974

2.1.3 传递被转换成类似序列结构的字典对象来创建DataFrame

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df2 = pd.DataFrame({
'A': 1,
'B':pd.Timestamp('20170801'),
'C':pd.Series(1,index=list(range(4)),dtype='float64'),
'D':np.array([3]*4,dtype='int64'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'fooo'
})




print(df2)

//

A          B    C  D      E     F
0 1 2017-08-01 1.0 3 test fooo
1 1 2017-08-01 1.0 3 train fooo
2 1 2017-08-01 1.0 3 test fooo
3 1 2017-08-01 1.0 3 train fooo

查看不同列的数据类型

print(df2.dtype)

//

A             int64
B datetime64[ns]
C float64
D int64
E category
F object
dtype: object

2.2 查看数据

2.2.1 查看frame中头部和尾部的行

df.head(3)
df.tail(3)

2.2.2 显示索引、列和底层的numpy数据。

print(df.index)
print(df.columns)
print(df.values)

//

DatetimeIndex(['2017-08-01', '2017-08-02', '2017-08-03', '2017-08-04',
'2017-08-05', '2017-08-06'],
dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[-0.99340137 -0.12548833 -1.08581538 1.17112124]
[ 0.29427792 0.20182606 -0.84427908 -1.42783808]
[-0.06528316 1.55255719 1.15818104 0.05286832]
[ 1.49223138 -1.11517738 2.01386582 0.3496034 ]
[ 0.23697223 0.43241392 -0.97061103 0.79105226]
[ 2.96518919 0.8754193 -1.26479748 0.29930427]]

2.2.3 describe()函数对数据快速的统计汇总:

print(describe())

//

A         B         C         D
count 6.000000 6.000000 6.000000 6.000000
mean 0.396478 0.089271 -0.006982 0.281260
std 1.342037 1.158896 0.513877 0.997442
min -1.395445 -1.287098 -1.047989 -1.062651
25% -0.390341 -0.907016 0.116073 -0.073818
50% 0.210131 0.248303 0.193435 0.058509
75% 1.401632 0.856122 0.235431 0.655625
max 2.145753 1.559510 0.279513 1.896458

2.2.4 对数据的转置,按轴进行排序,按值进行排序

df.T    
df.sort_index(axis=1,ascending=False)
df.sort(columns='B')

2.3 选择

虽然标准的python/numpy的选择和设置表达式都能够直接派上用场,但是作为工程使用的代码,我们推荐使用经过优化的pandas数据访问 方式:.at .iat .iloc .loc .ix
获取:

2.3.1 选择一个单独的列,这将会返回一个Series,等同于df.A

print(df['A'])

//

2017-08-01    0.559156
2017-08-02 -1.381263
2017-08-03 0.127286
2017-08-04 0.771317
2017-08-05 0.763690
2017-08-06 -0.123249
Freq: D, Name: A, dtype: float64

2.3.2 通过多个标签在多个轴上进行选择 用loc 两个中括号

print(df.loc[:,['A','B']])

//

A         B
2017-08-01 0.040603 1.519448
2017-08-02 -2.429148 0.874873
2017-08-03 -0.315154 0.285603
2017-08-04 0.878968 -0.130717
2017-08-05 0.201083 -0.280419
2017-08-06 1.089989 0.754818

2.3.3 标签切片,类似于2

print(df.loc['20170802':'20170804',['A','B']])

//

A         B
2017-08-02 1.923047 0.324715
2017-08-03 -0.349817 0.085359
2017-08-04 -0.675602 1.166312

2.3.4 对于返回的对象进行维度缩减

print(df.loc['20170803',['A','B']])

//

A   -1.133357
B -1.008605
Name: 2017-08-03 00:00:00, dtype: float64

2.3.5 获取一个标量以及快速访问一个标量。(意思等价)

df.loc(dates[0],'A')
df.at[dates[0],'A']

2.4 通过位置选择

2.4.1 通过传递数值进行 位置选择(选择的是行)

print(df)
print(df.iloc[3])# 取第四行的数据

//

2017-08-01  0.150731 -0.167577  1.524639 -0.324676
2017-08-02 -0.118888 -0.100348 0.666961 0.618953
2017-08-03 0.634666 -0.937274 0.280727 -2.504798
2017-08-04 0.395685 -0.650699 -0.573055 -2.597323
2017-08-05 1.403470 0.636823 0.289136 1.294045
2017-08-06 -1.337154 1.718215 -0.093115 -1.529378
A 0.395685
B -0.650699
C -0.573055
D -2.597323
Name: 2017-08-04 00:00:00, dtype: float64

2.4.1 通过数值进行切片

与numpy /python 中的情况类似

print(df.iloc[3:5,0:2])

//

A         B
2017-08-04 1.064645 1.346498
2017-08-05 -0.253172 0.713125
2.4.2 通过指定一个位置的列表

与numpy/python 中的情况类似

print(df.iloc[[1,2,4],[0,2]])

//

A         C
2017-08-02 -0.987798 -0.081209
2017-08-03 0.039192 -0.244556
2017-08-05 -0.043044 -0.337542
2.4.3 对行和列进行切片
print(df.iloc[1:3,:])#对行进行切片
print(df.iloc[:,1:3]) #对列进行切片

//

A         B         C         D
2017-08-02 0.598266 0.410904 -0.587577 0.979511
2017-08-03 -0.206643 -0.859047 -0.110966 -0.696556
B C
2017-08-01 -0.341683 -0.378477
2017-08-02 0.410904 -0.587577
2017-08-03 -0.859047 -0.110966
2017-08-04 -0.486971 0.248343
2017-08-05 2.641393 -1.229395
2017-08-06 1.023585 0.630160

2.4.4 获取特定的值

print(df.iloc[1,1])
print(df.iat[1,1])

//

0.426774827739
0.426774827739

2.5 布尔索引

2.5.1 使用一个单独列的值来选择数据

print(df[df.A > 0])

//

A         B         C         D
2017-08-01 1.113008 -1.663732 -0.321016 0.079169
2017-08-03 0.000043 0.217510 0.929183 0.687162
2017-08-04 0.167007 -2.045754 0.318775 -0.168735
2017-08-05 0.721842 0.982646 0.367562 0.468250
2017-08-06 0.853404 -0.344893 2.266556 -0.304485

2.5.2 使用where操作来选择数据

print(df[df>0])

//

A         B         C         D
2017-08-01 0.183377 NaN 0.411767 0.535607
2017-08-02 0.169412 1.331423 0.815747 NaN
2017-08-03 0.807875 NaN NaN NaN
2017-08-04 0.559794 NaN 1.384336 0.011442
2017-08-05 0.250613 NaN NaN NaN
2017-08-06 NaN NaN NaN NaN

2.5.3 使用isin()方法来过滤

df2 = df.copy()
df2['E'] = ['one','one','two','three','four','three']
print(df2)
print(df2[df2['E'].isin(['one','four'])])

//

A         B         C         D      E
2017-08-01 0.812330 1.333924 -1.267516 0.536965 one
2017-08-02 0.775528 1.071335 -1.344108 2.075409 one
2017-08-03 -0.826466 2.613913 0.153449 0.475277 two
2017-08-04 -0.190586 -0.613647 -0.044821 -1.224435 three
2017-08-05 -0.025830 -0.432906 0.990122 0.583328 four
2017-08-06 1.258012 -1.513605 2.128874 0.615266 three
A B C D E
2017-08-01 0.812330 1.333924 -1.267516 0.536965 one
2017-08-02 0.775528 1.071335 -1.344108 2.075409 one
2017-08-05 -0.025830 -0.432906 0.990122 0.583328 four

2.6 设置

2.6.1 设置新的列

s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20170801',periods=6))
print(s1)

//

2017-08-01    1
2017-08-02 2
2017-08-03 3
2017-08-04 4
2017-08-05 5
2017-08-06 6
Freq: D, dtype: int64

2.6.2 通过标签设置新的值

pdf.at[dates[0],'A']=0

2.6.3 通过位置设置新的值:

df.iat[0,1]=0

2.6.4 通过一个numpy数组设置一组新值

df.loc[:'D'] = np.array([5]*len(df))

2.6.5 通过where操作来设置新的值

df2 =df.copy()
df2[df2>0]=-df2
print(df2)

//

A         B         C         D
2017-08-01 -0.994060 -0.915353 -0.103166 -0.400363
2017-08-02 -0.483559 -0.775021 -0.549397 -0.525969
2017-08-03 -1.394320 -1.430371 -1.592913 -0.479944
2017-08-04 -0.164794 -0.728599 -0.187677 -1.950128
2017-08-05 -0.827711 -0.678634 -1.089334 -1.224033
2017-08-06 -0.159199 -0.204388 -0.697539 -0.193927

2.7 缺失值的处理

在pandas中,使用np.nan来代替缺失值, 这些值将默认不会包含在计算中,

2.7.1 reindex()方法对指定轴上的索引进行改变/增加、删除

这将返回原始数据的一个拷贝

df2 = df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
df2.loc[0:2,'E'] = 1
print(df2)

//

A         B         C         D    E
2017-08-01 -0.631494 -0.427917 -2.263502 0.499206 1.0
2017-08-02 -0.363730 -1.641072 0.251015 1.127523 1.0
2017-08-03 1.284590 -0.249052 -0.410623 0.304096 NaN
2017-08-04 1.166190 1.371307 -0.636533 0.959298 NaN

2.7.2 去掉包含缺失值的行

df2 = df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
df2.loc[0:2,'E'] = 1
print(df2)
print(df2.dropna(how='any'))

//

A         B         C         D    E
2017-08-01 0.844367 0.282881 -0.596537 1.106986 1.0
2017-08-02 -0.297101 -0.935122 0.100942 2.362901 1.0
2017-08-03 0.775431 0.187812 1.244166 1.549058 NaN
2017-08-04 0.146374 0.604248 -0.275827 -1.339286 NaN
A B C D E
2017-08-01 0.844367 0.282881 -0.596537 1.106986 1.0
2017-08-02 -0.297101 -0.935122 0.100942 2.362901 1.0

2.7.3 对缺失值进行补充

print(df2.fillna(value=5))

//

A         B         C         D    E
2017-08-01 0.134828 -0.409695 0.398769 -0.607453 1.0
2017-08-02 0.461754 -0.555881 1.073785 2.268998 1.0
2017-08-03 2.494864 0.894457 -1.012622 0.225400 5.0
2017-08-04 -0.789087 -0.632234 -1.368574 -1.554802 5.0

2.7.4 对数据进行布尔填充

print(pd.isnull(df2))

//

A      B      C      D      E
2017-08-01 False False False False False
2017-08-02 False False False False False
2017-08-03 False False False False True
2017-08-04 False False False False True

2.8 相关操作

2.8.1 执行描述性统计

print(df)
print(df.mean())
print(df.mean(1))

//

A         B         C         D
2017-08-01 -0.211556 0.151164 0.827327 -1.055808
2017-08-02 -0.181771 -0.388966 2.045035 0.118686
2017-08-03 -1.790196 -1.234819 0.432210 0.390979
2017-08-04 0.240327 0.018650 0.558570 0.044662
2017-08-05 -0.336993 -0.496467 2.512642 0.198416
2017-08-06 0.824033 -0.083081 0.021884 3.237294
A -0.242693
B -0.338920
C 1.066278
D 0.489038
dtype: float64
2017-08-01 -0.072218
2017-08-02 0.398246
2017-08-03 -0.550457
2017-08-04 0.215553
2017-08-05 0.469400
2017-08-06 1.000032
Freq: D, dtype: float64

对于拥有不同维度,需要对齐的对象进行操作pandas会自动的沿着指定的维度进行广播:

s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
print(s)

//

2017-08-01    NaN
2017-08-02 NaN
2017-08-03 1.0
2017-08-04 3.0
2017-08-05 5.0
2017-08-06 NaN
Freq: D, dtype: float64
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
print(s)
print(df.sub(s,axis='index'))

//

2017-08-01    NaN
2017-08-02 NaN
2017-08-03 1.0
2017-08-04 3.0
2017-08-05 5.0
2017-08-06 NaN
Freq: D, dtype: float64
A B C D
2017-08-01 NaN NaN NaN NaN
2017-08-02 NaN NaN NaN NaN
2017-08-03 -1.830701 0.504959 -0.709195 -0.847686
2017-08-04 -3.172816 -2.310923 -1.233996 -1.966726
2017-08-05 -4.368654 -4.300136 -3.945498 -3.271250
2017-08-06 NaN NaN NaN NaN
2.8.2 对数据应用函数
print(df)
print(df.apply(np.cumsum))# 累计求和

//

A         B         C         D
2017-08-01 2.643719 -1.955916 -0.625852 -0.200130
2017-08-02 0.510827 -1.552670 -0.669396 1.146157
2017-08-03 -0.168740 0.055871 -0.108548 1.198155
2017-08-04 -1.524010 0.844605 -0.157555 1.387963
2017-08-05 0.220436 -1.345371 0.038241 -0.055306
2017-08-06 -1.299420 -1.663129 -2.586310 1.218230
A B C D
2017-08-01 2.643719 -1.955916 -0.625852 -0.200130
2017-08-02 3.154545 -3.508586 -1.295248 0.946027
2017-08-03 2.985805 -3.452715 -1.403796 2.144182
2017-08-04 1.461795 -2.608110 -1.561352 3.532144
2017-08-05 1.682231 -3.953481 -1.523111 3.476838
2017-08-06 0.382811 -5.616610 -4.109421 4.695068
print(df.apply(lambda x:x.max() - x.min()))

//

A    1.510451
B 3.455243
C 1.772878
D 2.703384
dtype: float64

4、value_counts()

s = pd.Series(np.random.randint(0,7,size=10))
print(s)
print(s.value_counts())
print(type(s.value_counts()))

print(pd.value_counts(s))

//

0    1
1 1
2 3
3 5
4 2
5 4
6 2
7 0
8 2
9 6
dtype: int32
2 3
1 2
6 1
5 1
3 1
0 1
dtype: int64
<class 'pandas.core.series.Series'>
2 3
1 2
6 1
5 1
4 1
3 1
0 1
dtype: int64

2.9 字符串方法

Series对象在其str属性中配备了一组字符串处理方法,可以很容易的应用到数组中的每个元素,如下段代码所示。

s.str.lower()

#将s Series 字符串变成小写。

2.10 合并

Pandas 提供了大量的方法能够轻松的对Series,DataFrame和Panel对象进行各种符合各种逻辑关系的合并 。

2.10.1 concat

df = pd.DataFrame(np.random.randn(10,4))
print("is--",df)
pieces = [df[:3],df[3:7],df[7:]]
df1 = pd.concat(pieces)
print("is==",df1)

//

is--           0         1         2         3
0 -1.033090 -0.884844 -0.868251 0.504889
1 0.264918 -2.486587 -1.985965 -1.773131
2 -1.056175 -0.647278 -2.123045 0.292224
3 0.664369 -0.185612 -0.628173 0.214924
4 0.740074 0.515744 -2.547639 -0.116405
5 -0.578900 -0.966765 0.455904 -0.623074
6 -2.000682 1.712143 0.532171 0.807520
7 -1.604986 1.015452 2.270513 0.734573
8 -0.817223 0.211418 0.543868 0.930036
9 -0.288155 -1.231245 -1.162904 0.372432
is== 0 1 2 3
0 -1.033090 -0.884844 -0.868251 0.504889
1 0.264918 -2.486587 -1.985965 -1.773131
2 -1.056175 -0.647278 -2.123045 0.292224
3 0.664369 -0.185612 -0.628173 0.214924
4 0.740074 0.515744 -2.547639 -0.116405
5 -0.578900 -0.966765 0.455904 -0.623074
6 -2.000682 1.712143 0.532171 0.807520
7 -1.604986 1.015452 2.270513 0.734573
8 -0.817223 0.211418 0.543868 0.930036
9 -0.288155 -1.231245 -1.162904 0.372432

####Join类似与SQL类型的合并

left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
right = pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})
print(left)
print(right)
print(pd.merge(left,right,on='key))

//

key  lval
0 foo 1
1 foo 2
key rval
0 foo 4
1 foo 5
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5

2.10.2 append将一行连接到DataFrame

print(df)
s =df.iloc[3]
print(df.append(s,ignore_index = True))

//

A         B         C         D
2017-08-01 0.040061 -0.552461 -1.442001 -0.443429
2017-08-02 1.850017 -0.526213 0.548660 -0.760185
2017-08-03 1.233839 -0.338561 -1.441615 0.008451
2017-08-04 -0.622002 -1.308038 -0.926464 -0.890123
2017-08-05 -0.271971 0.994797 -0.806631 -0.191609
2017-08-06 0.078366 -0.014979 0.603089 -1.145712
A B C D
0 0.040061 -0.552461 -1.442001 -0.443429
1 1.850017 -0.526213 0.548660 -0.760185
2 1.233839 -0.338561 -1.441615 0.008451
3 -0.622002 -1.308038 -0.926464 -0.890123
4 -0.271971 0.994797 -0.806631 -0.191609
5 0.078366 -0.014979 0.603089 -1.145712
6 -0.622002 -1.308038 -0.926464 -0.890123

2.11 分组

对于”group by”操作,我们通常是指以下一个或多个操作步骤:

  • (Splitting)按照一些规则将数据分为不同的组;
  • (Applying)对于每组数据分别执行一个函数;
  • (Combining)将结果组合到一个数据结构中;

1、 分组并对每个分组执行sum函数:
2、 通过多个列进行分组形成一个层次索引,然后执行函数:


python 【模块】pandas数组处理详解_python