合并
连接
pandas提供各种工具以简便合并序列,数据桢,和组合对象, 在连接/合并类型操作中使用多种类型索引和相关数学函数.
请参阅合并部分
把pandas对象连接到一起
Python
In [ 73 ] : df = pd . DataFrame ( np . random . randn ( 10 , 4 ) )
In [ 74 ] : df
Out [ 74 ] :
0 1 2 3
0 - 0.548702 1.467327 - 1.015962 - 0.483075
1 1.637550 - 1.217659 - 0.291519 - 1.745505
2 - 0.263952 0.991460 - 0.919069 0.266046
3 - 0.709661 1.669052 1.037882 - 1.705775
4 - 0.919854 - 0.042379 1.247642 - 0.009920
5 0.290213 0.495767 0.362949 1.548106
6 - 1.131345 - 0.089329 0.337863 - 0.945867
7 - 0.932132 1.956030 0.017587 - 0.016692
8 - 0.575247 0.254161 - 1.143704 0.215897
9 1.193555 - 0.077118 - 0.408530 - 0.862495
# break it into pieces
In [ 75 ] : pieces = [ df [ : 3 ] , df [ 3 : 7 ] , df [ 7 : ] ]
In [ 76 ] : pd . concat ( pieces )
Out [ 76 ] :
0 1 2 3
0 - 0.548702 1.467327 - 1.015962 - 0.483075
1 1.637550 - 1.217659 - 0.291519 - 1.745505
2 - 0.263952 0.991460 - 0.919069 0.266046
3 - 0.709661 1.669052 1.037882 - 1.705775
4 - 0.919854 - 0.042379 1.247642 - 0.009920
5 0.290213 0.495767 0.362949 1.548106
6 - 1.131345 - 0.089329 0.337863 - 0.945867
7 - 0.932132 1.956030 0.017587 - 0.016692
8 - 0.575247 0.254161 - 1.143704 0.215897
9 1.193555 - 0.077118 - 0.408530 - 0.862495
连接
SQL样式合并. 请参阅 数据库style联接
Python
In [ 77 ] : left = pd . DataFrame ( { 'key' : [ 'foo' , 'foo' ] , 'lval' : [ 1 , 2 ] } )
In [ 78 ] : right = pd . DataFrame ( { 'key' : [ 'foo' , 'foo' ] , 'rval' : [ 4 , 5 ] } )
In [ 79 ] : left
Out [ 79 ] :
key lval
0 foo 1
1 foo 2
In [ 80 ] : right
Out [ 80 ] :
key rval
0 foo 4
1 foo 5
In [ 81 ] : pd . merge ( left , right , on = 'key' )
Out [ 81 ] :
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
添加
添加行到数据增. 参阅 添加
Python
In [ 82 ] : df = pd . DataFrame ( np . random . randn ( 8 , 4 ) , columns = [ 'A' , 'B' , 'C' , 'D' ] )
In [ 83 ] : df
Out [ 83 ] :
A B C D
0 1.346061 1.511763 1.627081 - 0.990582
1 - 0.441652 1.211526 0.268520 0.024580
2 - 1.577585 0.396823 - 0.105381 - 0.532532
3 1.453749 1.208843 - 0.080952 - 0.264610
4 - 0.727965 - 0.589346 0.339969 - 0.693205
5 - 0.339355 0.593616 0.884345 1.591431
6 0.141809 0.220390 0.435589 0.192451
7 - 0.096701 0.803351 1.715071 - 0.708758
In [ 84 ] : s = df . iloc [ 3 ]
In [ 85 ] : df . append ( s , ignore_index = True )
Out [ 85 ] :
A B C D
0 1.346061 1.511763 1.627081 - 0.990582
1 - 0.441652 1.211526 0.268520 0.024580
2 - 1.577585 0.396823 - 0.105381 - 0.532532
3 1.453749 1.208843 - 0.080952 - 0.264610
4 - 0.727965 - 0.589346 0.339969 - 0.693205
5 - 0.339355 0.593616 0.884345 1.591431
6 0.141809 0.220390 0.435589 0.192451
7 - 0.096701 0.803351 1.715071 - 0.708758
8 1.453749 1.208843 - 0.080952 - 0.264610
分组
对于“group by”指的是以下一个或多个处理
- 将数据按某些标准分割为不同的组
- 在每个独立组上应用函数
- 组合结果为一个数据结构
请参阅 分组部分
Python
In [ 86 ] : df = pd . DataFrame ( { 'A' : [ 'foo' , 'bar' , 'foo' , 'bar' ,
. . . . : 'foo' , 'bar' , 'foo' , 'foo' ] ,
. . . . : 'B' : [ 'one' , 'one' , 'two' , 'three' ,
. . . . : 'two' , 'two' , 'one' , 'three' ] ,
. . . . : 'C' : np . random . randn ( 8 ) ,
. . . . : 'D' : np . random . randn ( 8 ) } )
. . . . :
In [ 87 ] : df
Out [ 87 ] :
A B C D
0 foo one - 1.202872 - 0.055224
1 bar one - 1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three - 0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two - 0.392670 - 0.136473
6 foo one 0.007207 - 0.561757
7 foo three 1.928123 - 1.623033
分组然后应用函数统计总和存放到结果组
Python
In [ 88 ] : df . groupby ( 'A' ) . sum ( )
Out [ 88 ] :
C D
A
bar - 2.802588 2.42611
foo 3.146492 - 0.63958
按多列分组为层次索引,然后应用函数
Python
In [ 89 ] : df . groupby ( [ 'A' , 'B' ] ) . sum ( )
Out [ 89 ] :
C D
A B
bar one - 1.814470 2.395985
three - 0.595447 0.166599
two - 0.392670 - 0.136473
foo one - 1.195665 - 0.616981
three 1.928123 - 1.623033
two 2.414034 1.600434
重塑
堆叠
Python
In [ 90 ] : tuples = list ( zip ( * [ [ 'bar' , 'bar' , 'baz' , 'baz' ,
. . . . : 'foo' , 'foo' , 'qux' , 'qux' ] ,
. . . . : [ 'one' , 'two' , 'one' , 'two' ,
. . . . : 'one' , 'two' , 'one' , 'two' ] ] ) )
. . . . :
In [ 91 ] : index = pd . MultiIndex . from_tuples ( tuples , names = [ 'first' , 'second' ] )
In [ 92 ] : df = pd . DataFrame ( np . random . randn ( 8 , 2 ) , index = index , columns = [ 'A' , 'B' ] )
In [ 93 ] : df2 = df [ : 4 ]
In [ 94 ] : df2
Out [ 94 ] :
A B
first second
bar one 0.029399 - 0.542108
two 0.282696 - 0.087302
baz one - 1.575170 1.771208
two 0.816482 1.100230
堆叠
Python
In [ 95 ] : stacked = df2 . stack ( )
In [ 96 ] : stacked
Out [ 96 ] :
first second
bar one A 0.029399
B - 0.542108
two A 0.282696
B - 0.087302
baz one A - 1.575170
B 1.771208
two A 0.816482
B 1.100230
dtype : float64
多个索引作为索引), 其堆叠的反向操作是未堆栈, 上面的数据默认反堆叠到上一级别:
Python
In [ 97 ] : stacked . unstack ( )
Out [ 97 ] :
A B
first second
bar one 0.029399 - 0.542108
two 0.282696 - 0.087302
baz one - 1.575170 1.771208
two 0.816482 1.100230
In [ 98 ] : stacked . unstack ( 1 )
Out [ 98 ] :
second one two
first
bar A 0.029399 0.282696
B - 0.542108 - 0.087302
baz A - 1.575170 0.816482
B 1.771208 1.100230
In [ 99 ] : stacked . unstack ( 0 )
Out [ 99 ] :
first bar baz
second
one A 0.029399 - 1.575170
B - 0.542108 1.771208
two A 0.282696 0.816482
B - 0.087302 1.100230
数据透视表
查看数据透视表.
Python
In [ 100 ] : df = pd . DataFrame ( { 'A' : [ 'one' , 'one' , 'two' , 'three' ] * 3 ,
. . . . . : 'B' : [ 'A' , 'B' , 'C' ] * 4 ,
. . . . . : 'C' : [ 'foo' , 'foo' , 'foo' , 'bar' , 'bar' , 'bar' ] * 2 ,
. . . . . : 'D' : np . random . randn ( 12 ) ,
. . . . . : 'E' : np . random . randn ( 12 ) } )
. . . . . :
In [ 101 ] : df
Out [ 101 ] :
A B C D E
0 one A foo 1.418757 - 0.179666
1 one B foo - 1.879024 1.291836
2 two C foo 0.536826 - 0.009614
3 three A bar 1.006160 0.392149
4 one B bar - 0.029716 0.264599
5 one C bar - 1.146178 - 0.057409
6 two A foo 0.100900 - 1.425638
7 three B foo - 1.035018 1.024098
8 one C foo 0.314665 - 0.106062
9 one A bar - 0.773723 1.824375
10 two B bar - 1.170653 0.595974
11 three C bar 0.648740 1.167115
我们可以从此数据非常容易的产生数据透视表:
Python
In [ 102 ] : pd . pivot_table ( df , values = 'D' , index = [ 'A' , 'B' ] , columns = [ 'C' ] )
Out [ 102 ] :
C bar foo
A B
one A - 0.773723 1.418757
B - 0.029716 - 1.879024
C - 1.146178 0.314665
three A 1.006160 NaN
B NaN - 1.035018
C 0.648740 NaN
two A NaN 0.100900
B - 1.170653 NaN
C NaN 0.536826
时间序列
pandas有易用,强大且高效的函数用于高频数据重采样转换操作(例如,转换秒数据到5分钟数据), 这是很普遍的情况,但并不局限于金融应用, 请参阅时间序列章节
Python
In [ 103 ] : rng = pd . date_range ( '1/1/2012' , periods = 100 , freq = 'S' )
In [ 104 ] : ts = pd . Series ( np . random . randint ( 0 , 500 , len ( rng ) ) , index = rng )
In [ 105 ] : ts . resample ( '5Min' , how = 'sum' )
Out [ 105 ] :
2012 - 01 - 01 25083
Freq : 5T , dtype : int32
时区表示
Python
In [ 106 ] : rng = pd . date_range ( '3/6/2012 00:00' , periods = 5 , freq = 'D' )
In [ 107 ] : ts = pd . Series ( np . random . randn ( len ( rng ) ) , rng )
In [ 108 ] : ts
Out [ 108 ] :
2012 - 03 - 06 0.464000
2012 - 03 - 07 0.227371
2012 - 03 - 08 - 0.496922
2012 - 03 - 09 0.306389
2012 - 03 - 10 - 2.290613
Freq : D , dtype : float64
In [ 109 ] : ts_utc = ts . tz_localize ( 'UTC' )
In [ 110 ] : ts_utc
Out [ 110 ] :
2012 - 03 - 06 00 : 00 : 00 + 00 : 00 0.464000
2012 - 03 - 07 00 : 00 : 00 + 00 : 00 0.227371
2012 - 03 - 08 00 : 00 : 00 + 00 : 00 - 0.496922
2012 - 03 - 09 00 : 00 : 00 + 00 : 00 0.306389
2012 - 03 - 10 00 : 00 : 00 + 00 : 00 - 2.290613
Freq : D , dtype : float64
转换到其它时区
Python
In [ 111 ] : ts_utc . tz_convert ( 'US/Eastern' )
Out [ 111 ] :
2012 - 03 - 05 19 : 00 : 00 - 05 : 00 0.464000
2012 - 03 - 06 19 : 00 : 00 - 05 : 00 0.227371
2012 - 03 - 07 19 : 00 : 00 - 05 : 00 - 0.496922
2012 - 03 - 08 19 : 00 : 00 - 05 : 00 0.306389
2012 - 03 - 09 19 : 00 : 00 - 05 : 00 - 2.290613
Freq : D , dtype : float64
转换不同的时间跨度
Python
In [ 112 ] : rng = pd . date_range ( '1/1/2012' , periods = 5 , freq = 'M' )
In [ 113 ] : ts = pd . Series ( np . random . randn ( len ( rng ) ) , index = rng )
In [ 114 ] : ts
Out [ 114 ] :
2012 - 01 - 31 - 1.134623
2012 - 02 - 29 - 1.561819
2012 - 03 - 31 - 0.260838
2012 - 04 - 30 0.281957
2012 - 05 - 31 1.523962
Freq : M , dtype : float64
In [ 115 ] : ps = ts . to_period ( )
In [ 116 ] : ps
Out [ 116 ] :
2012 - 01 - 1.134623
2012 - 02 - 1.561819
2012 - 03 - 0.260838
2012 - 04 0.281957
2012 - 05 1.523962
Freq : M , dtype : float64
In [ 117 ] : ps . to_timestamp ( )
Out [ 117 ] :
2012 - 01 - 01 - 1.134623
2012 - 02 - 01 - 1.561819
2012 - 03 - 01 - 0.260838
2012 - 04 - 01 0.281957
2012 - 05 - 01 1.523962
Freq : MS , dtype : float64
转换时段并且使用一些运算函数, 下例中, 我们转换年报11月到季度结束每日上午9点数据
Python
In [ 118 ] : prng = pd . period_range ( '1990Q1' , '2000Q4' , freq = 'Q-NOV' )
In [ 119 ] : ts = pd . Series ( np . random . randn ( len ( prng ) ) , prng )
In [ 120 ] : ts . index = ( prng . asfreq ( 'M' , 'e' ) + 1 ) . asfreq ( 'H' , 's' ) + 9
In [ 121 ] : ts . head ( )
Out [ 121 ] :
1990 - 03 - 01 09 : 00 - 0.902937
1990 - 06 - 01 09 : 00 0.068159
1990 - 09 - 01 09 : 00 - 0.057873
1990 - 12 - 01 09 : 00 - 0.368204
1991 - 03 - 01 09 : 00 - 1.144073
Freq : H , dtype : float64
分类
数据桢中包含分类. 完整的文档, 请查看分类介绍 and the API文档.
Python
In [ 122 ] : df = pd . DataFrame ( { "id" : [ 1 , 2 , 3 , 4 , 5 , 6 ] , "raw_grade" : [ 'a' , 'b' , 'b' , 'a' , 'a' , 'e' ] } )
转换原始类别为分类数据类型.
Python
In [ 123 ] : df [ "grade" ] = df [ "raw_grade" ] . astype ( "category" )
In [ 124 ] : df [ "grade" ]
Out [ 124 ] :
0 a
1 b
2 b
3 a
4 a
5 e
Name : grade , dtype : category
Categories ( 3 , object ) : [ a , b , e ]
Series.cat.categories对应位置!)
Python
In [ 125 ] : df [ "grade" ] . cat . categories = [ "very good" , "good" , "very bad" ]
重排顺分类,同时添加缺少的分类(序列 .cat方法下返回新默认序列)
Python
In [ 126 ] : df [ "grade" ] = df [ "grade" ] . cat . set_categories ( [ "very bad" , "bad" , "medium" , "good" , "very good" ] )
In [ 127 ] : df [ "grade" ]
Out [ 127 ] :
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name : grade , dtype : category
Categories ( 5 , object ) : [ very bad , bad , medium , good , very good ]
排列分类中的顺序,不是按词汇排列.
Python
In [ 128 ] : df . sort ( "grade" )
Out [ 128 ] :
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
类别列分组,并且也显示空类别.
Python
In [ 129 ] : df . groupby ( "grade" ) . size ( )
Out [ 129 ] :
grade
very bad 1
bad NaN
medium NaN
good 2
very good 3
dtype : float64
绘图
绘图文档.
Python
In [ 130 ] : ts = pd . Series ( np . random . randn ( 1000 ) , index = pd . date_range ( '1/1/2000' , periods = 1000 ) )
In [ 131 ] : ts = ts . cumsum ( )
In [ 132 ] : ts . plot ( )
Out [ 132 ] : < matplotlib . axes . _subplots . AxesSubplot at 0xb02091ac >
在数据桢中,可以很方便的绘制带标签列:
Python
In [ 133 ] : df = pd . DataFrame ( np . random . randn ( 1000 , 4 ) , index = ts . index ,
. . . . . : columns = [ 'A' , 'B' , 'C' , 'D' ] )
. . . . . :
In [ 134 ] : df = df . cumsum ( )
In [ 135 ] : plt . figure ( ) ; df . plot ( ) ; plt . legend ( loc = 'best' )
Out [ 135 ] : < matplotlib . legend . Legend at 0xb01c9cac >
获取数据输入/输出
CSV
Python
In [ 136 ] : df . to_csv ( 'foo.csv' )
Python
In [ 137 ] : pd . read_csv ( 'foo.csv' )
Out [ 137 ] :
Unnamed : 0 A B C D
0 2000 - 01 - 01 0.266457 - 0.399641 - 0.219582 1.186860
1 2000 - 01 - 02 - 1.170732 - 0.345873 1.653061 - 0.282953
2 2000 - 01 - 03 - 1.734933 0.530468 2.060811 - 0.515536
3 2000 - 01 - 04 - 1.555121 1.452620 0.239859 - 1.156896
4 2000 - 01 - 05 0.578117 0.511371 0.103552 - 2.428202
5 2000 - 01 - 06 0.478344 0.449933 - 0.741620 - 1.962409
6 2000 - 01 - 07 1.235339 - 0.091757 - 1.543861 - 1.084753
. . . . . . . . . . . . . . . . .
993 2002 - 09 - 20 - 10.628548 - 9.153563 - 7.883146 28.313940
994 2002 - 09 - 21 - 10.390377 - 8.727491 - 6.399645 30.914107
995 2002 - 09 - 22 - 8.985362 - 8.485624 - 4.669462 31.367740
996 2002 - 09 - 23 - 9.558560 - 8.781216 - 4.499815 30.518439
997 2002 - 09 - 24 - 9.902058 - 9.340490 - 4.386639 30.105593
998 2002 - 09 - 25 - 10.216020 - 9.480682 - 3.933802 29.758560
999 2002 - 09 - 26 - 11.856774 - 10.671012 - 3.216025 29.369368
[ 1000 rows x 5 columns ]
HDF5
读写HDF存储
写入HDF5存储
Python
In [ 138 ] : df . to_hdf ( 'foo.h5' , 'df' )
读取HDF5存储
Python
In [ 139 ] : pd . read_hdf ( 'foo.h5' , 'df' )
Out [ 139 ] :
A B C D
2000 - 01 - 01 0.266457 - 0.399641 - 0.219582 1.186860
2000 - 01 - 02 - 1.170732 - 0.345873 1.653061 - 0.282953
2000 - 01 - 03 - 1.734933 0.530468 2.060811 - 0.515536
2000 - 01 - 04 - 1.555121 1.452620 0.239859 - 1.156896
2000 - 01 - 05 0.578117 0.511371 0.103552 - 2.428202
2000 - 01 - 06 0.478344 0.449933 - 0.741620 - 1.962409
2000 - 01 - 07 1.235339 - 0.091757 - 1.543861 - 1.084753
. . . . . . . . . . . . . . .
2002 - 09 - 20 - 10.628548 - 9.153563 - 7.883146 28.313940
2002 - 09 - 21 - 10.390377 - 8.727491 - 6.399645 30.914107
2002 - 09 - 22 - 8.985362 - 8.485624 - 4.669462 31.367740
2002 - 09 - 23 - 9.558560 - 8.781216 - 4.499815 30.518439
2002 - 09 - 24 - 9.902058 - 9.340490 - 4.386639 30.105593
2002 - 09 - 25 - 10.216020 - 9.480682 - 3.933802 29.758560
2002 - 09 - 26 - 11.856774 - 10.671012 - 3.216025 29.369368
[ 1000 rows x 4 columns ]
Excel
读写MS Excel
写入excel文件
Python
In [ 140 ] : df . to_excel ( 'foo.xlsx' , sheet_name = 'Sheet1' )
读取excel文件
Python
In [ 141 ] : pd . read_excel ( 'foo.xlsx' , 'Sheet1' , index_col = None , na_values = [ 'NA' ] )
Out [ 141 ] :
A B C D
2000 - 01 - 01 0.266457 - 0.399641 - 0.219582 1.186860
2000 - 01 - 02 - 1.170732 - 0.345873 1.653061 - 0.282953
2000 - 01 - 03 - 1.734933 0.530468 2.060811 - 0.515536
2000 - 01 - 04 - 1.555121 1.452620 0.239859 - 1.156896
2000 - 01 - 05 0.578117 0.511371 0.103552 - 2.428202
2000 - 01 - 06 0.478344 0.449933 - 0.741620 - 1.962409
2000 - 01 - 07 1.235339 - 0.091757 - 1.543861 - 1.084753
. . . . . . . . . . . . . . .
2002 - 09 - 20 - 10.628548 - 9.153563 - 7.883146 28.313940
2002 - 09 - 21 - 10.390377 - 8.727491 - 6.399645 30.914107
2002 - 09 - 22 - 8.985362 - 8.485624 - 4.669462 31.367740
2002 - 09 - 23 - 9.558560 - 8.781216 - 4.499815 30.518439
2002 - 09 - 24 - 9.902058 - 9.340490 - 4.386639 30.105593
2002 - 09 - 25 - 10.216020 - 9.480682 - 3.933802 29.758560
2002 - 09 - 26 - 11.856774 - 10.671012 - 3.216025 29.369368
[ 1000 rows x 4 columns ]
陷阱
如果尝试这样操作可能会看到像这样的异常:
Python
>>> if pd . Series ( [ False , True , False ] ) :
print ( "I was true" )
Traceback
. . .
ValueError : The truth value of an array is ambiguous . Use a . empty , a . any ( ) or a . all ( ) .
查看对照获取解释和怎么做的帮助
也可以查看陷阱.