Preface: the data we obtain comes in separate pieces, but we often want to link related, comparable data together, which means merging those separate pieces. Different goals call for different merge methods, and that is the topic of this lesson.
import numpy as np
import pandas as pd
Hierarchical Indexing
Along one axis of a dataset, different ways of dividing the data give different indexes. When several such indexes are represented on one table at the same time, the result is called a hierarchical index.
Let's look at Series first
data = pd.Series(['a','s','d','d','f','f','g','h'])
print(data)
0 a
1 s
2 d
3 d
4 f
5 f
6 g
7 h
dtype: object
# building a new index level under an existing one
data = pd.Series([1,23,4,5,6,7,8],index=[['大','大','大','小','小','小','中'],['a','s','d','f','g','h','t']]) # the index lists run from the outer level ---> to the inner level
print(data)
大 a 1
s 23
d 4
小 f 5
g 6
h 7
中 t 8
dtype: int64
Indexing with different levels gives different results, and the levels can also be combined with each other
data['大']
a 1
s 23
d 4
dtype: int64
data['大','a']
1
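As an aside, you can also select directly on the inner level by passing a slice for the outer level; this is standard MultiIndex selection (a small sketch):
print(data.loc[:, 'a'])
大    1
dtype: int64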
Looking at the output above, did you notice that this way of selecting values looks a lot like a DataFrame? Indeed: the unstack() method reshapes the data into a DataFrame
dstyle = data.unstack()
print(dstyle)
a d f g h s t
中 NaN NaN NaN NaN NaN NaN 8.0
大 1.0 4.0 NaN NaN NaN 23.0 NaN
小 NaN NaN 5.0 6.0 7.0 NaN NaN
# and vice versa
print(dstyle.stack())
中 t 8.0
大 a 1.0
d 4.0
s 23.0
小 f 5.0
g 6.0
h 7.0
dtype: float64
DataFrames have hierarchical indexes too; you build them by passing lists of arrays for the index and columns
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))))
print(frame)
0 1 2 3 4 5
0 443.0 96.0 233.0 490.0 374.0 346.0
1 900.0 289.0 777.0 913.0 203.0 964.0
2 265.0 246.0 408.0 278.0 434.0 327.0
3 268.0 52.0 419.0 310.0 91.0 707.0
4 646.0 13.0 767.0 791.0 989.0 21.0
5 937.0 326.0 928.0 54.0 294.0 419.0
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),
                     index=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],
                     columns=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
print(frame)
a f r
a s d f g h
Dave age 598.0 322.0 65.0 624.0 570.0 404.0
Wasa age 132.0 317.0 392.0 829.0 544.0 453.0
Dave money 475.0 968.0 13.0 744.0 263.0 181.0
Json home 680.0 323.0 863.0 520.0 531.0 851.0
grade 487.0 840.0 247.0 527.0 512.0 422.0
Honey talent 686.0 581.0 784.0 288.0 908.0 299.0
print(frame['r'])
h
Dave age 404.0
Wasa age 453.0
Dave money 181.0
Json home 851.0
grade 422.0
Honey talent 299.0
To make the meaning of each index clearer, we can name every level through the index.names and columns.names attributes
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),
index=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],
columns=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
frame.index.names=['字母','瞎写']
frame.columns.names=['名字','标签']
print(frame)
名字 a f r
标签 a s d f g h
字母 瞎写
Dave age 911.0 758.0 168.0 740.0 442.0 79.0
Wasa age 913.0 140.0 368.0 529.0 9.0 691.0
Dave money 401.0 821.0 584.0 665.0 306.0 314.0
Json home 969.0 828.0 816.0 328.0 413.0 435.0
grade 31.0 845.0 107.0 982.0 92.0 839.0
Honey talent 154.0 479.0 579.0 976.0 46.0 870.0
test = pd.DataFrame(np.ceil(np.random.uniform(1,100,(5,5))))
print(test)
0 1 2 3 4
0 27.0 64.0 13.0 85.0 79.0
1 15.0 59.0 42.0 18.0 7.0
2 69.0 41.0 39.0 12.0 95.0
3 95.0 46.0 71.0 98.0 65.0
4 45.0 44.0 95.0 96.0 36.0
Reordering and Sorting Levels
We gave our hierarchical index levels names, but a design is not set in stone: we may need to swap or rearrange levels. To swap the order of two index levels, use swaplevel()
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),
columns=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],
index=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
frame.index.names=['字母','瞎写']
frame.columns.names=['名字','标签']
print(frame.swaplevel(0,1))
名字 Dave Wasa Dave Json Honey
标签 age age money home grade talent
瞎写 字母
a a 736.0 441.0 823.0 313.0 376.0 490.0
s a 12.0 300.0 449.0 261.0 931.0 726.0
d a 262.0 881.0 554.0 469.0 461.0 533.0
f f 469.0 471.0 914.0 608.0 667.0 787.0
g f 22.0 426.0 268.0 790.0 809.0 747.0
h r 669.0 33.0 72.0 551.0 243.0 104.0
print(frame.sort_index(level=0)) # level numbers the index levels from the outside in (the outermost level is 0)
名字 Dave Wasa Dave Json Honey
标签 age age money home grade talent
字母 瞎写
a a 736.0 441.0 823.0 313.0 376.0 490.0
d 262.0 881.0 554.0 469.0 461.0 533.0
s 12.0 300.0 449.0 261.0 931.0 726.0
f f 469.0 471.0 914.0 608.0 667.0 787.0
g 22.0 426.0 268.0 790.0 809.0 747.0
r h 669.0 33.0 72.0 551.0 243.0 104.0
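swaplevel() also accepts level names instead of numbers, and it is common to chain it with sort_index() so the swapped result comes back sorted by the new outer level; a small sketch reusing the frame above:
print(frame.swaplevel('字母','瞎写').sort_index(level=0)) # same swap as swaplevel(0,1), with rows sorted by 瞎写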
Summary Statistics by Level
In plain terms: we have several index levels, and we want to compute statistics over the one we care about
To aggregate over a particular level, we need to pin down two things: 1. which level? (set with level=) 2. which axis? (set with axis=)
frame = pd.DataFrame(np.ceil(np.random.uniform(1,10,(5,5))),
                     index=[['a','s','a','f','g'],['z','x','c','a','s']],
                     columns=[[1,2,3,4,5],[1,'s','d','f','re']])
frame.index.names=['key1','key2']
frame.columns.names=['time','color']
print(frame)
time 1 2 3 4 5
color 1 s d f re
key1 key2
a z 2.0 8.0 4.0 5.0 2.0
s x 3.0 8.0 6.0 2.0 6.0
a c 10.0 8.0 3.0 8.0 6.0
f a 8.0 8.0 10.0 6.0 8.0
g s 4.0 9.0 7.0 7.0 3.0
# sum over a particular level, keeping the two questions above in mind
print(frame.sum(level='color',axis=1))
color 1 s d f re
key1 key2
a z 2.0 8.0 4.0 5.0 2.0
s x 3.0 8.0 6.0 2.0 6.0
a c 10.0 8.0 3.0 8.0 6.0
f a 8.0 8.0 10.0 6.0 8.0
g s 4.0 9.0 7.0 7.0 3.0
print(frame.sum(level='key1'))
time 1 2 3 4 5
color 1 s d f re
key1
a 12.0 16.0 7.0 13.0 8.0
s 3.0 8.0 6.0 2.0 6.0
f 8.0 8.0 10.0 6.0 8.0
g 4.0 9.0 7.0 7.0 3.0
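A caveat for newer pandas versions: the level= argument of sum() was later deprecated and eventually removed, and the documented equivalent goes through groupby; a sketch of the replacement:
# same as frame.sum(level='key1')
print(frame.groupby(level='key1').sum())
# for the column axis, transpose first, group, then transpose back
print(frame.T.groupby(level='color').sum().T)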
Turning a DataFrame Column into the Index
We can take one of a DataFrame's columns and use it directly as the row index, via the set_index() method
data = pd.DataFrame({'a':[1,2,3,4],'b':['one','two','three','four'],'c':range(4)})
print(data)
a b c
0 1 one 0
1 2 two 1
2 3 three 2
3 4 four 3
print(data.set_index(['c']))
a b
c
0 1 one
1 2 two
2 3 three
3 4 four
There is also a drop= parameter. It defaults to True, meaning the column promoted to the index is removed from the data; with drop=False the column is kept. See the example
print(data.set_index(['c'],drop=False))
a b c
c
0 1 one 0
1 2 two 1
2 3 three 2
3 4 four 3
To go the other way, turning an index back into a regular data column, use reset_index()
data01 = data.set_index(['c'])
print(data01.reset_index(['c']))
c a b
0 0 1 one
1 1 2 two
2 2 3 three
3 3 4 four
Combining Datasets
This part is the tricky one: merging several datasets is done differently depending on what you need, via join(), concat() or merge()
Database-Style DataFrame Merges
Put simply, the merge() method gives DataFrames SQL-style join operations
df1 = pd.DataFrame({'key':['a','s','d','f','g','h'],'data1':range(6)})
df2 = pd.DataFrame({'key':['a','a','d'],'data2':range(3)})
# Example 1:
print(pd.merge(df1,df2))
key data1 data2
0 a 0 0
1 a 0 1
2 d 2 2
There is a hidden default here: the merge keys on 'key'. When merging, it is best to state the key explicitly (with on=)
print(pd.merge(df1,df2,on='key'))
key data1 data2
0 a 0 0
1 a 0 1
2 d 2 2
# other join types
print(pd.merge(df1,df2,on='key',how='outer'))
key data1 data2
0 a 0 0.0
1 a 0 1.0
2 s 1 NaN
3 d 2 2.0
4 f 3 NaN
5 g 4 NaN
6 h 5 NaN
Notice from the output above that the how= argument changes how the tables combine: with how='outer' the merge takes the union of the keys, while the default, how='inner', takes the intersection
inner: use only the keys present in both tables
outer: use all keys from both tables
left: use all keys from the left table
right: use all keys from the right table
print(pd.merge(df1,df2,on='key',how='left'))
key data1 data2
0 a 0 0.0
1 a 0 1.0
2 s 1 NaN
3 d 2 2.0
4 f 3 NaN
5 g 4 NaN
6 h 5 NaN
print(pd.merge(df1,df2,on='key',how='right'))
key data1 data2
0 a 0 0
1 a 0 1
2 d 2 2
There is one more case:
left = pd.DataFrame({'key1':['a','s','d','f','g','h','j'],'val':['one','two','three','four','fiv','six','seven'],
'key2':[1,2,3,4,5,6,7]})
right = pd.DataFrame({'key1':['a','s','d'],'val':['one','two','three'],
'key2':[1,2,2]})
print(pd.merge(left,right,on=['key1','key2']))
key1 val_x key2 val_y
0 a one 1 one
1 s two 2 two
Here two columns together act as the merge key: each row's pair of values forms a tuple, and the tuples are compared just as single keys were above
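Note also the val_x and val_y headers above: when the two tables share a non-key column, merge() disambiguates it with suffixes, and the suffixes= parameter lets you pick your own; a sketch with the same frames:
print(pd.merge(left,right,on=['key1','key2'],suffixes=('_left','_right')))
  key1 val_left  key2 val_right
0    a      one     1       one
1    s      two     2       two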
Merging on Index
Sometimes we need to merge on the index. In merge(), the left_index= and right_index= parameters let the index act as the merge key
left = pd.DataFrame({'NUM':range(4),'time':range(4,8)},index=['a','b','c','e'])
print(left)
NUM time
a 0 4
b 1 5
c 2 6
e 3 7
right = pd.DataFrame({'code':range(6),'push':range(100,106),'key':['a','b','c','d','e','f']})
print(right)
code push key
0 0 100 a
1 1 101 b
2 2 102 c
3 3 103 d
4 4 104 e
5 5 105 f
print(pd.merge(left,right,right_on='key',left_index=True))
NUM time code push key
0 0 4 0 100 a
1 1 5 1 101 b
2 2 6 2 102 c
4 3 7 4 104 e
print(pd.merge(left,right,right_on='key',left_index=True,how='outer'))
NUM time code push key
0 0.0 4.0 0 100 a
1 1.0 5.0 1 101 b
2 2.0 6.0 2 102 c
4 3.0 7.0 4 104 e
3 NaN NaN 3 103 d
5 NaN NaN 5 105 f
With right_on and left_on you control which column serves as the merge key on each side; with left_index and right_index even the index can serve as the key. Together that covers just about every case
And once single indexes are handled, multi-level indexes are just more of the same
left = pd.DataFrame({'name':['Dave','Json','Hash','Happy'],'year':['2000','2001','2002','2003'],'money':['1318','1551','15315','48644']})
right = pd.DataFrame({'age':['18','18','19','20']},index=[['Dave','Dave','Lashi','Hash'],['2000','2001','2004','2005']])
print(left)
name year money
0 Dave 2000 1318
1 Json 2001 1551
2 Hash 2002 15315
3 Happy 2003 48644
print(right)
age
Dave 2000 18
2001 18
Lashi 2004 19
Hash 2005 20
print(pd.merge(left,right,left_on=['name','year'],right_index=True))
name year money age
0 Dave 2000 1318 18
Now a special case: both sides of the merge use their index
left = pd.DataFrame({'name1':['Dave','Json','Hash','Happy'],'year1':['2000','2001','2002','2003'],'money1':['1318','1551','15315','48644']},
index=['a','b','c','d'])
right = pd.DataFrame({'name2':['Andong','Json','Beihang','Happy'],'year2':['1955','2001','1999','2003'],'money2':['1318','1551','15315','48644']},
index=['a','e','g','d'])
print(left)
name1 year1 money1
a Dave 2000 1318
b Json 2001 1551
c Hash 2002 15315
d Happy 2003 48644
print(right)
name2 year2 money2
a Andong 1955 1318
e Json 2001 1551
g Beihang 1999 15315
d Happy 2003 48644
print(pd.merge(right,left,left_index=True,right_index=True))
name2 year2 money2 name1 year1 money1
a Andong 1955 1318 Dave 2000 1318
d Happy 2003 48644 Happy 2003 48644
For this situation, pandas' join() method gets it done
print(left.join(right,how='inner'))
# join() requires that there be no overlapping column names (otherwise pass lsuffix=/rsuffix=)
name1 year1 money1 name2 year2 money2
a Dave 2000 1318 Andong 1955 1318
d Happy 2003 48644 Happy 2003 48644
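If the two frames did share column names, join() would raise an error unless told how to rename them; the lsuffix= and rsuffix= parameters handle that. A sketch with two small made-up frames:
df_a = pd.DataFrame({'val':[1,2]},index=['a','d']) # hypothetical frames, for illustration only
df_b = pd.DataFrame({'val':[3,4]},index=['a','d'])
print(df_a.join(df_b,lsuffix='_a',rsuffix='_b'))
   val_a  val_b
a      1      3
d      2      4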
Concatenating Along an Axis
The way I picture concatenation: several tables stacked on top of one another. Columns a table lacks are added and padded with NaN; columns with the same name are aligned. (Duplicate index labels are kept as-is; concat only raises an error if you ask it to check, via verify_integrity=True.)
The method for this is concat(); we will go through its parameters one by one
left = pd.DataFrame({'name1':['Dave','Json','Hash','Happy'],'year1':['2000','2001','2002','2003'],'money1':['1318','1551','15315','48644']},
index=['a','b','c','d'])
right = pd.DataFrame({'name2':['Andong','Json','Beihang','Happy'],'year2':['1955','2001','1999','2003'],'money2':['1318','1551','15315','48644']},
index=['a','e','g','d'])
print(pd.concat([left,right]))
money1 money2 name1 name2 year1 year2
a 1318 NaN Dave NaN 2000 NaN
b 1551 NaN Json NaN 2001 NaN
c 15315 NaN Hash NaN 2002 NaN
d 48644 NaN Happy NaN 2003 NaN
a NaN 1318 NaN Andong NaN 1955
e NaN 1551 NaN Json NaN 2001
g NaN 15315 NaN Beihang NaN 1999
d NaN 48644 NaN Happy NaN 2003
C:\Users\Lenovo\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
"""Entry point for launching an IPython kernel.
Starting with Series
data1 = pd.Series(range(5))
data2 = pd.Series(range(7,12),index=range(5,10))
print(data1)
0    0
1    1
2    2
3    3
4    4
dtype: int64
print(data2)
5 7
6 8
7 9
8 10
9 11
dtype: int64
print(pd.concat([data1,data2]))
0 0
1 1
2 2
3 3
4 4
5 7
6 8
7 9
8 10
9 11
dtype: int64
If I want to add labels to tell the different Series apart, the parameter to add is keys=
a=pd.concat([data1,data2],keys=['key1','key2'])
print(a)
key1 0 0
1 1
2 2
3 3
4 4
key2 5 7
6 8
7 9
8 10
9 11
dtype: int64
print(a.unstack())
0 1 2 3 4 5 6 7 8 9
key1 0.0 1.0 2.0 3.0 4.0 NaN NaN NaN NaN NaN
key2 NaN NaN NaN NaN NaN 7.0 8.0 9.0 10.0 11.0
Without quite meaning to, we have created a hierarchical index
print(pd.concat([data1,data2],axis=1,keys=['key1','key2'])) # at first glance this just adds columns, but really the keys serve to distinguish the different Series
key1 key2
0 0.0 NaN
1 1.0 NaN
2 2.0 NaN
3 3.0 NaN
4 4.0 NaN
5 NaN 7.0
6 NaN 8.0
7 NaN 9.0
8 NaN 10.0
9 NaN 11.0
If we want to pick exactly which index labels appear (and in what order), we can use the join_axes= parameter. Note that join_axes was removed in later pandas versions; a reindex() equivalent is sketched after the output
print(pd.concat([data1,data2],axis=1,keys=['key1','key2'],join_axes=[[1,2,5,3]])) # join_axes fixes the labels (and order) of the non-concatenation axis
key1 key2
1 1.0 NaN
2 2.0 NaN
5 NaN 7.0
3 3.0 NaN
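As noted above, join_axes was removed in later pandas releases; the documented replacement is to reindex the concatenated result, which should give the same table:
print(pd.concat([data1,data2],axis=1,keys=['key1','key2']).reindex([1,2,5,3]))
   key1  key2
1   1.0   NaN
2   2.0   NaN
5   NaN   7.0
3   3.0   NaN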
Combining Data with Overlap
The merges above used merge and concat; a different problem arises when two datasets overlap on their labels, and we need a way to fill in the missing values of one from the other
For a Series with missing data, NumPy's where can fill it in from another Series
# an example
test_a = pd.Series(['a','s','d','f','g'])
test_a[4]=np.nan
print(test_a)
0 a
1 s
2 d
3 f
4 NaN
dtype: object
test_b = pd.Series(['q','w','e','r','t'])
print(test_b)
0 q
1 w
2 e
3 r
4 t
dtype: object
np.info(np.where)
where(condition, [x, y])
Return elements chosen from `x` or `y` depending on `condition`.
.. note::
When only `condition` is provided, this function is a shorthand for
``np.asarray(condition).nonzero()``. Using `nonzero` directly should be
preferred, as it behaves correctly for subclasses. The rest of this
documentation covers only the case where all three arguments are
provided.
Parameters
----------
condition : array_like, bool
Where True, yield `x`, otherwise yield `y`.
x, y : array_like
Values from which to choose. `x`, `y` and `condition` need to be
broadcastable to some shape.
Returns
-------
out : ndarray
An array with elements from `x` where `condition` is True, and elements
from `y` elsewhere.
See Also
--------
choose
nonzero : The function that is called when x and y are omitted
Notes
-----
If all the arrays are 1-D, `where` is equivalent to::
[xv if c else yv
for c, xv, yv in zip(condition, x, y)]
Examples
--------
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(a < 5, a, 10*a)
array([ 0, 1, 2, 3, 4, 50, 60, 70, 80, 90])
This can be used on multidimensional arrays too:
>>> np.where([[True, False], [True, True]],
... [[1, 2], [3, 4]],
... [[9, 8], [7, 6]])
array([[1, 8],
[3, 4]])
The shapes of x, y, and the condition are broadcast together:
>>> x, y = np.ogrid[:3, :4]
>>> np.where(x < y, x, 10 + y) # both x and 10+y are broadcast
array([[10, 0, 0, 0],
[10, 11, 1, 1],
[10, 11, 12, 2]])
>>> a = np.array([[0, 1, 2],
... [0, 2, 4],
... [0, 3, 6]])
>>> np.where(a < 4, a, -1) # -1 is broadcast
array([[ 0, 1, 2],
[ 0, 2, -1],
[ 0, 3, -1]])
np.where(pd.isnull(test_a),test_b,test_a) # where the condition holds, take the value from test_b; otherwise keep the value from test_a
array(['a', 's', 'd', 'f', 't'], dtype=object)
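Series also has a built-in method for this exact pattern: combine_first() takes values from the caller and falls back to the other object wherever the caller is null:
print(test_a.combine_first(test_b))
0    a
1    s
2    d
3    f
4    t
dtype: object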
Building on the example above, we fill in data by laying the tables over one another and mapping one onto the other
For DataFrames, the combine_first() method does the same kind of position-by-position matching
frame_a = pd.DataFrame({'a':['a','s','d','f'],'b':[np.nan,'x','c','r'],'c':['i','j','i',np.nan]})
frame_b = pd.DataFrame({'a':[np.nan,'c','d',np.nan],'b':['j','u','j',np.nan]})
print(frame_a)
a b c
0 a NaN i
1 s x j
2 d c i
3 f r NaN
print(frame_b)
a b
0 NaN j
1 c u
2 d j
3 NaN NaN
print(frame_a.combine_first(frame_b)) # gaps in frame_a are filled by the values at the corresponding positions in frame_b
a b c
0 a j i
1 s x j
2 d c i
3 f r NaN
Reshaping and Pivoting
Reshaping with Hierarchical Indexing
We met two methods in the operations above: stack() and unstack()
stack(): DataFrame ----> Series
unstack(): Series ----> DataFrame
data = pd.DataFrame({'a':[1,2,3,4],'b':[4,5,6,7]},index=['z','x','c','v'])
print(data)
a b
z 1 4
x 2 5
c 3 6
v 4 7
a = data.stack()
print(a)
z a 1
b 4
x a 2
b 5
c a 3
b 6
v a 4
b 7
dtype: int64
At this point we can say the axis has been rotated. Watch the direction of the rotation and where the data lands: it rotated into the inner level of the index
print(a.unstack())
a b
z 1 4
x 2 5
c 3 6
v 4 7
test = pd.Series(range(5),index=([['a','s','d','f','g'],['a','x','d','f','g']]))
print(test)
a a 0
s x 1
d d 2
f f 3
g g 4
dtype: int64
print(test.unstack())
a d f g x
a 0.0 NaN NaN NaN NaN
d NaN 2.0 NaN NaN NaN
f NaN NaN 3.0 NaN NaN
g NaN NaN NaN 4.0 NaN
s NaN NaN NaN NaN 1.0
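Note that stack() filters out missing data by default, so unstacking and restacking round-trips (the values come back as floats, since the NaN forced a dtype change); pass dropna=False to keep every combination:
print(test.unstack().stack()) # back to the original five values
print(test.unstack().stack(dropna=False)) # keeps all 25 row/column combinations, mostly NaN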
Axis rotations always start from the innermost level
But remember that we can name the levels of an axis. So instead of always rotating from the innermost level, when calling unstack and stack we can pass the level to rotate, either as a number (the outermost level is 0) or by its name
frame = pd.DataFrame(np.ceil(np.random.uniform(1,999,(4,4))),index=[['a','a','d','d'],['z','x','c','v']],
columns=[['haha','haha','lala','lala'],['q','w','e','r']])
print(frame)
haha lala
q w e r
a z 816.0 802.0 729.0 245.0
x 102.0 511.0 227.0 494.0
d c 146.0 710.0 894.0 343.0
v 466.0 196.0 580.0 141.0
frame.index.names=['key_a','state_a']
frame.columns.names=['key_b','state_b']
print(frame)
key_b haha lala
state_b q w e r
key_a state_a
a z 816.0 802.0 729.0 245.0
x 102.0 511.0 227.0 494.0
d c 146.0 710.0 894.0 343.0
v 466.0 196.0 580.0 141.0
frame.unstack() # the innermost index level (state_a) moves into the columns, as the innermost column level
key_b     haha                                                    lala
state_b      q                           w                           e                           r
state_a      c      v      x      z      c      v      x      z      c      v      x      z      c      v      x      z
key_a
a          NaN    NaN  102.0  816.0    NaN    NaN  511.0  802.0    NaN    NaN  227.0  729.0    NaN    NaN  494.0  245.0
d        146.0  466.0    NaN    NaN  710.0  196.0    NaN    NaN  894.0  580.0    NaN    NaN  343.0  141.0    NaN    NaN
print(frame.stack(0)) # the outermost column level (key_b) moves to the innermost index level
state_b e q r w
key_a state_a key_b
a z haha NaN 816.0 NaN 802.0
lala 729.0 NaN 245.0 NaN
x haha NaN 102.0 NaN 511.0
lala 227.0 NaN 494.0 NaN
d c haha NaN 146.0 NaN 710.0
lala 894.0 NaN 343.0 NaN
v haha NaN 466.0 NaN 196.0
lala 580.0 NaN 141.0 NaN
frame.unstack(0) # the outermost index level (key_a) moves into the columns, as the innermost column level
key_b     haha                          lala
state_b      q             w               e             r
key_a        a      d      a      d       a      d      a      d
state_a
c          NaN  146.0    NaN  710.0     NaN  894.0    NaN  343.0
v          NaN  466.0    NaN  196.0     NaN  580.0    NaN  141.0
x        102.0    NaN  511.0    NaN   227.0    NaN  494.0    NaN
z        816.0    NaN  802.0    NaN   729.0    NaN  245.0    NaN
print(frame.unstack(1)) # index level 1 (state_a, the inner level) moves into the columns, as the innermost column level
key_b haha lala
state_b q w e
state_a c v x z c v x z c v
key_a
a NaN NaN 102.0 816.0 NaN NaN 511.0 802.0 NaN NaN
d 146.0 466.0 NaN NaN 710.0 196.0 NaN NaN 894.0 580.0
key_b
state_b r
state_a x z c v x z
key_a
a 227.0 729.0 NaN NaN 494.0 245.0
d NaN NaN 343.0 141.0 NaN NaN
After all these examples, one rule can be summed up: no matter which level is rotated, it always lands in the innermost position of the destination axis
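And since we named the levels, both methods also accept a level name in place of a number; a quick sketch with the frame above:
print(frame.unstack('state_a')) # same as frame.unstack(1)
print(frame.stack('key_b')) # same as frame.stack(0)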
Pivoting 'Long' Format to 'Wide' Format
What exactly does the title mean? Let's load some data and see
data = pd.read_csv('E:/Datawhale数据分析/PythonForDataAnalysis-master/PythonForDataAnalysis-master/ch08/macrodata.csv')
The dataset can be downloaded from: https://github.com/wen-fei/PythonForDataAnalysis
print(data.head())
year quarter realgdp realcons realinv realgovt realdpi cpi
0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54
m1 tbilrate unemp pop infl realint
0 139.7 2.82 5.8 177.146 0.00 0.00
1 141.7 3.08 5.1 177.830 2.34 0.74
2 140.5 3.82 5.3 178.657 2.74 1.09
3 140.0 4.33 5.6 179.386 0.27 4.06
4 139.6 3.50 5.2 180.007 2.31 1.19
periods=pd.PeriodIndex(year=data.year,quarter=data.quarter,name='data')
print(periods) # a PeriodIndex: one Period per year/quarter pair
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
'1960Q3', '1960Q4', '1961Q1', '1961Q2',
...
'2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
'2008Q4', '2009Q1', '2009Q2', '2009Q3'],
dtype='period[Q-DEC]', name='data', length=203, freq='Q-DEC')
columns = pd.Index(['realgdp','infl','unemp'],name='item')
print(columns)
Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')
data = data.reindex(columns = columns) # keep only the three columns of interest
print(data.head())
item   realgdp  infl  unemp
0     2710.349  0.00    5.8
1     2778.801  2.34    5.1
2     2775.488  2.74    5.3
3     2785.204  0.27    5.6
4     2847.699  2.31    5.2
data.index = periods.to_timestamp('D','end')
print(data.head())
item realgdp infl unemp
data
1959-03-31 23:59:59.999999999 2710.349 0.00 5.8
1959-06-30 23:59:59.999999999 2778.801 2.34 5.1
1959-09-30 23:59:59.999999999 2775.488 2.74 5.3
1959-12-31 23:59:59.999999999 2785.204 0.27 5.6
1960-03-31 23:59:59.999999999 2847.699 2.31 5.2
ldata = data.stack().reset_index() # note how the shape of the data changes here
print(ldata.head()) # 'item' here is the name we gave the columns Index earlier
data item 0
0 1959-03-31 23:59:59.999999999 realgdp 2710.349
1 1959-03-31 23:59:59.999999999 infl 0.000
2 1959-03-31 23:59:59.999999999 unemp 5.800
3 1959-06-30 23:59:59.999999999 realgdp 2778.801
4 1959-06-30 23:59:59.999999999 infl 2.340
ldata = data.stack().reset_index().rename(columns={0:'values'})
print(ldata.head())
data item values
0 1959-03-31 23:59:59.999999999 realgdp 2710.349
1 1959-03-31 23:59:59.999999999 infl 0.000
2 1959-03-31 23:59:59.999999999 unemp 5.800
3 1959-06-30 23:59:59.999999999 realgdp 2778.801
4 1959-06-30 23:59:59.999999999 infl 2.340
We can see that item has collected the different variable names into a single column
In a DataFrame, we usually prefer each variable in its own column; pivot() does exactly that
pivoted = ldata.pivot('data','item','values') # DataFrame.pivot(index=None, columns=None, values=None)
print(pivoted.head()) # pivot() builds a new frame from the given index, columns and values columns, in that order
item infl realgdp unemp
data
1959-03-31 23:59:59.999999999 0.00 2710.349 5.8
1959-06-30 23:59:59.999999999 2.34 2778.801 5.1
1959-09-30 23:59:59.999999999 2.74 2775.488 5.3
1959-12-31 23:59:59.999999999 0.27 2785.204 5.6
1960-03-31 23:59:59.999999999 2.31 2847.699 5.2
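Worth knowing: pivot() is essentially a shortcut for set_index() followed by unstack(); selecting the values column first makes the two results identical. A sketch:
equiv = ldata.set_index(['data','item'])['values'].unstack('item')
print(equiv.head()) # the same frame as pivoted above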
Here we add one more column, value_a, to ldata
ldata['value_a'] = np.random.randint(len(ldata)) # randint(n) returns a single scalar here, broadcast to every row (hence the repeated value below)
print(ldata.head())
data item values value_a
0 1959-03-31 23:59:59.999999999 realgdp 2710.349 371
1 1959-03-31 23:59:59.999999999 infl 0.000 371
2 1959-03-31 23:59:59.999999999 unemp 5.800 371
3 1959-06-30 23:59:59.999999999 realgdp 2778.801 371
4 1959-06-30 23:59:59.999999999 infl 2.340 371
If we omit the last argument of pivot() in this DataFrame, hierarchical columns are formed
pivoted = ldata.pivot('data','item')
print(pivoted.head())
values value_a
item infl realgdp unemp infl realgdp unemp
data
1959-03-31 23:59:59.999999999 0.00 2710.349 5.8 371 371 371
1959-06-30 23:59:59.999999999 2.34 2778.801 5.1 371 371 371
1959-09-30 23:59:59.999999999 2.74 2775.488 5.3 371 371 371
1959-12-31 23:59:59.999999999 0.27 2785.204 5.6 371 371 371
1960-03-31 23:59:59.999999999 2.31 2847.699 5.2 371 371 371
Pivoting 'Wide' Format to 'Long' Format
Going to wide format meant spreading the different values of one column out into separate columns, as pivot() did with item above
Long format is the reverse: several columns are folded, by some distinguishing label, into a single column. The method for this is pd.melt()
pd.melt():https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
frame = pd.DataFrame({'a':[1,2,3,4],'b':[4,5,6,7]},index=['z','x','c','v'])
print(frame)
a b
z 1 4
x 2 5
c 3 6
v 4 7
print(frame.melt())
variable value
0 a 1
1 a 2
2 a 3
3 a 4
4 b 4
5 b 5
6 b 6
7 b 7
test = pd.melt(frame,id_vars='a')
print(test)
a variable value
0 1 b 4
1 2 b 5
2 3 b 6
3 4 b 7
# now put a and b together into one column
test = pd.melt(frame,value_vars=['a','b'])
print(test)
variable value
0 a 1
1 a 2
2 a 3
3 a 4
4 b 4
5 b 5
6 b 6
7 b 7
What if we want to go back to wide format?
print(test.reset_index(drop=True))
variable value
0 a 1
1 a 2
2 a 3
3 a 4
4 b 4
5 b 5
6 b 6
7 b 7
shape = test.pivot(index=None,columns='variable',values='value') #DataFrame.pivot(index=None, columns=None, values=None)
print(shape)
variable a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 4.0 NaN
4 NaN 4.0
5 NaN 5.0
6 NaN 6.0
7 NaN 7.0
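The NaN scatter above appears because the melt had no id column, so pivot() can only place each value on its original row number. Keeping the original index as an id column gives a clean round trip (a sketch; 'index' is the column name reset_index() creates for an unnamed index):
long_frame = frame.reset_index().melt(id_vars='index')
print(long_frame.pivot(index='index',columns='variable',values='value'))
variable  a  b
index
c         3  6
v         4  7
x         2  5
z         1  4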