Preface: the data we collect often arrives in separate pieces, yet we want to link related, comparable data together, which means combining those separate pieces. Different goals call for different merging methods, and this lesson is about exactly that: combining separate datasets.

import numpy as np
import pandas as pd

Hierarchical Indexing

Along one axis of a dataset, different ways of partitioning the data give different indexes. When several of these indexes appear together on the same axis of one table, we call the result a hierarchical index.

Let's look at Series first.

data = pd.Series(['a','s','d','d','f','f','g','h'])
print(data)
0    a
1    s
2    d
3    d
4    f
5    f
6    g
7    h
dtype: object
#create a new level of index under an existing one
data = pd.Series([1,23,4,5,6,7,8],index=[['大','大','大','小','小','小','中'],['a','s','d','f','g','h','t']]) #the lists passed to index go from the outer level to the inner level
print(data)
大  a     1
   s    23
   d     4
小  f     5
   g     6
   h     7
中  t     8
dtype: int64
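
As a side note, the same hierarchical index can also be built explicitly with pd.MultiIndex.from_arrays; a minimal sketch (the level names 'size' and 'letter' are just illustrative):

#build the MultiIndex first, then hand it to the Series
idx = pd.MultiIndex.from_arrays(
    [['大','大','大','小','小','小','中'],
     ['a','s','d','f','g','h','t']],
    names=['size','letter'])
data = pd.Series([1,23,4,5,6,7,8],index=idx)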

Indexing with different levels gives different results, and of course the levels interact with each other:

data['大']
a     1
s    23
d     4
dtype: int64
data['大','a']
1

Looking at the selections above, you may notice that this access pattern looks a lot like a DataFrame. Indeed, the unstack() method reshapes such a Series into a DataFrame.

dstyle = data.unstack()
print(dstyle)
a    d    f    g    h     s    t
中  NaN  NaN  NaN  NaN  NaN   NaN  8.0
大  1.0  4.0  NaN  NaN  NaN  23.0  NaN
小  NaN  NaN  5.0  6.0  7.0   NaN  NaN
#and vice versa
print(dstyle.stack())
中  t     8.0
大  a     1.0
   d     4.0
   s    23.0
小  f     5.0
   g     6.0
   h     7.0
dtype: float64

DataFrames support hierarchical indexes too; they are built by passing nested lists, one list per level:

frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))))
print(frame)
0      1      2      3      4      5
0  443.0   96.0  233.0  490.0  374.0  346.0
1  900.0  289.0  777.0  913.0  203.0  964.0
2  265.0  246.0  408.0  278.0  434.0  327.0
3  268.0   52.0  419.0  310.0   91.0  707.0
4  646.0   13.0  767.0  791.0  989.0   21.0
5  937.0  326.0  928.0   54.0  294.0  419.0
frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),index=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],columns=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
print(frame)
a                    f             r
                  a      s      d      f      g      h
Dave  age     598.0  322.0   65.0  624.0  570.0  404.0
Wasa  age     132.0  317.0  392.0  829.0  544.0  453.0
Dave  money   475.0  968.0   13.0  744.0  263.0  181.0
Json  home    680.0  323.0  863.0  520.0  531.0  851.0
      grade   487.0  840.0  247.0  527.0  512.0  422.0
Honey talent  686.0  581.0  784.0  288.0  908.0  299.0
print(frame['r'])
h
Dave  age     404.0
Wasa  age     453.0
Dave  money   181.0
Json  home    851.0
      grade   422.0
Honey talent  299.0

To document what each index level means, we can name the levels through the index.names and columns.names attributes (note these are attributes, not methods):

frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),
                     index=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],
                     columns=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
frame.index.names=['字母','瞎写']
frame.columns.names=['名字','标签']

print(frame)
名字                a                    f             r
标签                a      s      d      f      g      h
字母    瞎写                                              
Dave  age     911.0  758.0  168.0  740.0  442.0   79.0
Wasa  age     913.0  140.0  368.0  529.0    9.0  691.0
Dave  money   401.0  821.0  584.0  665.0  306.0  314.0
Json  home    969.0  828.0  816.0  328.0  413.0  435.0
      grade    31.0  845.0  107.0  982.0   92.0  839.0
Honey talent  154.0  479.0  579.0  976.0   46.0  870.0
test = pd.DataFrame(np.ceil(np.random.uniform(1,100,(5,5))))
print(test)
0     1     2     3     4
0  27.0  64.0  13.0  85.0  79.0
1  15.0  59.0  42.0  18.0   7.0
2  69.0  41.0  39.0  12.0  95.0
3  95.0  46.0  71.0  98.0  65.0
4  45.0  44.0  95.0  96.0  36.0

Reordering and Sorting Levels

Once we have named the levels of a hierarchical index, the design is not set in stone: we may want to swap or rearrange levels. To swap the order of two levels, use swaplevel().

frame = pd.DataFrame(np.ceil(np.random.uniform(1,1000,(6,6))),
                     columns=[['Dave','Wasa','Dave','Json','Json','Honey'],['age','age','money','home','grade','talent']],
                     index=[['a','a','a','f','f','r'],['a','s','d','f','g','h']])
frame.index.names=['字母','瞎写']
frame.columns.names=['名字','标签']

print(frame.swaplevel(0,1))
名字      Dave   Wasa   Dave   Json         Honey
标签       age    age  money   home  grade talent
瞎写 字母                                          
a  a   736.0  441.0  823.0  313.0  376.0  490.0
s  a    12.0  300.0  449.0  261.0  931.0  726.0
d  a   262.0  881.0  554.0  469.0  461.0  533.0
f  f   469.0  471.0  914.0  608.0  667.0  787.0
g  f    22.0  426.0  268.0  790.0  809.0  747.0
h  r   669.0   33.0   72.0  551.0  243.0  104.0
print(frame.sort_index(level=0))   #level counts from the outside in (the outermost level is 0)
名字      Dave   Wasa   Dave   Json         Honey
标签       age    age  money   home  grade talent
字母 瞎写                                          
a  a   736.0  441.0  823.0  313.0  376.0  490.0
   d   262.0  881.0  554.0  469.0  461.0  533.0
   s    12.0  300.0  449.0  261.0  931.0  726.0
f  f   469.0  471.0  914.0  608.0  667.0  787.0
   g    22.0  426.0  268.0  790.0  809.0  747.0
r  h   669.0   33.0   72.0  551.0  243.0  104.0
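
A common follow-up, sketched here with the same frame, is to chain the two calls so the swapped result comes back sorted on its new outer level:

#swap the two index levels, then sort on the new outer level
print(frame.swaplevel('字母','瞎写').sort_index(level=0))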

Summary Statistics by Level

In short: given a multi-level index, we want summary statistics over one particular level of interest.

To aggregate over a level we need to pin down two things: 1. which level? (specified with level=) 2. along which axis? (specified with axis=)

frame = pd.DataFrame(np.ceil(np.random.uniform(1,10,(5,5))),index=[['a','s','a','f','g'],['z','x','c','a','s']],columns=[[1,2,3,4,5],[1,'s','d','f','re']])
frame.index.names=['key1','key2']
frame.columns.names=['time','color']
print(frame)
time          1    2     3    4    5
color         1    s     d    f   re
key1 key2                           
a    z      2.0  8.0   4.0  5.0  2.0
s    x      3.0  8.0   6.0  2.0  6.0
a    c     10.0  8.0   3.0  8.0  6.0
f    a      8.0  8.0  10.0  6.0  8.0
g    s      4.0  9.0   7.0  7.0  3.0
#sum over a particular level, keeping the two points above in mind
print(frame.sum(level='color',axis=1))  #each 'color' label is unique here, so the sums equal the originals
color         1    s     d    f   re
key1 key2                           
a    z      2.0  8.0   4.0  5.0  2.0
s    x      3.0  8.0   6.0  2.0  6.0
a    c     10.0  8.0   3.0  8.0  6.0
f    a      8.0  8.0  10.0  6.0  8.0
g    s      4.0  9.0   7.0  7.0  3.0
print(frame.sum(level='key1'))
time      1     2     3     4    5
color     1     s     d     f   re
key1                              
a      12.0  16.0   7.0  13.0  8.0
s       3.0   8.0   6.0   2.0  6.0
f       8.0   8.0  10.0   6.0  8.0
g       4.0   9.0   7.0   7.0  3.0
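
One caveat: newer pandas versions deprecate sum(level=...) in favor of groupby. A sketch of the equivalent spelling, using the same frame:

#groupby over an index level replaces sum(level=...)
print(frame.groupby(level='key1').sum())
#for the column axis, transpose, group, then transpose back
print(frame.T.groupby(level='color').sum().T)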

Using a DataFrame's Columns as Its Index

We can promote one or more of a DataFrame's columns to be the row index, using the set_index() method.

data = pd.DataFrame({'a':[1,2,3,4],'b':['one','two','three','four'],'c':range(4)})
print(data)
a      b  c
0  1    one  0
1  2    two  1
2  3  three  2
3  4   four  3
print(data.set_index(['c']))
a      b
c          
0  1    one
1  2    two
2  3  three
3  4   four

set_index() also takes a drop= parameter. It defaults to True, meaning the column used as the index is removed from the data; with drop=False the column is kept. For example:

print(data.set_index(['c'],drop=False))
a      b  c
c             
0  1    one  0
1  2    two  1
2  3  three  2
3  4   four  3

To go the other way, turning an index back into a regular data column, use reset_index().

data01 = data.set_index(['c'])
print(data01.reset_index(['c']))
c  a      b
0  0  1    one
1  1  2    two
2  2  3  three
3  3  4   four
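
As an aside, set_index() also accepts several columns at once, which produces a hierarchical index; a quick sketch with the same data:

print(data.set_index(['b','c']))  #'b' becomes the outer level, 'c' the inner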

Combining and Merging Datasets

This part can be confusing: depending on what you need, there are several ways to combine datasets, namely join(), concat(), and merge().

Database-Style DataFrame Merges

Simply put, the merge() method performs SQL-style join operations between DataFrames.

df1 = pd.DataFrame({'key':['a','s','d','f','g','h'],'data1':range(6)})
df2 = pd.DataFrame({'key':['a','a','d'],'data2':range(3)})
#Example 1:
print(pd.merge(df1,df2))
key  data1  data2
0   a      0      0
1   a      0      1
2   d      2      2

By default merge() joins on the overlapping column name, here key. When merging, it is best to state the join key explicitly (with on=).

print(pd.merge(df1,df2,on='key'))
key  data1  data2
0   a      0      0
1   a      0      1
2   d      2      2
#there are other join types too
print(pd.merge(df1,df2,on='key',how='outer'))
key  data1  data2
0   a      0    0.0
1   a      0    1.0
2   s      1    NaN
3   d      2    2.0
4   f      3    NaN
5   g      4    NaN
6   h      5    NaN

Notice above that the how= argument changes the type of join: with how='outer' the merge takes the union of the keys, while the default, how='inner', takes their intersection.

inner: use only the keys present in both tables
outer: use the union of keys from both tables
left: use all keys from the left table
right: use all keys from the right table

print(pd.merge(df1,df2,on='key',how='left'))
key  data1  data2
0   a      0    0.0
1   a      0    1.0
2   s      1    NaN
3   d      2    2.0
4   f      3    NaN
5   g      4    NaN
6   h      5    NaN
print(pd.merge(df1,df2,on='key',how='right'))
key  data1  data2
0   a      0      0
1   a      0      1
2   d      2      2

There is one more case:

left = pd.DataFrame({'key1':['a','s','d','f','g','h','j'],'val':['one','two','three','four','fiv','six','seven'],
                    'key2':[1,2,3,4,5,6,7]})
right =  pd.DataFrame({'key1':['a','s','d'],'val':['one','two','three'],
                    'key2':[1,2,2]})
print(pd.merge(left,right,on=['key1','key2']))
key1 val_x  key2 val_y
0    a   one     1   one
1    s   two     2   two

Here two columns serve as the join key: conceptually, the elements of the two columns are paired into tuples, which are then matched exactly as a single key would be.
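
Notice also the automatic _x/_y suffixes on the overlapping non-key column val; if those matter, the suffixes= parameter lets you choose your own. A small sketch:

print(pd.merge(left,right,on=['key1','key2'],suffixes=('_left','_right')))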

Merging on Index

Sometimes the merge key lives in the index. Passing left_index= and/or right_index= to merge() lets you join on the index.

left = pd.DataFrame({'NUM':range(4),'time':range(4,8)},index=['a','b','c','e'])
print(left)
NUM  time
a    0     4
b    1     5
c    2     6
e    3     7
right = pd.DataFrame({'code':range(6),'push':range(100,106),'key':['a','b','c','d','e','f']})
print(right)
code  push key
0     0   100   a
1     1   101   b
2     2   102   c
3     3   103   d
4     4   104   e
5     5   105   f
print(pd.merge(left,right,right_on='key',left_index=True))
NUM  time  code  push key
0    0     4     0   100   a
1    1     5     1   101   b
2    2     6     2   102   c
4    3     7     4   104   e
print(pd.merge(left,right,right_on='key',left_index=True,how='outer'))
NUM  time  code  push key
0  0.0   4.0     0   100   a
1  1.0   5.0     1   101   b
2  2.0   6.0     2   102   c
4  3.0   7.0     4   104   e
3  NaN   NaN     3   103   d
5  NaN   NaN     5   105   f

With left_on and right_on you control which column serves as the join key on each side; with left_index and right_index even the index can serve as the key. Together that covers every case.

Once single indexes are sorted out, hierarchically indexed merges follow by analogy.

left = pd.DataFrame({'name':['Dave','Json','Hash','Happy'],'year':['2000','2001','2002','2003'],'money':['1318','1551','15315','48644']})
right = pd.DataFrame({'age':['18','18','19','20']},index=[['Dave','Dave','Lashi','Hash'],['2000','2001','2004','2005']])
print(left)
name  year  money
0   Dave  2000   1318
1   Json  2001   1551
2   Hash  2002  15315
3  Happy  2003  48644
print(right)
age
Dave  2000  18
      2001  18
Lashi 2004  19
Hash  2005  20
print(pd.merge(left,right,left_on=['name','year'],right_index=True))
name  year money age
0  Dave  2000  1318  18

Finally, a special case: merging on the indexes of both sides.

left = pd.DataFrame({'name1':['Dave','Json','Hash','Happy'],'year1':['2000','2001','2002','2003'],'money1':['1318','1551','15315','48644']},
                   index=['a','b','c','d'])
right = pd.DataFrame({'name2':['Andong','Json','Beihang','Happy'],'year2':['1955','2001','1999','2003'],'money2':['1318','1551','15315','48644']},
                   index=['a','e','g','d'])
print(left)
name1 year1 money1
a   Dave  2000   1318
b   Json  2001   1551
c   Hash  2002  15315
d  Happy  2003  48644
print(right)
name2 year2 money2
a   Andong  1955   1318
e     Json  2001   1551
g  Beihang  1999  15315
d    Happy  2003  48644
print(pd.merge(right,left,left_index=True,right_index=True))
name2 year2 money2  name1 year1 money1
a  Andong  1955   1318   Dave  2000   1318
d   Happy  2003  48644  Happy  2003  48644

For this case, pandas' join() method does the job.

print(left.join(right,how='inner'))
#join() requires that the two frames have no overlapping column names
name1 year1 money1   name2 year2 money2
a   Dave  2000   1318  Andong  1955   1318
d  Happy  2003  48644   Happy  2003  48644

Concatenating Along an Axis

Concatenation, as I understand it, stacks several tables together: columns missing from one table are appended and filled with NaN, and matching labels simply line up. Note that duplicate index labels are kept as-is by default (pass verify_integrity=True if you want an error raised instead).

The method for concatenation is concat(); we will go through its parameters step by step.

left = pd.DataFrame({'name1':['Dave','Json','Hash','Happy'],'year1':['2000','2001','2002','2003'],'money1':['1318','1551','15315','48644']},
                   index=['a','b','c','d'])
right = pd.DataFrame({'name2':['Andong','Json','Beihang','Happy'],'year2':['1955','2001','1999','2003'],'money2':['1318','1551','15315','48644']},
                   index=['a','e','g','d'])
print(pd.concat([left,right]))
money1 money2  name1    name2 year1 year2
a   1318    NaN   Dave      NaN  2000   NaN
b   1551    NaN   Json      NaN  2001   NaN
c  15315    NaN   Hash      NaN  2002   NaN
d  48644    NaN  Happy      NaN  2003   NaN
a    NaN   1318    NaN   Andong   NaN  1955
e    NaN   1551    NaN     Json   NaN  2001
g    NaN  15315    NaN  Beihang   NaN  1999
d    NaN  48644    NaN    Happy   NaN  2003


C:\Users\Lenovo\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  """Entry point for launching an IPython kernel.

Starting with Series

data1 = pd.Series(range(5))
data2 = pd.Series(range(7,12),index=range(5,10))
print(data1)
0    0
1    1
2    2
3    3
4    4
dtype: int64
print(data2)
5     7
6     8
7     9
8    10
9    11
dtype: int64
print(pd.concat([data1,data2]))
0     0
1     1
2     2
3     3
4     4
5     7
6     8
7     9
8    10
9    11
dtype: int64

To distinguish the different Series by adding labels, the parameter to add is keys=.

a=pd.concat([data1,data2],keys=['key1','key2'])
print(a)
key1  0     0
      1     1
      2     2
      3     3
      4     4
key2  5     7
      6     8
      7     9
      8    10
      9    11
dtype: int64
print(a.unstack())
0    1    2    3    4    5    6    7     8     9
key1  0.0  1.0  2.0  3.0  4.0  NaN  NaN  NaN   NaN   NaN
key2  NaN  NaN  NaN  NaN  NaN  7.0  8.0  9.0  10.0  11.0

Without quite noticing it, we have created a hierarchical index.

print(pd.concat([data1,data2],axis=1,keys=['key1','key2']))  #this looks like new columns, but keys= is really distinguishing the two Series
key1  key2
0   0.0   NaN
1   1.0   NaN
2   2.0   NaN
3   3.0   NaN
4   4.0   NaN
5   NaN   7.0
6   NaN   8.0
7   NaN   9.0
8   NaN  10.0
9   NaN  11.0

If we want to choose exactly which index labels appear, we can use the join_axes= parameter.

print(pd.concat([data1,data2],axis=1,keys=['key1','key2'],join_axes=[[1,2,5,3]]))  #only the requested labels 1, 2, 5, 3 are kept, in that order
key1  key2
1   1.0   NaN
2   2.0   NaN
5   NaN   7.0
3   3.0   NaN
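
Be aware that join_axes was later removed from pandas (in 1.0); the same selection can be written with reindex. A sketch:

result = pd.concat([data1,data2],axis=1,keys=['key1','key2']).reindex([1,2,5,3])
print(result)  #same rows as the join_axes example above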

Combining Data with Overlap

So far we have combined data with merge and concat. But when two datasets overlap label-for-label, what we often want instead is a way to patch the missing values in one with values from the other.

For a Series with missing data, NumPy's where can fill the gaps from another Series:

#an example
test_a = pd.Series(['a','s','d','f','g'])
test_a[4]=np.nan
print(test_a)
0      a
1      s
2      d
3      f
4    NaN
dtype: object
test_b = pd.Series(['q','w','e','r','t'])
print(test_b)
0    q
1    w
2    e
3    r
4    t
dtype: object
np.info(np.where)
where(condition, [x, y])

Return elements chosen from `x` or `y` depending on `condition`.

.. note::
    When only `condition` is provided, this function is a shorthand for
    ``np.asarray(condition).nonzero()``. Using `nonzero` directly should be
    preferred, as it behaves correctly for subclasses. The rest of this
    documentation covers only the case where all three arguments are
    provided.

Parameters
----------
condition : array_like, bool
    Where True, yield `x`, otherwise yield `y`.
x, y : array_like
    Values from which to choose. `x`, `y` and `condition` need to be
    broadcastable to some shape.

Returns
-------
out : ndarray
    An array with elements from `x` where `condition` is True, and elements
    from `y` elsewhere.

See Also
--------
choose
nonzero : The function that is called when x and y are omitted

Notes
-----
If all the arrays are 1-D, `where` is equivalent to::

    [xv if c else yv
     for c, xv, yv in zip(condition, x, y)]

Examples
--------
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(a < 5, a, 10*a)
array([ 0,  1,  2,  3,  4, 50, 60, 70, 80, 90])

This can be used on multidimensional arrays too:

>>> np.where([[True, False], [True, True]],
...          [[1, 2], [3, 4]],
...          [[9, 8], [7, 6]])
array([[1, 8],
       [3, 4]])

The shapes of x, y, and the condition are broadcast together:

>>> x, y = np.ogrid[:3, :4]
>>> np.where(x < y, x, 10 + y)  # both x and 10+y are broadcast
array([[10,  0,  0,  0],
       [10, 11,  1,  1],
       [10, 11, 12,  2]])

>>> a = np.array([[0, 1, 2],
...               [0, 2, 4],
...               [0, 3, 6]])
>>> np.where(a < 4, a, -1)  # -1 is broadcast
array([[ 0,  1,  2],
       [ 0,  2, -1],
       [ 0,  3, -1]])
np.where(pd.isnull(test_a),test_b,test_a)  #where the condition holds, take test_b; elsewhere keep test_a
array(['a', 's', 'd', 'f', 't'], dtype=object)
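
The same fill can be written in pandas terms: Series.combine_first aligns on the index (rather than by position, as np.where does). A sketch with the same two Series:

print(test_a.combine_first(test_b))  #take test_b's value wherever test_a is missing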

Following the example above, we can patch data by overlaying tables and mapping one onto the other.

For DataFrames, the combine_first() method does the same kind of matching.

frame_a = pd.DataFrame({'a':['a','s','d','f'],'b':[np.nan,'x','c','r'],'c':['i','j','i',np.nan]})
frame_b = pd.DataFrame({'a':[np.nan,'c','d',np.nan],'b':['j','u','j',np.nan]})
print(frame_a)
a    b    c
0  a  NaN    i
1  s    x    j
2  d    c    i
3  f    r  NaN
print(frame_b)
a    b
0  NaN    j
1    c    u
2    d    j
3  NaN  NaN
print(frame_a.combine_first(frame_b))  #gaps in frame_a are filled from the corresponding positions of frame_b
a  b    c
0  a  j    i
1  s  x    j
2  d  c    i
3  f  r  NaN

Reshaping and Pivoting

Reshaping with Hierarchical Indexing

We already met two methods in the operations above: stack() and unstack().

stack(): DataFrame ----> Series (pivots the columns into the row index)
unstack(): Series ----> DataFrame (pivots a level of the row index into the columns)

data = pd.DataFrame({'a':[1,2,3,4],'b':[4,5,6,7]},index=['z','x','c','v'])
print(data)
a  b
z  1  4
x  2  5
c  3  6
v  4  7
a = data.stack()
print(a)
z  a    1
   b    4
x  a    2
   b    5
c  a    3
   b    6
v  a    4
   b    7
dtype: int64

At this point you can think of the axis as having been pivoted. Watch which direction the data rotates and where it lands: the column labels have rotated into the innermost level of the index.

print(a.unstack())
a  b
z  1  4
x  2  5
c  3  6
v  4  7
test = pd.Series(range(5),index=([['a','s','d','f','g'],['a','x','d','f','g']]))
print(test)
a  a    0
s  x    1
d  d    2
f  f    3
g  g    4
dtype: int64
print(test.unstack())
a    d    f    g    x
a  0.0  NaN  NaN  NaN  NaN
d  NaN  2.0  NaN  NaN  NaN
f  NaN  NaN  3.0  NaN  NaN
g  NaN  NaN  NaN  4.0  NaN
s  NaN  NaN  NaN  NaN  1.0
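
One detail to keep in mind: stack() drops the missing values it would otherwise create, so the round trip is not always lossless; pass dropna=False to keep them. A sketch:

print(test.unstack().stack(dropna=False))  #retains all label combinations, NaN included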

Such pivoting always starts from the innermost level by default.

Recall that we can name the levels of each axis. So, rather than always pivoting from the innermost level, we can pass unstack() and stack() the number of the level to pivot (the outermost level is 0) or its name.

frame = pd.DataFrame(np.ceil(np.random.uniform(1,999,(4,4))),index=[['a','a','d','d'],['z','x','c','v']],
                     columns=[['haha','haha','lala','lala'],['q','w','e','r']])
print(frame)
haha          lala       
         q      w      e      r
a z  816.0  802.0  729.0  245.0
  x  102.0  511.0  227.0  494.0
d c  146.0  710.0  894.0  343.0
  v  466.0  196.0  580.0  141.0
frame.index.names=['key_a','state_a']
frame.columns.names=['key_b','state_b']
print(frame)
key_b           haha          lala       
state_b            q      w      e      r
key_a state_a                            
a     z        816.0  802.0  729.0  245.0
      x        102.0  511.0  227.0  494.0
d     c        146.0  710.0  894.0  343.0
      v        466.0  196.0  580.0  141.0
frame.unstack()  #the innermost index level moves over to become the innermost column level

key_b     haha                                                    lala         
state_b      q                           w                           e          
state_a      c      v      x      z      c      v      x      z      c      v   
key_a                                                                           
a          NaN    NaN  102.0  816.0    NaN    NaN  511.0  802.0    NaN    NaN   
d        146.0  466.0    NaN    NaN  710.0  196.0    NaN    NaN  894.0  580.0   

key_b                                              
state_b                    r                       
state_a      x      z      c      v      x      z  
key_a                                              
a        227.0  729.0    NaN    NaN  494.0  245.0  
d          NaN    NaN  343.0  141.0    NaN    NaN  

print(frame.stack(0))  #the outermost column level moves to the innermost index level
state_b                  e      q      r      w
key_a state_a key_b                            
a     z       haha     NaN  816.0    NaN  802.0
              lala   729.0    NaN  245.0    NaN
      x       haha     NaN  102.0    NaN  511.0
              lala   227.0    NaN  494.0    NaN
d     c       haha     NaN  146.0    NaN  710.0
              lala   894.0    NaN  343.0    NaN
      v       haha     NaN  466.0    NaN  196.0
              lala   580.0    NaN  141.0    NaN
frame.unstack(0) #the outermost index level ends up as the innermost column level

key_b     haha                        lala                       
state_b      q             w             e             r         
key_a        a      d      a      d      a      d      a      d
state_a                                                         
c          NaN  146.0    NaN  710.0    NaN  894.0    NaN  343.0
v          NaN  466.0    NaN  196.0    NaN  580.0    NaN  141.0
x        102.0    NaN  511.0    NaN  227.0    NaN  494.0    NaN
z        816.0    NaN  802.0    NaN  729.0    NaN  245.0    NaN

print(frame.unstack(1))  #level 1 is the inner index level (state_a), so this matches the default unstack()
key_b     haha                                                    lala         
state_b      q                           w                           e          
state_a      c      v      x      z      c      v      x      z      c      v   
key_a                                                                           
a          NaN    NaN  102.0  816.0    NaN    NaN  511.0  802.0    NaN    NaN   
d        146.0  466.0    NaN    NaN  710.0  196.0    NaN    NaN  894.0  580.0

key_b                                              
state_b                    r                       
state_a      x      z      c      v      x      z  
key_a                                              
a        227.0  729.0    NaN    NaN  494.0  245.0  
d          NaN    NaN  343.0  141.0    NaN    NaN

Across all these examples one rule holds: whichever level you pivot, it always lands in the innermost level of the destination axis.
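
Levels can also be addressed by name rather than by number, which reads better once they are named; a sketch with the frame above:

print(frame.unstack('key_a'))  #same as frame.unstack(0)
print(frame.stack('key_b'))    #same as frame.stack(0)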

Pivoting 'Long' to 'Wide' Format

To see concretely what this title means, let's load some data.

data = pd.read_csv('E:/Datawhale数据分析/PythonForDataAnalysis-master/PythonForDataAnalysis-master/ch08/macrodata.csv')

The dataset can be downloaded from: https://github.com/wen-fei/PythonForDataAnalysis

print(data.head())
year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi  
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98   
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15   
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35   
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37   
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54

      m1  tbilrate  unemp      pop  infl  realint  
0  139.7      2.82    5.8  177.146  0.00     0.00  
1  141.7      3.08    5.1  177.830  2.34     0.74  
2  140.5      3.82    5.3  178.657  2.74     1.09  
3  140.0      4.33    5.6  179.386  0.27     4.06  
4  139.6      3.50    5.2  180.007  2.31     1.19
periods=pd.PeriodIndex(year=data.year,quarter=data.quarter,name='data')
print(periods)  #a PeriodIndex built from the year/quarter pairs
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='data', length=203, freq='Q-DEC')
columns = pd.Index(['realgdp','infl','unemp'],name='item')  
print(columns)
Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')
data = data.reindex(columns = columns)  #keep only the three columns of interest
print(data.head())
item   realgdp  infl  unemp
0     2710.349  0.00    5.8
1     2778.801  2.34    5.1
2     2775.488  2.74    5.3
3     2785.204  0.27    5.6
4     2847.699  2.31    5.2
data.index = periods.to_timestamp('D','end')
print(data.head())
item                            realgdp  infl  unemp
data                                                
1959-03-31 23:59:59.999999999  2710.349  0.00    5.8
1959-06-30 23:59:59.999999999  2778.801  2.34    5.1
1959-09-30 23:59:59.999999999  2775.488  2.74    5.3
1959-12-31 23:59:59.999999999  2785.204  0.27    5.6
1960-03-31 23:59:59.999999999  2847.699  2.31    5.2
ldata = data.stack().reset_index()  #note how the shape of the data changes here
print(ldata.head())  #'item' is the column index name we defined earlier
data     item         0
0 1959-03-31 23:59:59.999999999  realgdp  2710.349
1 1959-03-31 23:59:59.999999999     infl     0.000
2 1959-03-31 23:59:59.999999999    unemp     5.800
3 1959-06-30 23:59:59.999999999  realgdp  2778.801
4 1959-06-30 23:59:59.999999999     infl     2.340
ldata = data.stack().reset_index().rename(columns={0:'values'})
print(ldata.head())
data     item    values
0 1959-03-31 23:59:59.999999999  realgdp  2710.349
1 1959-03-31 23:59:59.999999999     infl     0.000
2 1959-03-31 23:59:59.999999999    unemp     5.800
3 1959-06-30 23:59:59.999999999  realgdp  2778.801
4 1959-06-30 23:59:59.999999999     infl     2.340

As you can see, the different variable names have been gathered into a single item column.

In a DataFrame we usually prefer each variable in its own column; that is what pivot() is for.

pivoted = ldata.pivot('data','item','values')  #DataFrame.pivot(index=None, columns=None, values=None)
print(pivoted.head())                          #pivot() builds a new frame from the given index, columns, and values columns, in that order
item                           infl   realgdp  unemp
data                                                
1959-03-31 23:59:59.999999999  0.00  2710.349    5.8
1959-06-30 23:59:59.999999999  2.34  2778.801    5.1
1959-09-30 23:59:59.999999999  2.74  2775.488    5.3
1959-12-31 23:59:59.999999999  0.27  2785.204    5.6
1960-03-31 23:59:59.999999999  2.31  2847.699    5.2
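
It may help to see that pivot() is essentially shorthand for set_index followed by unstack; a sketch that should reproduce the frame above:

equivalent = ldata.set_index(['data','item'])['values'].unstack('item')
print(equivalent.head())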

Now let's add one more column, value_a, to ldata. Note that np.random.randint(len(ldata)) returns a single scalar here, which pandas broadcasts down the whole column (hence the repeated 371 below).

ldata['value_a'] = np.random.randint(len(ldata))
print(ldata.head())
data     item    values  value_a
0 1959-03-31 23:59:59.999999999  realgdp  2710.349      371
1 1959-03-31 23:59:59.999999999     infl     0.000      371
2 1959-03-31 23:59:59.999999999    unemp     5.800      371
3 1959-06-30 23:59:59.999999999  realgdp  2778.801      371
4 1959-06-30 23:59:59.999999999     infl     2.340      371

If we omit the last argument to pivot() on this DataFrame, the result has hierarchical columns:

pivoted = ldata.pivot('data','item')
print(pivoted.head())
values                 value_a              
item                            infl   realgdp unemp    infl realgdp unemp
data                                                                      
1959-03-31 23:59:59.999999999   0.00  2710.349   5.8     371     371   371
1959-06-30 23:59:59.999999999   2.34  2778.801   5.1     371     371   371
1959-09-30 23:59:59.999999999   2.74  2775.488   5.3     371     371   371
1959-12-31 23:59:59.999999999   0.27  2785.204   5.6     371     371   371
1960-03-31 23:59:59.999999999   2.31  2847.699   5.2     371     371   371

Pivoting 'Wide' to 'Long' Format

Going to wide format, as pivot() did with item above, spreads the distinct values of one column out into separate columns.

Going to long format is the inverse: several columns are gathered into a single value column, distinguished by a key column. The method for this is pd.melt().

pd.melt() docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html

pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

frame = pd.DataFrame({'a':[1,2,3,4],'b':[4,5,6,7]},index=['z','x','c','v'])
print(frame)
a  b
z  1  4
x  2  5
c  3  6
v  4  7
print(frame.melt())
variable  value
0        a      1
1        a      2
2        a      3
3        a      4
4        b      4
5        b      5
6        b      6
7        b      7
test = pd.melt(frame,id_vars='a')
print(test)
a variable  value
0  1        b      4
1  2        b      5
2  3        b      6
3  4        b      7
#now put a and b together in a single column
test = pd.melt(frame,value_vars=['a','b'])
print(test)
variable  value
0        a      1
1        a      2
2        a      3
3        a      4
4        b      4
5        b      5
6        b      6
7        b      7

What if we want to go back to wide format?

print(test.reset_index(drop=True))  #the index is already 0..7, so this just shows the frame unchanged
variable  value
0        a      1
1        a      2
2        a      3
3        a      4
4        b      4
5        b      5
6        b      6
7        b      7
shape = test.pivot(index=None,columns='variable',values='value')  #DataFrame.pivot(index=None, columns=None, values=None)
print(shape)
variable    a    b
0         1.0  NaN
1         2.0  NaN
2         3.0  NaN
3         4.0  NaN
4         NaN  4.0
5         NaN  5.0
6         NaN  6.0
7         NaN  7.0
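
To close the loop, a sketch of a cleaner round trip: keep an identifier column when melting, so that pivot() can rebuild the original frame exactly (the column name 'index' is produced by reset_index()):

melted = pd.melt(frame.reset_index(),id_vars='index')
restored = melted.pivot(index='index',columns='variable',values='value')
print(restored)  #back to one row per original index label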