Index Reshaping (reshape)

import numpy as np 
import pandas as pd

There are a number of basic operations for rearranging tabular data. These are alternately referred to as reshape or pivot operations.

Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

stack
    This "rotates" or pivots from the columns in the data to the rows, lengthening the row index.

unstack
    This pivots from the rows into the columns.

I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

data

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

Using the stack method on this data pivots the columns into the rows, producing a Series.

"stack 将每一行, 叠成一个Series, 堆起来"
result = data.stack()

result
'stack 将每一行, 叠成一个Series, 堆起来'






state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:

"unstack 将叠起来的Series, 变回DF"

result.unstack()
'unstack 将叠起来的Series, 变回DF'

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

By default, the innermost level is unstacked (the same is true for stack). You can unstack a different level by passing a level number or name:

result.unstack(level=0)

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

result.unstack(level='state')

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups.

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

data2 = pd.concat([s1, s2], keys=['one', 'two'])

data2
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

data2.unstack()  # behaves like an outer join: missing label combinations become NaN

       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

%time data2.unstack().stack()
Wall time: 5 ms

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
%time data2.unstack().stack(dropna=False)
Wall time: 3 ms

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:

df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df

side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

df.unstack("state")

side    left            right
state   Ohio  Colorado  Ohio  Colorado
number
one        0         3     5         8
two        1         4     6         9
three      2         5     7        10

When calling stack, we can indicate the name of the axis to stack:

%time df.unstack('state').stack('side')
Wall time: 118 ms

state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7

Pivoting 'Long' to 'Wide' Format

A common way to store multiple time series in databases and CSV files is in so-called long or stacked format. Let's load some example data and do a small amount of time series wrangling and other data cleaning:

%%time

data = pd.read_csv("../examples/macrodata.csv")

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 14 columns):
year        203 non-null float64
quarter     203 non-null float64
realgdp     203 non-null float64
realcons    203 non-null float64
realinv     203 non-null float64
realgovt    203 non-null float64
realdpi     203 non-null float64
cpi         203 non-null float64
m1          203 non-null float64
tbilrate    203 non-null float64
unemp       203 non-null float64
pop         203 non-null float64
infl        203 non-null float64
realint     203 non-null float64
dtypes: float64(14)
memory usage: 22.3 KB
Wall time: 142 ms
data.head()

     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19

# build a quarterly PeriodIndex from the year and quarter columns
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')

# keep only three of the columns and name the column index 'item'
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
data = data.reindex(columns=columns)

# use end-of-quarter timestamps as the row index
data.index = periods.to_timestamp('D', 'end')

# pivot the columns into rows, then reset to a long-format DataFrame
ldata = data.stack().reset_index().rename(columns={0: 'value'})


ldata[:10]

         date     item     value
0  1959-03-31  realgdp  2710.349
1  1959-03-31     infl     0.000
2  1959-03-31    unemp     5.800
3  1959-06-30  realgdp  2778.801
4  1959-06-30     infl     2.340
5  1959-06-30    unemp     5.100
6  1959-09-30  realgdp  2775.488
7  1959-09-30     infl     2.740
8  1959-09-30    unemp     5.300
9  1959-12-31  realgdp  2785.204

This is so-called long format for multiple time series, or other observational data with two or more keys. Each row in the table represents a single observation.

Data is frequently stored this way in relational databases like MySQL, as a fixed schema allows the number of distinct values in the item column to change as data is added to the table. In the previous example, date and item would usually be the primary keys, offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct item value, indexed by the timestamps in the date column. DataFrame's pivot method performs exactly this transformation:

pivoted = ldata.pivot(index='date', columns='item', values='value')

pivoted[:5]

item        infl   realgdp  unemp
date
1959-03-31  0.00  2710.349    5.8
1959-06-30  2.34  2778.801    5.1
1959-09-30  2.74  2775.488    5.3
1959-12-31  0.27  2785.204    5.6
1960-03-31  2.31  2847.699    5.2

The first two arguments are the columns to be used as the row and column index, respectively, and the final, optional values argument is the column used to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:

ldata['value2'] = np.random.randn(len(ldata))

ldata[:10]

         date     item     value    value2
0  1959-03-31  realgdp  2710.349 -0.143460
1  1959-03-31     infl     0.000 -0.422318
2  1959-03-31    unemp     5.800  0.389872
3  1959-06-30  realgdp  2778.801 -0.208526
4  1959-06-30     infl     2.340 -1.538956
5  1959-06-30    unemp     5.100 -0.143273
6  1959-09-30  realgdp  2775.488  0.385763
7  1959-09-30     infl     2.740  0.564365
8  1959-09-30    unemp     5.300  0.266295
9  1959-12-31  realgdp  2785.204 -1.267871

By omitting the last argument, you obtain a DataFrame with hierarchical columns:
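The code cell for this step did not survive in these notes, so here is a minimal sketch of what it would look like, reusing the ldata DataFrame and the value/value2 columns created above (the names pivoted2 and unstacked are only for illustration; the random value2 numbers will differ on each run):

# Sketch: pivot without a values argument uses all remaining columns,
# producing hierarchical columns (value/value2 on the outer level, item on the inner).
pivoted2 = ldata.pivot(index='date', columns='item')
pivoted2.head()        # columns like ('value', 'realgdp'), ('value2', 'infl'), ...
pivoted2['value'][:5]  # select one block of values

# pivot is essentially a shortcut for set_index followed by unstack:
unstacked = ldata.set_index(['date', 'item']).unstack(level='item')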

Pivoting 'Wide' to 'Long' Format

An inverse operation to pivot for DataFrames is pandas.melt. Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input. Let's look at an example:

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

df

   key  A  B  C
0  foo  1  4  7
1  bar  2  5  8
2  baz  3  6  9

The 'key' column may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which columns (if any) are group indicators. Let's use 'key' as the only group indicator here:

melted = pd.melt(df, ['key'])

melted

   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9

Using pivot, we can reshape back to the original layout:

reshaped = melted.pivot(index='key', columns='variable', values='value')

reshaped

variable  A  B  C
key
bar       2  5  8
baz       3  6  9
foo       1  4  7

Since the result of pivot creates an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:

reshaped.reset_index()

variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7

You can also specify a subset of columns to use as value columns:

pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])

   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6

pandas.melt can be used without any group identifiers, too:

pd.melt(df, value_vars=['A', 'B', 'C'])

  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9

pd.melt(df, value_vars=['key', 'A', 'B'])

  variable value
0      key   foo
1      key   bar
2      key   baz
3        A     1
4        A     2
5        A     3
6        B     4
7        B     5
8        B     6

Summary

Now that you have some pandas basics for data import, cleaning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to pandas later in the book when we discuss more advanced analytics.

Patience and persistence always pay off.