Index Reshaping (reshape)

import numpy as np 
import pandas as pd

There are a number of basic operations for rearranging tabular data. These are alternately referred to as reshape or pivot operations.

Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

stack
    This "rotates" or pivots from the columns in the data to the rows, lengthening the row index.

unstack
    This pivots from the rows into the columns.

I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

data

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

Using the stack method on this data pivots the columns into the rows, producing a Series.

"stack 将每一行, 叠成一个Series, 堆起来"
result = data.stack()

result
'stack 将每一行, 叠成一个Series, 堆起来'






state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:

"unstack 将叠起来的Series, 变回DF"

result.unstack()
'unstack 将叠起来的Series, 变回DF'

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

By default, the innermost level is unstacked (the same is true for stack). You can unstack a different level by passing a level number or name:

result.unstack(level=0)

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

result.unstack(level='state')

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups.

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

data2 = pd.concat([s1, s2], keys=['one', 'two'])

data2
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

data2.unstack()  # behaves like an outer join: missing label combinations become NaN

       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

%time data2.unstack().stack()
Wall time: 5 ms

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
%time data2.unstack().stack(dropna=False)
Wall time: 3 ms

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:

df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df

side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

df.unstack("state")

side    left            right
state   Ohio  Colorado  Ohio  Colorado
number
one        0         3     5         8
two        1         4     6         9
three      2         5     7        10

When calling stack, we can indicate the name of the axis to stack:

%time df.unstack('state').stack('side')
Wall time: 118 ms

state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7

Pivoting 'Long' to 'Wide' Format

A common way to store multiple time series in databases and CSV files is in so-called long or stacked format. Let's load some example data and do a small amount of time series wrangling and other data cleaning:

%%time

data = pd.read_csv("../examples/macrodata.csv")

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 14 columns):
year        203 non-null float64
quarter     203 non-null float64
realgdp     203 non-null float64
realcons    203 non-null float64
realinv     203 non-null float64
realgovt    203 non-null float64
realdpi     203 non-null float64
cpi         203 non-null float64
m1          203 non-null float64
tbilrate    203 non-null float64
unemp       203 non-null float64
pop         203 non-null float64
infl        203 non-null float64
realint     203 non-null float64
dtypes: float64(14)
memory usage: 22.3 KB
Wall time: 142 ms
data.head()

     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19

# build a quarterly PeriodIndex from the year and quarter columns
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')

# keep only three of the columns and name the column index 'item'
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
data = data.reindex(columns=columns)

# use end-of-quarter timestamps as the row index
data.index = periods.to_timestamp('D', 'end')

# pivot the columns into rows, then reset to a long-format DataFrame
ldata = data.stack().reset_index().rename(columns={0: 'value'})


ldata[:10]

         date     item     value
0  1959-03-31  realgdp  2710.349
1  1959-03-31     infl     0.000
2  1959-03-31    unemp     5.800
3  1959-06-30  realgdp  2778.801
4  1959-06-30     infl     2.340
5  1959-06-30    unemp     5.100
6  1959-09-30  realgdp  2775.488
7  1959-09-30     infl     2.740
8  1959-09-30    unemp     5.300
9  1959-12-31  realgdp  2785.204

This is so-called long format for multiple time series, or other observational data with two or more keys. Each row in the table represents a single observation.

Data is frequently stored this way in relational databases like MySQL, as a fixed schema allows the number of distinct values in the item column to change as data is added to the table. In the previous example, date and item would usually be the primary keys, offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct item value, indexed by the timestamps in the date column. DataFrame's pivot method performs exactly this transformation:

pivoted = ldata.pivot(index='date', columns='item', values='value')

pivoted[:5]

item        infl   realgdp  unemp
date
1959-03-31  0.00  2710.349    5.8
1959-06-30  2.34  2778.801    5.1
1959-09-30  2.74  2775.488    5.3
1959-12-31  0.27  2785.204    5.6
1960-03-31  2.31  2847.699    5.2

The first two arguments are the columns to be used as the row and column index, respectively, and the final, optional values argument is the column used to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:

ldata['value2'] = np.random.randn(len(ldata))

ldata[:10]

         date     item     value    value2
0  1959-03-31  realgdp  2710.349 -0.143460
1  1959-03-31     infl     0.000 -0.422318
2  1959-03-31    unemp     5.800  0.389872
3  1959-06-30  realgdp  2778.801 -0.208526
4  1959-06-30     infl     2.340 -1.538956
5  1959-06-30    unemp     5.100 -0.143273
6  1959-09-30  realgdp  2775.488  0.385763
7  1959-09-30     infl     2.740  0.564365
8  1959-09-30    unemp     5.300  0.266295
9  1959-12-31  realgdp  2785.204 -1.267871

By omitting the last argument, you obtain a DataFrame with hierarchical columns:
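The code cell for this step did not survive in these notes, so here is a minimal sketch of what it would look like, reusing the ldata DataFrame and the value/value2 columns created above (the names pivoted2 and unstacked are only for illustration; the random value2 numbers will differ on each run):

# Sketch: pivot without a values argument uses all remaining columns,
# producing hierarchical columns (value/value2 on the outer level, item on the inner).
pivoted2 = ldata.pivot(index='date', columns='item')
pivoted2.head()        # columns like ('value', 'realgdp'), ('value2', 'infl'), ...
pivoted2['value'][:5]  # select one block of values

# pivot is essentially a shortcut for set_index followed by unstack:
unstacked = ldata.set_index(['date', 'item']).unstack(level='item')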

Pivoting 'Wide' to 'Long' Format

An inverse operation to pivot for DataFrames is pandas.melt. Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input. Let's look at an example:

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

df

   key  A  B  C
0  foo  1  4  7
1  bar  2  5  8
2  baz  3  6  9

The 'key' column may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which columns (if any) are group indicators. Let's use 'key' as the only group indicator here:

melted = pd.melt(df, ['key'])

melted

   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9

Using pivot, we can reshape back to the original layout:

reshaped = melted.pivot(index='key', columns='variable', values='value')

reshaped

variable  A  B  C
key
bar       2  5  8
baz       3  6  9
foo       1  4  7

Since the result of pivot creates an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:

reshaped.reset_index()

variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7

You can also specify a subset of columns to use as value columns:

pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])

   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6

pandas.melt can be used without any group identifiers, too:

pd.melt(df, value_vars=['A', 'B', 'C'])

  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9

pd.melt(df, value_vars=['key', 'A', 'B'])

  variable value
0      key   foo
1      key   bar
2      key   baz
3        A     1
4        A     2
5        A     3
6        B     4
7        B     5
8        B     6

Summary

Now that you have some pandas basics for data import, cleaning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to pandas later in the book when we discuss more advanced analytics.

Patience and persistence always pay off.