索引重塑reshape
There are a number of basic operations for rearanging tabular data. These are alternatingly referred to as reshape or pivot operations.
多层索引重塑
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
stack - 列拉长index
This "rotates" or pivots from the columns in the data to the rows.
unstack
This pivots from the rows into the columns.
I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:
number | one | two | three |
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
Using the stack method on this data pivots the columns into the rows, producing a Series.
From a hierarchically indexed Series, you can rearrage the data back into a DataFrame with unstack
number | one | two | three |
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
By default the innermost level is unstacked(same with stack). You can unstack a different level by passing a level number or name.
state | Ohio | Colorado |
number | ||
one | 0 | 3 |
two | 1 | 4 |
three | 2 | 5 |
state | Ohio | Colorado |
number | ||
one | 0 | 3 |
two | 1 | 4 |
three | 2 | 5 |
Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups.
a | b | c | d | e | |
one | 0.0 | 1.0 | 2.0 | 3.0 | NaN |
two | NaN | NaN | 4.0 | 5.0 | 6.0 |
When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:
side | left | right | |
state | number | ||
Ohio | one | 0 | 5 |
two | 1 | 6 | |
three | 2 | 7 | |
Colorado | one | 3 | 8 |
two | 4 | 9 | |
three | 5 | 10 |
side | left | right | ||
state | Ohio | Colorado | Ohio | Colorado |
number | ||||
one | 0 | 3 | 5 | 8 |
two | 1 | 4 | 6 | 9 |
three | 2 | 5 | 7 | 10 |
When calling stack, we can indicate the name of the axis to stack:
state | Colorado | Ohio | |
number | side | ||
one | left | 3 | 0 |
right | 8 | 5 | |
two | left | 4 | 1 |
right | 9 | 6 | |
three | left | 5 | 2 |
right | 10 | 7 |
长转宽形
A common way to store multiple time series in databases and CSV is in so-called long or stacked format. Let's load some example data and do a small amonut of time series wrangling and other data cleaning:
year | quarter | realgdp | realcons | realinv | realgovt | realdpi | cpi | m1 | tbilrate | unemp | pop | infl | realint | |
0 | 1959.0 | 1.0 | 2710.349 | 1707.4 | 286.898 | 470.045 | 1886.9 | 28.98 | 139.7 | 2.82 | 5.8 | 177.146 | 0.00 | 0.00 |
1 | 1959.0 | 2.0 | 2778.801 | 1733.7 | 310.859 | 481.301 | 1919.7 | 29.15 | 141.7 | 3.08 | 5.1 | 177.830 | 2.34 | 0.74 |
2 | 1959.0 | 3.0 | 2775.488 | 1751.8 | 289.226 | 491.260 | 1916.4 | 29.35 | 140.5 | 3.82 | 5.3 | 178.657 | 2.74 | 1.09 |
3 | 1959.0 | 4.0 | 2785.204 | 1753.7 | 299.356 | 484.052 | 1931.3 | 29.37 | 140.0 | 4.33 | 5.6 | 179.386 | 0.27 | 4.06 |
4 | 1960.0 | 1.0 | 2847.699 | 1770.5 | 331.722 | 462.199 | 1955.5 | 29.54 | 139.6 | 3.50 | 5.2 | 180.007 | 2.31 | 1.19 |
date | item | value | |
0 | 1959-03-31 | realgdp | 2710.349 |
1 | 1959-03-31 | infl | 0.000 |
2 | 1959-03-31 | unemp | 5.800 |
3 | 1959-06-30 | realgdp | 2778.801 |
4 | 1959-06-30 | infl | 2.340 |
5 | 1959-06-30 | unemp | 5.100 |
6 | 1959-09-30 | realgdp | 2775.488 |
7 | 1959-09-30 | infl | 2.740 |
8 | 1959-09-30 | unemp | 5.300 |
9 | 1959-12-31 | realgdp | 2785.204 |
This is so-called long format for multiple time series, or other observational data with two or more keys. Each row in the table represents a single observation.
Data is frequently stored this way in relational databases like MySQL, as a fixed schema allows the number of distinct values in the item columns to change as data is added to the table. In the previous example, date and keys offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct item value indexed by timestamps in the date column. DataFrame's pivot method performs exactly this transformation:
item | infl | realgdp | unemp |
date | |||
1959-03-31 | 0.00 | 2710.349 | 5.8 |
1959-06-30 | 2.34 | 2778.801 | 5.1 |
1959-09-30 | 2.74 | 2775.488 | 5.3 |
1959-12-31 | 0.27 | 2785.204 | 5.6 |
1960-03-31 | 2.31 | 2847.699 | 5.2 |
The first two values passed are the columns to be used respectively as the row and column index, then finally an optional value column to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:
date | item | value | valu2 | |
0 | 1959-03-31 | realgdp | 2710.349 | -0.143460 |
1 | 1959-03-31 | infl | 0.000 | -0.422318 |
2 | 1959-03-31 | unemp | 5.800 | 0.389872 |
3 | 1959-06-30 | realgdp | 2778.801 | -0.208526 |
4 | 1959-06-30 | infl | 2.340 | -1.538956 |
5 | 1959-06-30 | unemp | 5.100 | -0.143273 |
6 | 1959-09-30 | realgdp | 2775.488 | 0.385763 |
7 | 1959-09-30 | infl | 2.740 | 0.564365 |
8 | 1959-09-30 | unemp | 5.300 | 0.266295 |
9 | 1959-12-31 | realgdp | 2785.204 | -1.267871 |
By omitting the last argument, you obtain a DataFrame with hierarchical columns:
Wide to Long
An inverse operation to pivot for DataFrame is pandas.melt. Rather than transroming one columns into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input, Let's look at an example:
key | A | B | C | |
0 | foo | 1 | 4 | 7 |
1 | bar | 2 | 5 | 8 |
2 | baz | 3 | 6 | 9 |
The 'key' columns may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which colmuns are group indicators Let's use 'key' as the only group indicator here:
key | variable | value | |
0 | foo | A | 1 |
1 | bar | A | 2 |
2 | baz | A | 3 |
3 | foo | B | 4 |
4 | bar | B | 5 |
5 | baz | B | 6 |
6 | foo | C | 7 |
7 | bar | C | 8 |
8 | baz | C | 9 |
Using pivot, we can reshape back to the original layout:(布局)
variable | A | B | C |
key | |||
bar | 2 | 5 | 8 |
baz | 3 | 6 | 9 |
foo | 1 | 4 | 7 |
Since the result of pivot creats an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:
variable | key | A | B | C |
0 | bar | 2 | 5 | 8 |
1 | baz | 3 | 6 | 9 |
2 | foo | 1 | 4 | 7 |
You can also specify a subset of columns to use as value columns:
key | variable | value | |
0 | foo | A | 1 |
1 | bar | A | 2 |
2 | baz | A | 3 |
3 | foo | B | 4 |
4 | bar | B | 5 |
5 | baz | B | 6 |
pandas.melt can be used without any group identifiers, too:
variable | value | |
0 | A | 1 |
1 | A | 2 |
2 | A | 3 |
3 | B | 4 |
4 | B | 5 |
5 | B | 6 |
6 | C | 7 |
7 | C | 8 |
8 | C | 9 |
variable | value | |
0 | key | foo |
1 | key | bar |
2 | key | baz |
3 | A | 1 |
4 | A | 2 |
5 | A | 3 |
6 | B | 4 |
7 | B | 5 |
8 | B | 6 |
小结
Now that you have some pandas basics for data import, clearning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to pandas later in the book when we discuss more advance analytics.
耐心和恒心, 总会获得回报的.