I. Introduction to pandas data structures

1. Series

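A Series is a one-dimensional array-like object that holds a sequence of values together with an associated array of data labels, called its index. The simplest Series is built from nothing more than an array of data: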

import pandas as pd
import numpy as np
obj = pd.Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64

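In an interactive environment a Series is displayed with the index on the left and the values on the right. Because we did not specify an index for the data, a default one running from 0 to N-1 (where N is the length of the data) is created. The values and index attributes return the array of values and the index object, respectively: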

obj.values
array([ 4,  7, -5,  3])
obj.index
RangeIndex(start=0, stop=4, step=1)

Unlike with a NumPy array, you can give a Series an index of labels and use those labels to select values:

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
obj2['a']
-5
obj2.iloc[1]  # select by position; plain obj2[1] is deprecated on a label-indexed Series
7
obj2[['c', 'a', 'd']]
c    3
a   -5
d    4
dtype: int64
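
As a small sketch of the difference between label-based and position-based access, the example below uses loc for labels and iloc for integer positions. The variable name s is illustrative and not part of the session above.

```python
import pandas as pd

# Illustrative Series with string labels (same data as obj2 above)
s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

by_label = s.loc['a']        # label-based lookup
by_position = s.iloc[1]      # position-based lookup (second element)
subset = s.loc[['c', 'a']]   # a list of labels returns a new Series

print(by_label, by_position)
print(subset)
```

Spelling out loc or iloc makes the intent explicit, which matters most when the index itself contains integers.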

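As with NumPy arrays, you can apply NumPy functions and NumPy-style operations to a Series, such as filtering with a boolean array, scalar multiplication, or math functions. The link between each label and its value is preserved: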

obj2[obj2 > 0]
d    4
b    7
c    3
dtype: int64
obj2 * 2
d     8
b    14
a   -10
c     6
dtype: int64
np.exp(obj2)
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
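
The following sketch, which uses an illustrative Series s, checks that filtering and NumPy ufuncs keep each value attached to its label and return a Series rather than a bare array:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]        # boolean mask drops the label 'a'
doubled = s * 2            # scalar arithmetic keeps the index unchanged
roots = np.sqrt(positive)  # a NumPy ufunc returns a Series, not a bare array

print(list(positive.index))
print(doubled.index.equals(s.index))
print(type(roots).__name__)
```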

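A Series can also be thought of as a fixed-length, ordered dict, since it maps index labels to data values. It can be built directly from a dict; if you also pass an index, values are looked up by those labels, and any label without a matching key gets NaN: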

'b' in obj2
True
sdata = {'Ohio' : 3500, 'Texas' : 71000, 'Oregon' : 16000, 'Utah' : 5000}
obj3 = pd.Series(sdata)
obj3
Ohio       3500
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64
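
A minimal sketch of the dict analogy, with illustrative names: membership tests look at the labels, and to_dict converts the Series back to a plain dict.

```python
import pandas as pd

sdata = {'Ohio': 3500, 'Texas': 71000}
s = pd.Series(sdata)

print('Ohio' in s)            # membership tests look at the labels
print(3500 in s)              # a value is not "in" the Series
print(s.to_dict() == sdata)   # round-trips back to the original dict

# Passing index= selects and orders keys; unknown labels become NaN
s2 = pd.Series(sdata, index=['Texas', 'California'])
print(s2.isna().tolist())
```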

pandas uses the isnull and notnull functions to detect missing data; they are also available as Series methods:

pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
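
A brief sketch of how these checks are typically combined with boolean indexing. It uses isna and notna, which are aliases of isnull and notnull; the data is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3500.0, 16000.0],
              index=['California', 'Ohio', 'Oregon'])

missing_count = int(s.isna().sum())   # each True counts as 1
present = s[s.notna()]                # keep only the non-missing entries
same = s.isna().equals(s.isnull())    # isna is an alias of isnull

print(missing_count)
print(present)
```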

In many applications, a very useful feature of Series is that arithmetic operations automatically align on index labels:

obj3
Ohio       3500
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
obj4
California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64
obj3 + obj4
California         NaN
Ohio            7000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
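
When NaN for unmatched labels is not the desired result, one option is the add method with fill_value. The sketch below uses small illustrative Series a and b rather than obj3 and obj4:

```python
import pandas as pd

a = pd.Series({'Ohio': 3500, 'Utah': 5000})
b = pd.Series({'Ohio': 3500, 'Texas': 71000})

plain = a + b                     # labels missing from either side give NaN
filled = a.add(b, fill_value=0)   # treat a missing side as 0 instead

print(plain)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; for counts it often is, for measurements it usually is not.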

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

obj4.index.name = 'states'
obj4
states
California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64
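
The session above sets only the index name. As a sketch of one place where both names are used, converting a named Series to a DataFrame carries them over. The names population and state below are illustrative.

```python
import pandas as pd

s = pd.Series([3500, 16000], index=['Ohio', 'Oregon'])
s.name = 'population'     # name of the Series itself
s.index.name = 'state'    # name of its index

df = s.to_frame()         # the Series name becomes the column label
print(df.columns.tolist())
print(df.index.name)
```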

2. DataFrame

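A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index. Although a DataFrame is two-dimensional, you can use hierarchical indexing to represent higher-dimensional data in it.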

There are many ways to construct a DataFrame. One of the most common is from a dict of equal-length lists or NumPy arrays:

data = {'state' : ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 'year' : [2000, 2001, 2002, 2001, 2002, 2003], 'pop' : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003

If you specify a sequence of columns, the DataFrame's columns are arranged in that order:

pd.DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

If you pass a column that is not contained in the dict, it appears in the result filled with missing values:

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
print(frame2)
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column of a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
frame2.state
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

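frame2[column] works for any column name, but frame2.column works only when the column name is a valid Python variable name.
Note that the returned Series has the same index as the original DataFrame, and its name attribute is set to the column name.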
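
A short sketch of why bracket notation is the safer default. The frame df and its column names are hypothetical: one name contains a space, and another collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame: 'pop 2000' is not a valid identifier,
# and 'count' is also the name of a DataFrame method
df = pd.DataFrame({'state': ['Ohio'], 'pop 2000': [1.5], 'count': [3]})

print(df['pop 2000'].iloc[0])   # bracket access works for any label
print(df['count'].iloc[0])      # bracket access returns the column
print(callable(df.count))       # attribute access finds the method instead
print(df['state'].name)         # the returned Series is named after its column
```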

Rows can also be retrieved by label with the special loc attribute (or by position with iloc):

frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
frame2.loc['three', 'state']
'Ohio'
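
The sketch below, on a small illustrative frame df, selects the same row once by label and once by position, then reads a single cell:

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002],
                   'state': ['Ohio', 'Ohio', 'Nevada']},
                  index=['one', 'two', 'three'])

row_by_label = df.loc['three']     # row selected by its index label
row_by_position = df.iloc[2]       # same row selected by integer position
cell = df.loc['three', 'state']    # single value: row label, column label

print(row_by_label)
print(cell)
```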

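Columns can be modified by assignment. For example, the empty debt column can be assigned a scalar value or an array of values: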

frame2['debt'] = 16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

frame2['debt'] = np.arange(6)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5

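If you assign a Series, its labels are realigned to the DataFrame's index, and missing values are inserted where there is no match:

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN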
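
The sketch below (the frame df is hypothetical) contrasts Series assignment with assigning a plain list, which carries no labels and therefore must match the frame's length:

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# Series labels are matched against df.index; 'one' and 'three' get NaN
df['debt'] = pd.Series([-1.2], index=['two'])

# A plain list has no labels, so its length must equal len(df)
try:
    df['bad'] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

print(df)
print(length_error)
```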

The del keyword deletes a column from a DataFrame. To demonstrate, first add a new column:

frame2['eastern'] = frame2.state == 'Ohio'
frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
del frame2['eastern']
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
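The column returned when you index a DataFrame is a view on the underlying data, not a copy, so in-place modifications to the Series are reflected in the DataFrame. (This describes pandas without Copy-on-Write; with Copy-on-Write enabled, which is the default from pandas 3.0, such modifications no longer propagate.) If you need an independent copy, call the Series's copy method explicitly.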
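
Because view semantics vary between pandas versions, the sketch below shows only the explicit-copy half, which should behave the same everywhere. The frame df is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]})

col = df['pop'].copy()   # explicit, independent copy of the column
col.iloc[0] = 99.0       # modify only the copy

print(col.iloc[0])
print(df['pop'].iloc[0])  # the original frame is unchanged
```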

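You can transpose a DataFrame (swap its rows and columns) with syntax similar to that of a NumPy array. Here the DataFrame is built from a nested dict, whose outer keys become the columns and whose inner keys become the row index: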

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

A dict of Series can also be used to construct a DataFrame:

pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7

If the index and columns of a DataFrame have their name attributes set, the names are displayed as well:

frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

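The values attribute returns the data contained in the DataFrame as a two-dimensional ndarray. If the columns have different dtypes, the dtype of the array is chosen to accommodate all of them (object in the second example):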

frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
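
A minimal sketch of how the resulting dtype depends on the columns, using two illustrative frames named numeric and mixed:

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
mixed = pd.DataFrame({'a': [1.0, 2.0], 'b': ['x', 'y']})

print(numeric.values.dtype)       # homogeneous floats stay float64
print(mixed.values.dtype)         # mixed columns fall back to object
print(numeric.to_numpy().shape)   # to_numpy() is the method form of the same conversion
```

An object array gives up most of NumPy's speed advantages, so it is usually worth selecting the numeric columns first.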

3. Index objects

When you construct a Series or DataFrame, any array or other sequence of labels you pass is converted internally to an Index object:

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')

Index objects are immutable, so they cannot be modified by the user:

index[1] = 'd'
Traceback (most recent call last):
  ...
TypeError: Index does not support mutable operations
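
Since an Index cannot be edited in place, relabelling means producing a new object. The sketch below uses rename, which is one way to do so among several; the Series s is illustrative.

```python
import pandas as pd

s = pd.Series(range(3), index=['a', 'b', 'c'])

try:
    s.index[1] = 'd'      # in-place item assignment is rejected
    raised = False
except TypeError:
    raised = True

# Relabel by building a new Series with a new Index instead
renamed = s.rename(index={'b': 'd'})

print(raised)
print(list(renamed.index))
print(list(s.index))      # the original labels are untouched
```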

Immutability makes it safer to share Index objects among data structures:

labels = pd.Index(np.arange(3))
labels
Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
0    1.5
1   -2.5
2    0.0
dtype: float64
obj2.index is labels
True

In addition to being array-like, an Index also behaves like a fixed-size set:

frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

frame3.columns
Index(['Nevada', 'Ohio'], dtype='object', name='state')
'Ohio' in frame3.columns
True
2003 in frame3.index
False

Some Index methods and properties:

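Method         Description
append         Concatenate additional Index objects onto the original, producing a new Index
difference     Compute the set difference of two indexes
intersection   Compute the set intersection of two indexes
union          Compute the set union of two indexes
isin           Compute a boolean array indicating whether each value is contained in the passed collection
delete         Compute a new Index with the element at position i deleted
drop           Compute a new Index by deleting the passed values
insert         Compute a new Index by inserting an element at position i
is_monotonic   True if each element is greater than or equal to the previous one (is_monotonic_increasing in current pandas; is_monotonic was removed in pandas 2.0)
is_unique      True if the Index has no duplicate values
unique         Compute the array of unique values in the Index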
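
A short sketch exercising several of the methods listed above on two illustrative indexes, left and right. Each call returns a new object and leaves the original Index unchanged.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right).tolist())          # ['a', 'b', 'c', 'd']
print(left.intersection(right).tolist())   # ['b', 'c']
print(left.difference(right).tolist())     # ['a']
print(left.isin(['a', 'z']).tolist())      # [True, False, False]
print(left.insert(1, 'x').tolist())        # ['a', 'x', 'b', 'c']
print(left.delete(0).tolist())             # ['b', 'c']
print(left.is_unique, left.is_monotonic_increasing)
```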