01 Pandas: Data Structures

刘旺學長 2022-07-04 20:39:11 Category: Data Analysis

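© Copyright belongs to the author. This is an original work by 刘旺學長, a blogger on 51CTO. Contact the author for permission before reposting; otherwise legal action may be taken.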

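Pandas Data Structures

Anyone doing data analysis, data mining, or machine learning in Python will find it hard to get by without pandas. It is especially useful for data preprocessing.

This article introduces the two main data structures in pandas: Series and DataFrame.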

import pandas as pd

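1. Series

Let's start with the Series data structure.

A Series is a one-dimensional, array-like object. The basics to master are:

Constructing a Series
Getting the data and index of a Series
Previewing data
Getting data by index
Series operations
The name attribute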

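1.1 Constructing a Series

Constructing a Series from a list

Passing a list (or another sequence, such as the range used here) to pd.Series() converts it into a Series.

You can confirm this by printing the object's type, which shows Series.

ser_obj = pd.Series(range(10, 20))

print(type(ser_obj))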

<class 'pandas.core.series.Series'>
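
As a small complement to the example above, here is a sketch of building a Series from an actual list while supplying custom labels through the index parameter. The variable name scores and the labels 'a', 'b', 'c' are made up for illustration and do not appear in the original examples.

```python
import pandas as pd

# Hypothetical data: three scores with hand-picked string labels.
scores = pd.Series([90, 85, 77], index=['a', 'b', 'c'])

print(type(scores))        # <class 'pandas.core.series.Series'>
print(scores['b'])         # 85, looked up by label
print(list(scores.index))  # ['a', 'b', 'c']
```

Without the index argument, pandas would number the rows 0, 1, 2 as in the range example.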

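Constructing a Series from a dict

Each key in the dict becomes an index label, and the corresponding value becomes the data. The values should share one data type; if they are mixed, pandas falls back to the generic object dtype.

year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
print(ser_obj2.head())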

2001    17.8
2002    20.1
2003    16.5
dtype: float64
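
To see why a consistent value type matters, the following sketch compares a dict of floats with a dict that mixes a float and a string. The names same_type and mixed, and the 'n/a' placeholder value, are assumptions made for this illustration.

```python
import pandas as pd

# All values are floats, so the Series gets a numeric dtype.
same_type = pd.Series({2001: 17.8, 2002: 20.1})

# One value is a string, so pandas can only store generic Python objects.
mixed = pd.Series({2001: 17.8, 2002: 'n/a'})

print(same_type.dtype)  # float64
print(mixed.dtype)      # object
```

An object-typed Series still works, but numeric operations on it are slower and can fail on the non-numeric entries.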

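1.2 Getting the Data and Index

For a Series, the .values attribute returns its data and the .index attribute returns its index. Both are attributes, not methods, so no parentheses are needed.

In the example below, the index is not printed label by label. Instead, a RangeIndex is printed, whose parameters are the start (inclusive), the stop (exclusive), and the step, which is 1 here.

# Get the data
print(ser_obj.values)

# Get the index
print(ser_obj.index)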

[10 11 12 13 14 15 16 17 18 19]
RangeIndex(start=0, stop=10, step=1)
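
The sketch below looks a little closer at what these two attributes return, assuming a reasonably recent pandas where RangeIndex exposes start, stop, and step. The variable names ser and values are chosen for this example only.

```python
import numpy as np
import pandas as pd

ser = pd.Series(range(10, 20))

values = ser.values                    # attribute access, no parentheses
print(isinstance(values, np.ndarray))  # True: the data is a NumPy array

# A RangeIndex stores only start/stop/step; list() expands it into labels.
print(ser.index.start, ser.index.stop, ser.index.step)  # 0 10 1
print(list(ser.index))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```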

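1.3 Previewing Data

When a dataset is large and you only want to check its layout, you can pull out the first few rows.

Call .head() directly: with no argument it returns the first 5 rows, and passing n returns the first n rows.

# Preview the data
print(ser_obj.head(3))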

0    10
1    11
2    12
dtype: int64
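
For completeness, here is a short sketch of the default behaviour of head() next to its counterpart tail(), which previews the end of the data. tail() is not covered in the original text; it is included here only as a closely related call.

```python
import pandas as pd

ser = pd.Series(range(10, 20))

first_five = ser.head()  # no argument: first 5 rows
last_two = ser.tail(2)   # last 2 rows

print(len(first_five))    # 5
print(last_two.tolist())  # [18, 19]
```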

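1.4 Getting Data

You can get a value from a Series through its index label, placed in square brackets []. Here the labels are the default integers 0 to 9, so they coincide with the positions.

# Get data by index
print(ser_obj[0])
print(ser_obj[8])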

10
18
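
Labels and positions only coincide when the index is the default 0, 1, 2, and so on. The sketch below uses a deliberately shuffled integer index (an assumption made for illustration) to show the difference, using .loc for label-based and .iloc for position-based access.

```python
import pandas as pd

# Hypothetical Series whose integer labels are not in positional order.
ser = pd.Series([10, 11, 12], index=[2, 0, 1])

print(ser.loc[0])   # 11: the value whose label is 0
print(ser.iloc[0])  # 10: the value in the first position
```

Being explicit with .loc and .iloc avoids ambiguity whenever the index holds integers.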

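1.5 Operations

Applying an arithmetic operation (addition, subtraction, multiplication, division) to a Series applies it to every element and returns a Series of the same length.

# The index-to-value mapping is preserved in the result of the operation
print(ser_obj * 3)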

0    30
1    33
2    36
3    39
4    42
5    45
6    48
7    51
8    54
9    57
dtype: int64

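Besides ordinary arithmetic, comparisons work element-wise too. The example below yields True for every value greater than 15 and False for every other value, including 15 itself.

print(ser_obj > 15)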

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8     True
9     True
dtype: bool
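
A common use of such a boolean Series is as a filter. The following sketch, with the names mask and filtered chosen for this example, keeps only the rows where the condition holds.

```python
import pandas as pd

ser = pd.Series(range(10, 20))

mask = ser > 15       # boolean Series, one flag per element
filtered = ser[mask]  # keep only rows where the flag is True

print(filtered.tolist())     # [16, 17, 18, 19]
print(list(filtered.index))  # [6, 7, 8, 9]: original labels are kept
print(int(mask.sum()))       # 4: True counts as 1 when summed
```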

1.6 The name Attribute

You can give custom names to both the values and the index of a Series.

# The name attribute
ser_obj2.name = 'score'
ser_obj2.index.name = 'year'
print(ser_obj2.head())

year
2001    17.8
2002    20.1
2003    16.5
Name: score, dtype: float64
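
One place where these names become useful is when a Series is turned into a table. The sketch below uses to_frame(), which is not part of the original walkthrough, to show that the Series name carries over as the column name and the index name is preserved.

```python
import pandas as pd

ser = pd.Series({2001: 17.8, 2002: 20.1, 2003: 16.5})
ser.name = 'score'
ser.index.name = 'year'

frame = ser.to_frame()  # one-column DataFrame

print(list(frame.columns))  # ['score']
print(frame.index.name)     # year
```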

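2. DataFrame

A DataFrame is a two-dimensional, tabular data structure, much like an Excel spreadsheet.

Different columns can hold different data types, but the values within one column should share a single type.

A DataFrame has both a row index and a column index.

To get comfortable with the basics of DataFrame, you need to know the following.

Two ways to construct a DataFrame: from an ndarray and from a dict
Getting data by index
Adding and deleting data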

import numpy as np

2.1 Constructing a DataFrame

Constructing a DataFrame from an ndarray

# First create an ndarray (shape 5 x 4)
array = np.random.randn(5, 4)
print(array)

# Pass the ndarray to pd.DataFrame() to get a DataFrame
df_obj = pd.DataFrame(array)
print(df_obj.head())

[[-1.15943918  0.41562598  0.24219151 -0.54127251]
 [-0.72949761  0.7299977  -0.35770911 -1.55597979]
 [-0.26508669  0.73079105  0.019037   -0.28775191]
 [ 2.35757276  0.54826604 -1.10932131  0.36925581]
 [ 0.60940029  0.11843865 -0.30061918  0.44980428]]
          0         1         2         3
0 -1.159439  0.415626  0.242192 -0.541273
1 -0.729498  0.729998 -0.357709 -1.555980
2 -0.265087  0.730791  0.019037 -0.287752
3  2.357573  0.548266 -1.109321  0.369256
4  0.609400  0.118439 -0.300619  0.449804

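In the DataFrame built above, the leftmost column is the row index and the top row is the column index. Unless you specify them, pandas generates default integer row and column indexes.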
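
If the default integer labels are not what you want, both indexes can be supplied at construction time. This is a minimal sketch; the row labels 'r1' to 'r3' and column names 'x' and 'y' are invented for the example, and a fixed array is used instead of random numbers so the result is predictable.

```python
import numpy as np
import pandas as pd

data = np.arange(6).reshape(3, 2)  # [[0, 1], [2, 3], [4, 5]]

# index sets the row labels, columns sets the column labels.
df = pd.DataFrame(data, index=['r1', 'r2', 'r3'], columns=['x', 'y'])

print(df.shape)           # (3, 2)
print(list(df.columns))   # ['x', 'y']
print(df.loc['r2', 'y'])  # 3
```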

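Constructing a DataFrame from a dict

Recall that when a Series is built from a dict, the keys become the index. In a DataFrame, the keys become the column index (the column names).

Passing a dict to pd.DataFrame() produces a DataFrame. Scalar values such as those for 'A', 'B', and 'F' below are repeated to fill every row.

dict_data = {'A': 1.,
             'B': pd.Timestamp('20161223'),
             'C': pd.Series(1, index=list(range(4)), dtype='float32'),
             'D': np.array([3] * 4, dtype='int32'),
             'E': pd.Categorical(["Python", "Java", "C++", "C#"]),
             'F': 'wangxiaocao'}
# print(dict_data)
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2.head())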

     A          B    C  D       E            F
0  1.0 2016-12-23  1.0  3  Python  wangxiaocao
1  1.0 2016-12-23  1.0  3    Java  wangxiaocao
2  1.0 2016-12-23  1.0  3     C++  wangxiaocao
3  1.0 2016-12-23  1.0  3      C#  wangxiaocao
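
Because each column of a dict-built DataFrame keeps its own type, it can help to inspect the dtypes attribute. The sketch below rebuilds a reduced version of the table (columns B and F are left out here to keep it short) and checks the type of each column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': 1.0,                                             # scalar, repeated per row
    'C': pd.Series(1, index=range(4), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(['Python', 'Java', 'C++', 'C#']),
})

print(df.shape)   # (4, 4)
print(df.dtypes)  # one dtype per column: float64, float32, int32, category
```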

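2.2 Getting Data by Index

For now, here is a brief look at getting data through the column index.

As the name suggests, selecting by a column label returns that entire column. The column comes back as a Series.

So a DataFrame can be viewed as a collection of Series, one per column.

There are two ways to do this:
1. df_obj2['F']
2. df_obj2.F

# Method 1
print(df_obj2['F'])
print(type(df_obj2['F']))

# Method 2
print(df_obj2.F)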

0    wangxiaocao
1    wangxiaocao
2    wangxiaocao
3    wangxiaocao
Name: F, dtype: object
<class 'pandas.core.series.Series'>
0    wangxiaocao
1    wangxiaocao
2    wangxiaocao
3    wangxiaocao
Name: F, dtype: object
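
Two details are worth a quick sketch: selecting several columns at once, and a case where the attribute style does not work. The toy DataFrame and its column named 'count' are assumptions chosen to trigger the name clash with the built-in DataFrame.count method.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'count': [5, 6]})

one = df['A']         # a single label returns a Series
two = df[['A', 'B']]  # a list of labels returns a DataFrame

print(type(one).__name__)  # Series
print(type(two).__name__)  # DataFrame

# df.count resolves to the DataFrame method, not the column,
# so bracket access is the safe way to reach this column.
print(callable(df.count))    # True
print(df['count'].tolist())  # [5, 6]
```

For this reason, bracket access is generally the more robust of the two methods.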

2.3 Adding and Deleting Columns

# Add a column
df_obj2['G'] = df_obj2['D'] + 4
print(df_obj2.head())

     A          B    C  D       E            F  G
0  1.0 2016-12-23  1.0  3  Python  wangxiaocao  7
1  1.0 2016-12-23  1.0  3    Java  wangxiaocao  7
2  1.0 2016-12-23  1.0  3     C++  wangxiaocao  7
3  1.0 2016-12-23  1.0  3      C#  wangxiaocao  7

# Delete a column
del df_obj2['G']
print(df_obj2.head())

     A          B    C  D       E            F
0  1.0 2016-12-23  1.0  3  Python  wangxiaocao
1  1.0 2016-12-23  1.0  3    Java  wangxiaocao
2  1.0 2016-12-23  1.0  3     C++  wangxiaocao
3  1.0 2016-12-23  1.0  3      C#  wangxiaocao
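
The del statement modifies the DataFrame in place. As an alternative not shown in the original, drop() returns a new DataFrame and leaves the source untouched. The sketch assumes a pandas version that supports the columns keyword of drop(), and uses a small stand-in table.

```python
import pandas as pd

df = pd.DataFrame({'D': [3, 3], 'G': [7, 7]})

trimmed = df.drop(columns=['G'])  # new object without column G

print(list(trimmed.columns))  # ['D']
print(list(df.columns))       # ['D', 'G']: the original is unchanged
```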

3. The Index Object

Both pandas data structures depend heavily on indexes, so here is a summary of the essentials.

First, a key property of an index: it is immutable.

# Index objects are immutable
df_obj2.index[0] = 2

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-17-7f40a356d7d1> in <module>()
      1 # Index objects are immutable
----> 2 df_obj2.index[0] = 2


/home/cc/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in __setitem__(self, key, value)
   1243 
   1244     def __setitem__(self, key, value):
-> 1245         raise TypeError("Index does not support mutable operations")
   1246 
   1247     def __getitem__(self, key):


TypeError: Index does not support mutable operations
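
Immutability applies to individual elements of an Index, not to the index slot of the DataFrame. The sketch below, using a small stand-in table and made-up labels 10, 20, 30, catches the error and then replaces the whole index in one assignment.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

try:
    df.index[0] = 2  # element-wise assignment is rejected
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Assigning a complete new set of labels is allowed.
df.index = [10, 20, 30]
print(list(df.index))  # [10, 20, 30]
```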

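Common Index types include:

Index
Int64Index: integer index (removed in pandas 2.0, where a plain Index with an int64 dtype is used instead)
MultiIndex: hierarchical index
DatetimeIndex: index of timestamps

print(type(ser_obj.index))
print(type(df_obj2.index))

print(df_obj2.index)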

<class 'pandas.indexes.range.RangeIndex'>
<class 'pandas.indexes.numeric.Int64Index'>
Int64Index([0, 1, 2, 3], dtype='int64')
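
The two remaining index types in the list can be sketched as follows. The dates, tuples, and variable names are invented for the example; pd.date_range and pd.MultiIndex.from_tuples are just one possible way to build these indexes.

```python
import pandas as pd

# DatetimeIndex: three consecutive days starting on a chosen date.
dates = pd.date_range('2016-12-23', periods=3, freq='D')
ser_dates = pd.Series([1, 2, 3], index=dates)

# MultiIndex: each label is a tuple of (outer level, inner level).
multi = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
ser_multi = pd.Series([10, 20, 30], index=multi)

print(isinstance(ser_dates.index, pd.DatetimeIndex))  # True
print(isinstance(ser_multi.index, pd.MultiIndex))     # True
print(ser_multi.loc['a'].tolist())  # [10, 20]: all rows under outer label 'a'
```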

Note: some of the examples come from Robin's course at 小象学院.
