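Series and DataFrame, Correlation, and NaN Handling

刘旺學長 2022-07-04 20:40:55

Original work by 刘旺學長, published on the 51CTO blog. Copyright belongs to the author; contact the author for permission before reprinting. Unauthorized reprinting may be subject to legal action.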

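Core pandas data structures

pandas is built on NumPy and adds a number of methods of its own.

Series

A Series represents one-dimensional data. It resembles a Python list or a NumPy array, with some extra features.

Internally a Series consists of two associated arrays: the main array holds the data, which can be of any NumPy data type, and a second array holds the index, which defaults to integers starting from 0. Each element of the main array has an associated index label.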

Creating a Series object

1. Pass an array-like object (for example a list) to the Series constructor

The index parameter can be used to specify the index labels
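The original screenshots for this step are not available, so here is a minimal sketch of both forms of the constructor. The sample values and labels are made up for illustration.

```python
import pandas as pd

# Default index: 0, 1, 2, ...
s = pd.Series([12, -4, 7, 9])

# Explicit labels through the index parameter
labeled = pd.Series([12, -4, 7, 9], index=["a", "b", "c", "d"])

print(s)
print(labeled)
```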

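2. A Series can also be created from an ndarray

Note: when the Series shares memory with the ndarray, modifying an element of the Series also changes the original ndarray. Whether memory is shared depends on the pandas version; with Copy-on-Write enabled the constructor copies the array.

3. A Series object can also be passed in; this returns a new Series object that can still refer to the same underlying data

Note: under the same conditions, modifying the new Series also affects the original Series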
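Because the sharing behaviour differs between pandas versions, the sketch below prints whether memory is shared instead of assuming it, and then shows copy=True as one way to make sure the new Series is independent. The array contents are made up.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# Whether this Series shares memory with arr depends on the pandas
# version and on Copy-on-Write, so the result is printed, not assumed.
shared = pd.Series(arr)
print(np.shares_memory(arr, shared.to_numpy()))

# copy=True gives the Series its own data, so arr is left alone.
independent = pd.Series(arr, copy=True)
independent.iloc[0] = 100
print(arr[0])

# The same applies when building a Series from another Series.
original = pd.Series([1, 2, 3])
detached = pd.Series(original, copy=True)
detached.iloc[0] = -1
print(original.iloc[0])
```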

4. Values can include the missing-value marker np.nan (the np.NaN alias was removed in NumPy 2.0)

5. Pass a dictionary

If a dictionary is passed to the Series constructor, its keys become the index and its values become the values of the Series
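A short sketch of both cases, using made-up values: a list containing np.nan, and a dictionary whose keys become the index.

```python
import numpy as np
import pandas as pd

# A missing value forces a floating-point dtype
with_nan = pd.Series([5, np.nan, 7])
print(with_nan.dtype)

# Dictionary keys become index labels, dictionary values become the data
colors = pd.Series({"red": 2000, "blue": 1000, "yellow": 500})
print(colors)
```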

Series attributes and methods

1. Viewing the index and values of a Series

2. Length of a Series

3. Getting the distinct values

Calling unique() on a Series returns an array of its distinct values (an ndarray for NumPy-backed data, not a Series)

4. Counting how often each value occurs

value_counts() returns a Series that maps each distinct value to its number of occurrences
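The four items above can be tried on one small Series. This is only an illustrative sketch; the values and labels are invented.

```python
import pandas as pd

s = pd.Series([1, 0, 2, 1, 2, 3], index=list("abcdef"))

print(s.index)    # the labels
print(s.values)   # the data as a NumPy array
print(len(s))     # number of elements

distinct = s.unique()        # array of distinct values, in order of first appearance
counts = s.value_counts()    # Series: value -> number of occurrences
print(distinct)
print(counts)
```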

5. Checking whether a Series contains certain values

isin() takes a list of candidate values and returns a boolean Series marking the elements that match any of them

Indexing the original Series with this boolean Series selects the matching elements
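As a small sketch with invented data, isin() builds the mask and the mask then selects the matching elements.

```python
import pandas as pd

s = pd.Series([1, 0, 2, 1, 2, 3])

mask = s.isin([0, 3])   # boolean Series, True where the value is 0 or 3
selected = s[mask]      # keeps only the matching elements

print(mask)
print(selected)
```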

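6. Checking whether elements are null or non-null

isnull() returns a boolean Series that is True where a value is missing

For the non-null check, call notnull(), which returns the inverse

7. Getting the index of the minimum and maximum values

Call idxmin() and idxmax(), which return the index labels of the smallest and largest values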
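A minimal sketch of these methods on a made-up Series that contains one missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, -3.0, np.nan, 14.0], index=["a", "b", "c", "d"])

print(s.isnull())         # True only at "c"
present = s[s.notnull()]  # drops the missing entry
print(present)

# Labels of the smallest and largest values; NaN is skipped by default
lowest = s.idxmin()
highest = s.idxmax()
print(lowest, highest)
```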

Accessing elements

Elements can be accessed by position (starting from 0) or by index label. In current pandas the explicit forms are iloc for positions and loc for labels

A Series also supports slicing
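The sketch below uses the explicit iloc and loc accessors, which is an assumption about style rather than what the lost screenshot showed. The data is made up.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=["a", "b", "c", "d"])

by_label = s["b"]             # label lookup
by_position = s.iloc[1]       # position lookup, counting from 0
several = s[["b", "c"]]       # several labels at once

head = s.iloc[0:2]            # positional slice, end excluded
span = s.loc["a":"c"]         # label slice, end included

print(by_label, by_position)
print(several)
print(head)
print(span)
```

Note the difference in the two slices: positional slices exclude the end point, label slices include it.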

Filtering elements

Comparison operators can be applied directly to a Series; the result is a boolean Series

Indexing the Series with that boolean Series selects the matching elements
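For illustration, with invented values, a comparison produces the mask and the mask does the filtering:

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9])

mask = s > 8        # boolean Series
large = s[mask]     # elements greater than 8

print(mask)
print(large)
```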

Series operations

1. Arithmetic on a Series is applied to every element of its values

NumPy provides many mathematical functions, and a Series can be passed to them directly

2. When several Series are combined, values with the same index label are operated on together; a label that appears in only one Series gives NaN in the result
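A small sketch of element-wise arithmetic, a NumPy function applied to a Series, and label alignment between two Series. All names and numbers are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=list("abcd"))

halved = s / 2            # applied to every element
magnitudes = np.abs(s)    # NumPy function applied to a Series

left = pd.Series({"red": 1, "blue": 2, "green": 3})
right = pd.Series({"blue": 10, "green": 20, "black": 30})
total = left + right      # aligned by label; unmatched labels give NaN

print(halved)
print(magnitudes)
print(total)
```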

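DataFrame

A DataFrame resembles a relational table. It can be seen as a set of Series that share one index: each column is a Series, and different columns can have different data types

Creating a DataFrame object

1. Pass a dictionary to the DataFrame constructor; each key becomes a column name and each value supplies that column's elements

Only some of the dictionary's key-value pairs can be loaded, by listing the wanted keys in the columns parameter

The row labels can of course be customized through index

2. Pass an array of elements together with an array of index labels and an array of column names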
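The constructor variants above might look like this. The column names, labels and values are invented for the example.

```python
import numpy as np
import pandas as pd

data = {
    "color": ["blue", "green", "yellow"],
    "object": ["ball", "pen", "pencil"],
    "price": [1.2, 1.0, 0.6],
}

df = pd.DataFrame(data)                                    # all keys become columns
subset = pd.DataFrame(data, columns=["object", "price"])   # only some keys
labeled = pd.DataFrame(data, index=["one", "two", "three"])

# Data array plus explicit row and column labels
grid = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["r1", "r2", "r3"],
    columns=["x", "y", "z"],
)

print(df)
print(subset)
print(labeled)
print(grid)
```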

Accessing elements

1. View the column names through the columns attribute

2. View the row labels through the index attribute

3. Get the elements through the values attribute

4. Getting the contents of a column

Index with the column name

If the column name is a string that is also a valid Python identifier, the column is available as an attribute as well
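A brief sketch of these attributes and the two ways of reading a column, on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"object": ["ball", "pen", "pencil"], "price": [1.2, 1.0, 0.6]},
    index=["one", "two", "three"],
)

print(df.columns)   # column names
print(df.index)     # row labels
print(df.values)    # the data as a 2-D NumPy array

by_name = df["price"]    # column selected by name, returned as a Series
by_attr = df.price       # same column through attribute access

print(by_name)
```

Bracket access works for any column name; attribute access fails for names that contain spaces or that clash with existing DataFrame attributes.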

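5. Getting the contents of a row

Use DataFrame.iloc[position] to select a row by position

A row can also be selected by its label, with DataFrame.loc[label]

To select several rows, pass a list to iloc (or loc)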
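Assuming the accessor meant here is iloc (with loc as its label-based counterpart), row selection can be sketched as follows. The DataFrame is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"object": ["ball", "pen", "pencil"], "price": [1.2, 1.0, 0.6]},
    index=["one", "two", "three"],
)

first = df.iloc[0]          # row by position, returned as a Series
by_label = df.loc["two"]    # row by label
several = df.iloc[[0, 2]]   # several rows, returned as a DataFrame

print(first)
print(by_label)
print(several)
```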

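6. Slicing

A DataFrame is backed by ndarrays and can be sliced in a similar way; a slice such as df[0:2] selects rows by position

7. Getting a single value

Both dimensions must be given. In the chained form df[column][row] the column name comes first; df.loc[row, column] takes the row label first

Naming the rows and columns of a DataFrame

The names of index and columns (index.name and columns.name) are empty by default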
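The following sketch covers row slicing, reading one value, and naming the two axes. The data and the names "id" and "item" are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"object": ["ball", "pen", "pencil"], "price": [1.2, 1.0, 0.6]},
    index=["one", "two", "three"],
)

top = df[0:2]                       # first two rows, end excluded

chained = df["price"]["two"]        # column first, then row
direct = df.loc["two", "price"]     # row first, then column

print(df.index.name, df.columns.name)   # both None by default
df.index.name = "id"
df.columns.name = "item"

print(top)
print(chained, direct)
print(df)
```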

DataFrame operations

1. Adding a column

A column is a Series, so a Series can be assigned directly. Note that values are aligned on the index: the Series labels should match the DataFrame's row labels, and rows without a matching label get NaN
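As an illustrative sketch with made-up column names, a scalar is broadcast to every row while a Series is aligned on the row labels:

```python
import pandas as pd

df = pd.DataFrame({"price": [1.2, 1.0, 0.6]}, index=["one", "two", "three"])

# A scalar fills the whole new column
df["in_stock"] = True

# A Series is aligned on the index; "two" has no match and becomes NaN
df["qty"] = pd.Series([10, 20], index=["one", "three"])

print(df)
```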

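2. Checking whether the DataFrame contains certain values

As with a Series, use isin(); the resulting boolean mask can then be used to pick out the matching elements

3. Deleting a column

Use the del statement, for example del df[column]

4. Filtering with comparison operators

Same as for a Series

5. Swapping rows and columns

The underlying data is a two-dimensional array, that is a matrix, so it can be transposed through the T attribute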
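These four operations can be sketched on one small invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

mask = df.isin([1, 5])    # boolean DataFrame
matches = df[mask]        # non-matching cells become NaN

del df["c"]               # removes column "c" in place

big = df[df > 2]          # cells that fail the test become NaN
flipped = df.T            # rows and columns swapped

print(matches)
print(big)
print(flipped)
```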

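Index object

The Index object matters in both Series and DataFrame; many operations are optimized around it

Checking whether an index is unique

Use the is_unique attribute of the Index object

Changing the index

reindex() on a Series returns a new Series with the labels in the requested order; labels that did not exist before get NaN as their value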
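A minimal sketch of is_unique and reindex(), with made-up labels:

```python
import pandas as pd

s = pd.Series([2, 5, 7, 4], index=["one", "two", "three", "four"])
dup = pd.Series([1, 2], index=["x", "x"])

print(s.index.is_unique)     # True
print(dup.index.is_unique)   # False

# Existing labels are reordered; "five" is new and gets NaN
reordered = s.reindex(["three", "four", "five", "one"])
print(reordered)
```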

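Filling the index

If many labels are missing from a Series index, reindex() can also fill them in through its method parameter (the existing index must be monotonically increasing or decreasing)

1. method='ffill' (forward fill): a missing label takes the value of the nearest existing label before it

2. method='bfill' (backward fill): a missing label takes the value of the nearest existing label after it

3. reindex on a DataFrame

Columns of a DataFrame can be filled in the same way

With bfill a missing column takes its values from the next existing column (to its right); with ffill, from the previous existing column (to its left)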
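The sketch below shows forward and backward filling for the Series case only, on an invented Series whose index has gaps.

```python
import pandas as pd

# Labels 1, 2 and 4 are missing from the index
s = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

forward = s.reindex(range(7), method="ffill")    # take the previous existing value
backward = s.reindex(range(7), method="bfill")   # take the next existing value

print(forward)
print(backward)
```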

Dropping index labels

1. drop() removes the given index labels together with their values and returns a new Series

The original Series is not changed

2. Dropping index labels from a DataFrame

This likewise returns a new DataFrame

Columns can be dropped as well, by passing axis=1
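A short sketch of drop() on a Series and on a DataFrame, using made-up labels:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3], index=["red", "blue", "yellow", "white"])
trimmed = s.drop(["blue", "white"])   # new Series; s itself is unchanged

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["r1", "r2", "r3"],
    columns=["x", "y", "z"],
)
fewer_rows = df.drop("r2")                 # drops a row label
fewer_cols = df.drop(["x", "z"], axis=1)   # drops column labels

print(trimmed)
print(fewer_rows)
print(fewer_cols)
```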

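Arithmetic and data alignment

1. Operations between objects of the same type

When two Series are combined, only elements with the same index label are operated on together

DataFrames behave similarly: only elements whose column name and index label both match are operated on together

2. Operations between a Series and a DataFrame

By default the index of the Series is matched against the columns of the DataFrame, and the operation is applied to every row

Labels that are not shared by both produce NaN in the result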
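To make the alignment rules concrete, here is a small sketch with invented labels: two DataFrames that overlap in a single cell, and a Series subtracted from a DataFrame.

```python
import pandas as pd

a = pd.DataFrame([[1, 2], [3, 4]], index=["r1", "r2"], columns=["x", "y"])
b = pd.DataFrame([[10, 20], [30, 40]], index=["r2", "r3"], columns=["y", "z"])

# Only the cell (r2, y) exists in both frames; every other cell is NaN
total = a + b

# The Series index is matched against the columns of a, row by row.
# "w" exists only in the Series, so that column is all NaN.
s = pd.Series({"x": 1, "y": 1, "w": 1})
diff = a - s

print(total)
print(diff)
```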

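Applying NumPy functions and custom functions

pandas is built on NumPy. A ufunc (universal function) is a function that operates element by element on the contents of a data structure

NumPy functions

1. Square root, for example

Pass a Series or DataFrame directly to NumPy's sqrt() function

2. Statistical functions

axis=0 computes one result per column, axis=1 one result per row

Other functions such as sum and max work the same way

describe() shows a set of summary statistics at once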
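A minimal sketch of a ufunc and of the axis parameter, on a made-up 4x4 DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list("abcd"))

roots = np.sqrt(df)            # element-wise square root, still a DataFrame

col_means = df.mean(axis=0)    # one mean per column
row_sums = df.sum(axis=1)      # one sum per row

print(roots)
print(col_means)
print(row_sums)
print(df.describe())           # count, mean, std, min, quartiles, max per column
```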

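Custom functions

A custom function here operates on a one-dimensional array and returns a single number. Apply it with the apply() method of a DataFrame or Series. On a DataFrame it is called once per row or column: axis=0 passes each column to the function, axis=1 passes each row

1. A custom function on a DataFrame that computes the sum of squares of each row or column

On choosing axis=1 or axis=0: axis=0 runs down the rows and gives one result per column; axis=1 runs across the columns and gives one result per row

2. Using a lambda expression

To square each element of a Series, a lambda expression can be passed to apply() directly

3. A custom function that returns a Series

The function given to apply() does not have to return a scalar; it can also return a Series

For example, finding the maximum and minimum of every row or column of a DataFrame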
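The three uses of apply() described above could be sketched like this. The helper names sum_of_squares and min_max are hypothetical, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

def sum_of_squares(values):
    # values is one column (axis=0) or one row (axis=1), as a Series
    return (values ** 2).sum()

per_column = df.apply(sum_of_squares, axis=0)
per_row = df.apply(sum_of_squares, axis=1)

# On a Series, apply() calls the function once per element
squared = pd.Series([1, 2, 3]).apply(lambda x: x ** 2)

def min_max(values):
    # Returning a Series yields one labelled row per statistic
    return pd.Series([values.min(), values.max()], index=["min", "max"])

extremes = df.apply(min_max)

print(per_column)
print(per_row)
print(squared)
print(extremes)
```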

Sorting and ranking Series and DataFrame

Sorting a Series

1. Sorting by index

Use sort_index(); ascending=True sorts in ascending order, which is the default

2. Sorting by value

Use sort_values(); ascending order is the default

Sorting a DataFrame

1. Sorting by index

Use sort_index() as above; ascending defaults to True and axis defaults to 0

2. Sorting by column name

Pass axis=1 to sort_index()

3. Sorting by the values in a column

Pass the column to sort by through the by parameter of sort_values()

Note: sort_values() reorders whole rows according to the chosen column; it does not sort the values within each row independently
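A compact sketch of the sorting calls on invented data; each call returns a new object and leaves the original untouched.

```python
import pandas as pd

s = pd.Series([5, 0, 3], index=["c", "a", "b"])
s_by_label = s.sort_index()                    # a, b, c
s_by_value = s.sort_values(ascending=False)    # largest first

df = pd.DataFrame({"y": [3, 1, 2], "x": [9, 8, 7]}, index=["r3", "r1", "r2"])
rows_sorted = df.sort_index()          # by row label
cols_sorted = df.sort_index(axis=1)    # by column name
by_x = df.sort_values(by="x")          # whole rows reordered by column "x"

print(s_by_label)
print(s_by_value)
print(rows_sorted)
print(cols_sorted)
print(by_x)
```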

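Ranking

Ranking sorts the values by size and returns each element's position in that order, for example where it falls when counting from smallest to largest

1. Ranking a Series

Use rank()

2. Ranking a DataFrame

Similar to a Series, but with an axis: axis=0 ranks within each column, axis=1 ranks within each row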
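As a sketch with made-up numbers, rank() assigns 1 to the smallest value by default:

```python
import pandas as pd

s = pd.Series([5, 0, 3, 8, 4], index=list("abcde"))
ascending_rank = s.rank()                    # 1 = smallest
descending_rank = s.rank(ascending=False)    # 1 = largest

df = pd.DataFrame({"x": [3, 1, 2], "y": [10, 30, 20]})
within_columns = df.rank(axis=0)   # each column ranked on its own
within_rows = df.rank(axis=1)      # each row ranked on its own

print(ascending_rank)
print(descending_rank)
print(within_columns)
print(within_rows)
```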

Handling NaN data

1. Creating NaN data

When constructing data, NaN can be assigned directly by using NumPy's nan

For example, set one of the values to np.nan while constructing a Series

None can also be used

2. Dropping NaN data

If NaN makes up only a small share of the data set, dropping it outright is an option

Use dropna()

On a DataFrame, axis selects what is dropped: every row (axis=0, the default) or every column (axis=1) that contains NaN

The dropping policy can also be set: with how='all', a row or column is dropped only if all of its values are NaN. The default is 'any'
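The sketch below contrasts how='any' with how='all' on an invented DataFrame in which one row and one column are entirely NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, np.nan, 9])
clean = s.dropna()

df = pd.DataFrame(
    [[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
    index=["blue", "green", "red"],
    columns=["ball", "mug", "pen"],
)

any_rows = df.dropna()                       # every row has a NaN, so nothing is left
all_rows = df.dropna(how="all")              # only the all-NaN row "green" is dropped
all_cols = df.dropna(axis=1, how="all")      # only the all-NaN column "mug" is dropped

print(clean)
print(any_rows)
print(all_rows)
print(all_cols)
```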

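3. Checking for null or non-null values

It is better, though, to do the dropping on a copy. notnull() returns a boolean Series that tells, for each index label, whether the value is present

To test for null, call isnull(); it can be used to select the NaN entries

4. Filling null values

1. Use fillna(); the argument is the value to fill in

2. A dictionary can also be passed to fillna() to give a fill value per column name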
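A final sketch, with made-up column names and fill values, of selecting the missing entries and filling them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ball": [6, np.nan, 2], "mug": [np.nan, np.nan, 4.0]})

ball = df["ball"]
missing = ball[ball.isnull()]    # only the NaN entries
present = ball[ball.notnull()]   # only the non-NaN entries

zeros = df.fillna(0)                              # one value for every gap
per_column = df.fillna({"ball": 1, "mug": 99})    # a value per column

print(missing)
print(present)
print(zeros)
print(per_column)
```

fillna() returns a new object here, so df still contains its NaN values afterwards.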
