Python DataFrame hierarchical indexing and loc: what kinds of indexing does a DataFrame have?

Reprinted

lingyuli 2023-09-27 15:51:41


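1. Common DataFrame attributes, functions, and indexing methods

1.1 Introduction to DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series that all share the same index. A column can be retrieved as a Series either with dictionary-style access (frame['columnname']) or with attribute access (frame.columnname). Rows can also be retrieved by position or by label.

Assigning to a column that does not exist creates a new column.

>>> del frame['xxx']  # delete a column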
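The access patterns described above can be sketched as follows. This is a minimal illustration; the frame and its column names are hypothetical sample data, not taken from the original article.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
frame = pd.DataFrame({"name": ["Tom", "Kim"], "age": [18, 16]})

by_key = frame["age"]      # dictionary-style access returns a Series
by_attr = frame.age        # attribute-style access returns the same column

frame["grade"] = 9         # assigning to a missing column creates it
del frame["grade"]         # del removes the column again

first_row = frame.iloc[0]  # a row selected by integer position
```

Attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute, so the dictionary style is the safer default.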

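1.2 Common DataFrame attributes

Attribute	Description
values	The DataFrame's values as a NumPy ndarray
index	Row index
index.name	Name of the row index
columns	Column index
columns.name	Name of the column index
ix	Selects rows (deprecated in pandas 0.20 and removed in 1.0; use loc for labels and iloc for positions)
ix[[x,y,...], [x,y,...]]	Selects the listed rows, then the listed columns
T	Transpose of the frame (rows and columns swapped)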
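A short sketch of these attributes on a current pandas release, where the removed ix accessor is replaced by iloc (positions) and loc (labels). The data is hypothetical and only mirrors the shape of the examples later in the article.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"Age": [18, 16, 19], "Height": [1.6, 1.5, 1.7]},
                  index=["No.1", "No.2", "No.3"])
df.index.name = "StudentID"
df.columns.name = "StudentInfo"

values = df.values        # 2-D NumPy array of the cell values
transposed = df.T         # rows and columns (and their names) are swapped

# What df.ix[[0, 2], [0, 1]] used to do, spelled out explicitly:
by_position = df.iloc[[0, 2], [0, 1]]
by_label = df.loc[["No.1", "No.3"], ["Age", "Height"]]
```

Here by_position and by_label select the same cells; choosing one accessor makes it explicit whether integers mean positions or labels, which is the ambiguity that led to ix being removed.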

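1.3 Common DataFrame functions

1.3.1 Function	Description
DataFrame(dict, columns=dict.index, index=[dict.columnnum])	Builds a DataFrame
DataFrame(2-D ndarray)	A matrix of data; row and column labels can also be passed
DataFrame(dict of arrays, lists, or tuples)	Each sequence becomes one column; all sequences must have the same length
DataFrame(NumPy structured/record array)	Treated like a dict of arrays
DataFrame(dict of Series)	Each Series becomes one column; if no index is given explicitly, the Series indexes are merged to form the row index
DataFrame(dict of dicts)	Each inner dict becomes one column; the inner keys are merged to form the row index
DataFrame(list of dicts or Series)	Each item becomes one row; the union of the keys or indexes becomes the column labels
DataFrame(list of lists or tuples)	Treated like a 2-D ndarray
DataFrame(DataFrame)	The existing DataFrame's indexes are reused
DataFrame(NumPy MaskedArray)	Treated like a 2-D ndarray, except that masked values become NA/missing
df.reindex([x,y,...], fill_value=NaN, limit)	Returns a new object conformed to the new index; missing values are filled with fill_value, with limit as the maximum fill size
df.reindex([x,y,...], method=NaN)	Returns a new object conformed to the new index, filled according to method
df.reindex([x,y,...], columns=[x,y,...],copy=True)	Reindexes rows and columns at the same time; a new object is copied by default
df.drop(index, axis=0)	Drops the given labels from the given axis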
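A minimal sketch of a few rows of this table: two constructor inputs, reindex with a fill value or fill method, and drop. All names and values are hypothetical illustration data.

```python
import pandas as pd

# Hypothetical data, for illustration only.
by_column = pd.DataFrame({"Age": [18, 16]}, index=["a", "c"])  # dict of lists: one column per key
by_row = pd.DataFrame([{"x": 1}, {"x": 2, "y": 3}])            # list of dicts: one row per item

# reindex conforms the frame to a new index; "b" does not exist yet.
filled = by_column.reindex(["a", "b", "c"], fill_value=0)      # new row filled with 0
padded = by_column.reindex(["a", "b", "c"], method="ffill")    # new row copied from "a"

dropped = by_column.drop("a", axis=0)                          # returns a new frame without row "a"
```

Both reindex and drop return new objects and leave by_column unchanged. The method argument requires a monotonically ordered index, which holds for ["a", "c"] here.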

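1.3.2 Sorting functions	Description
df.sort_index(axis=0, ascending=True)	Sorts by the index along the given axis
df.sort_values(by=[a,b,...])	Sorts by the values of one or more columns (older pandas spelled this df.sort_index(by=[a,b,...]))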

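1.3.3 Summary statistics functions	Description
df.count()	Number of non-NaN values
df.describe()	Produces several summary statistics at once
df.min()	Minimum
df.max()	Maximum
df.idxmax(axis=0, skipna=True)	Returns a Series holding the index label of the maximum
df.idxmin(axis=0, skipna=True)	Returns a Series holding the index label of the minimum
df.quantile(axis=0)	Computes sample quantiles
df.sum(axis=0, skipna=True, level=None)	Returns a Series of sums
df.mean(axis=0, skipna=True, level=None)	Returns a Series of means
df.median(axis=0, skipna=True, level=None)	Returns a Series of arithmetic medians
df.mad(axis=0, skipna=True, level=None)	Returns a Series of mean absolute deviations from the mean
df.var(axis=0, skipna=True, level=None)	Returns a Series of variances
df.std(axis=0, skipna=True, level=None)	Returns a Series of standard deviations
df.skew(axis=0, skipna=True, level=None)	Returns the sample skewness (third moment)
df.kurt(axis=0, skipna=True, level=None)	Returns the sample kurtosis (fourth moment)
df.cumsum(axis=0, skipna=True)	Returns the cumulative sum
df.cummin(axis=0, skipna=True)	Returns the cumulative minimum
df.cummax(axis=0, skipna=True)	Returns the cumulative maximum
df.cumprod(axis=0, skipna=True)	Returns the cumulative product
df.diff(axis=0)	Returns the first difference
df.pct_change(axis=0)	Returns the fractional change between consecutive values

The level argument and df.mad() belong to older pandas releases; both were removed in pandas 2.0.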
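The axis argument used throughout this table can be sketched on a small frame. The data is hypothetical; the mean absolute deviation is computed by hand because df.mad() is no longer available in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"Age": [18, 16, 19], "Math": [60, 70, 100]},
                  index=["No.1", "No.2", "No.3"])

col_sum = df.sum(axis=0)          # one value per column
row_sum = df.sum(axis=1)          # one value per row
best = df.idxmax(axis=0)          # row label of the maximum in each column
running_max = df.cummax(axis=0)   # cumulative maximum down each column

# Mean absolute deviation from the mean, written out explicitly.
mad = (df - df.mean()).abs().mean()
```

With axis=0 the reduction runs down the rows and yields one result per column; axis=1 runs across the columns and yields one result per row.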


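1.3.4 Calculation functions	Description
df.add(df2, fill_value=None, axis=1)	Element-wise addition; fill_value stands in for an element that is missing on one side after alignment
df.sub(df2, fill_value=None, axis=1)	Element-wise subtraction; fill_value stands in for an element that is missing on one side after alignment
df.div(df2, fill_value=None, axis=1)	Element-wise division; fill_value stands in for an element that is missing on one side after alignment
df.mul(df2, fill_value=None, axis=1)	Element-wise multiplication; fill_value stands in for an element that is missing on one side after alignment
df.apply(f, axis=0)	Applies f to the 1-D array formed by each column (axis=0) or each row (axis=1)
df.applymap(f)	Applies f to every element (renamed to DataFrame.map in pandas 2.1)
df.cumsum(axis=0, skipna=True)	Cumulative sum; returns a DataFrame of running totals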
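The effect of fill_value only shows up when the two frames do not share all labels, so here is a minimal sketch with deliberately mismatched indexes. The frames are hypothetical illustration data.

```python
import numpy as np
import pandas as pd

# Hypothetical frames whose row labels only partly overlap.
df1 = pd.DataFrame({"Math": [2, 4]}, index=["No.1", "No.2"])
df2 = pd.DataFrame({"Math": [1, 2]}, index=["No.2", "No.3"])

plain = df1 + df2                     # labels present on one side only become NaN
filled = df1.add(df2, fill_value=0)   # the missing side is treated as 0 instead

# apply passes each column (axis=0) to the function as a Series.
col_range = df1.apply(lambda col: col.max() - col.min(), axis=0)
squared = df1.apply(np.square)        # a NumPy ufunc acts element-wise
```

If a cell is missing in both frames, the result stays NaN even when fill_value is given.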

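1.4 DataFrame indexing methods

Indexing method	Description
df[val]	Selects a single column or a group of columns
df.ix[val]	Selects a single row or a group of rows
df.ix[:,val]	Selects a single column or a subset of columns
df.ix[val1,val2]	Selects rows and columns at the same time
reindex method	Conforms one or more axes to a new index
xs method	Selects a single row or column by label and returns a Series
icol, irow methods	Select a single column or row by integer position and return a Series
get_value, set_value	Get or set a single value by row label and column label

ix, icol, irow, get_value and set_value come from older pandas releases and are no longer available in pandas 1.0 and later; loc and iloc replace ix, icol and irow, and at and iat replace get_value and set_value.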
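As a sketch of how the entries in this table look on a current pandas release, the removed accessors can be replaced by loc, iloc and at. The frame is hypothetical illustration data.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"Name": ["Tom", "Kim", "Andy"], "Age": [18, 16, 19]},
                  index=["No.1", "No.2", "No.3"])

cols = df[["Name", "Age"]]      # df[val]: one or more columns
row = df.loc["No.2"]            # row by label (formerly df.ix['No.2'])
first_col = df.iloc[:, 0]       # column by position (formerly icol)
same_row = df.xs("No.2")        # xs: one row by label, returned as a Series

cell = df.at["No.1", "Name"]    # single value by labels (formerly get_value)
df.at["No.1", "Name"] = "John"  # single value assignment (formerly set_value)
```

at works with labels; iat is the positional counterpart, so df.iat[0, 0] addresses the same cell as df.at["No.1", "Name"] in this frame.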

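Arithmetic: by default, arithmetic between a DataFrame and a Series matches the Series index against the DataFrame's columns and broadcasts the operation down the rows. If a label is found in only one of the two objects, both are reindexed to the union of the labels and the unmatched positions become NaN.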
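A minimal sketch of this broadcasting rule, using hypothetical data. The last line shows the explicit method form that matches on the row index instead.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

s = pd.Series({"a": 1, "b": 10})
by_columns = df - s      # s is matched to the columns and applied to every row

s2 = pd.Series({"a": 1, "c": 5})
union = df - s2          # result has columns a, b, c; b and c are all NaN

# To match on the row index instead, use the method form with axis=0.
by_rows = df.sub(pd.Series([1, 2]), axis=0)
```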

2. Examples of common DataFrame attributes

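# -*- coding: utf-8 -*-
"""
@author: 蔚蓝的天空Tom

A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on).
A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series that share the same index.
A column can be retrieved as a Series with dictionary-style access or with .columnname.
Rows can also be retrieved by position or by label.

Common DataFrame attributes
Attribute	Description
values	     The DataFrame's values
index	     Row index
index.name	Name of the row index
columns	     Column index
columns.name	Name of the column index
ix	          Selects rows (removed in pandas 1.0; use loc/iloc)
ix[[x,y,...], [x,y,...]]	Selects the listed rows, then the listed columns
T	Transpose of the frame
"""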

import pandas as pd
from pandas import DataFrame

if __name__=='__main__':
    data = {'Name':['Tom','Kim','Andy'],
            'Age':[18,16,19],
            'Height':[1.6,1.5,1.7]}
    ind = ['No.1', 'No.2', 'No.3']
    #columns is passed explicitly so the result matches the output below;
    #pandas 0.23+ would otherwise keep the dict's insertion order
    df = pd.DataFrame(data, index=ind, columns=['Age', 'Height', 'Name'])
#      Age  Height  Name
#No.1   18     1.6   Tom
#No.2   16     1.5   Kim
#No.3   19     1.7  Andy
    
    #Values of the DataFrame
    v = df.values #<class 'numpy.ndarray'>
#[[18 1.6 'Tom']
# [16 1.5 'Kim']
# [19 1.7 'Andy']]

    #Row index; when no custom index is given, the default integer index is returned
    ind = df.index #an Index object
#Index(['No.1', 'No.2', 'No.3'], dtype='object')

    #Name of the row index; None when it has not been set
    iname = df.index.name
#None
    #Name of the row index, set first and then read back
    df.index.name = 'StudentID'
    iname = df.index.name
#StudentID
    
    #Column index
    col = df.columns #an Index object
#Index(['Age', 'Height', 'Name'], dtype='object')

    #Name of the column index; None when it has not been set
    cname = df.columns.name
#None

    #Name of the column index, set first and then read back
    df.columns.name = 'StudentInfo'
    cname = df.columns.name
#StudentInfo

    #iloc selects rows by integer position. The original used the ix accessor,
    #which was removed in pandas 1.0.
    ret = df.iloc[0] #first row
#Age        18
#Height    1.6
#Name      Tom
#Name: No.1, dtype: object

    ret = df.iloc[1] #second row, <class 'pandas.core.series.Series'>
#Age        16
#Height    1.5
#Name      Kim
#Name: No.2, dtype: object
 
    ret = df.iloc[-1] #last row
#Age         19
#Height     1.7
#Name      Andy
#Name: No.3, dtype: object
    
    #iloc[[rowx, rowy,...]] selects several rows and returns a DataFrame
    ret = df.iloc[[0,2]]
#StudentInfo  Age  Height  Name
#StudentID                     
#No.1          18     1.6   Tom
#No.3          19     1.7  Andy
    
    #iloc[[rowx, rowy,...], [colx, coly, ...]] selects rows and columns together
    ret = df.iloc[[0,2], [0,1]]
#StudentInfo  Age  Height
#StudentID               
#No.1          18     1.6
#No.3          19     1.7

    #T: transpose of the frame (rows and columns swapped)
    print('Before transpose:\n', df)
#Before transpose:
#StudentInfo  Age  Height  Name
#StudentID                     
#No.1          18     1.6   Tom
#No.2          16     1.5   Kim
#No.3          19     1.7  Andy
    print('values before transpose:\n', df.values)
#values before transpose:
# [[18 1.6 'Tom']
# [16 1.5 'Kim']
# [19 1.7 'Andy']]
    
    dfT = df.T
    print('After transpose:\n', dfT)
#After transpose:
#StudentID   No.1 No.2  No.3
#StudentInfo                
#Age           18   16    19
#Height       1.6  1.5   1.7
#Name         Tom  Kim  Andy
    print('values after transpose:\n', dfT.values)
#values after transpose:
# [[18 16 19]
# [1.6 1.5 1.7]
# ['Tom' 'Kim' 'Andy']]

    print('index.name before transpose:\n', df.index.name)
#StudentID
    print('index.name after transpose:\n', dfT.index.name)
#StudentInfo

    print('columns.name before transpose:\n', df.columns.name)
#StudentInfo
    print('columns.name after transpose:\n', dfT.columns.name)
#StudentID

3. Common DataFrame functions: DataFrame()/reindex()/drop()

def DataFrame_manual():
    '''
    A DataFrame is a data structure similar to a database table, with both
    a row index and a column index.
    It can be viewed as a dict of Series that share the same index.
    Internally the data is stored as blocks of 2-D and 1-D arrays.
    '''
    import pandas as pd
    from pandas import DataFrame
    
    #1. Creating a DataFrame
    #1.1 From a dict of equal-length lists or NumPy arrays
    #Build a dict of equal-length lists
    data = {'Name':['Tom', 'Kim', 'Andy'],
            'Age':[18, 16, 19],
            'Height':[1.6, 1.5, 1.7]}
    #Build the DataFrame with the default index [0,1,2,....]
    #The outputs in this section come from an older pandas that sorted dict keys
    #alphabetically; pandas 0.23+ keeps insertion order (Name, Age, Height)
    df = pd.DataFrame(data) #default index, default column order
#       Age  Height  Name
#    0   18     1.6   Tom
#    1   16     1.5   Kim
#    2   19     1.7  Andy
    #Specify the column order
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Height'])
#       Name  Age  Height
#    0   Tom   18     1.6
#    1   Kim   16     1.5
#    2  Andy   19     1.7

    #Specify the DataFrame's index
    df = pd.DataFrame(data, index=['1st', '2nd', '3th'])
#         Age  Height  Name
#    1st   18     1.6   Tom
#    2nd   16     1.5   Kim
#    3th   19     1.7  Andy

    #1.2 From a nested dict
    #The outer dict keys become the column names and the inner dict keys become the row labels
    #In the older pandas used here, the resulting DataFrame is sorted by row index
    data = {'Name':  {'1st':'Tom', '2nd':'Kim', '3th':'Andy'}, 
            'Age':   {'1st':18,    '2nd':16,    '3th':19}, 
            'Height':{'1st':1.6,   '2nd':1.5,   '3th':1.7}}
    df = pd.DataFrame(data) #row labels from the nested dict, default column order
#         Age  Height  Name
#    1st   18     1.6   Tom
#    2nd   16     1.5   Kim
#    3th   19     1.7  Andy
    df = pd.DataFrame(data, ['3th', '2nd', '1st']) #second positional argument is index: specify the row order
#         Age  Height  Name
#    3th   19     1.7  Andy
#    2nd   16     1.5   Kim
#    1st   18     1.6   Tom

    #2. Accessing a DataFrame
    #Retrieving a single column from a DataFrame returns a Series
    #2.1 Dictionary-style access
    data = {'Name':['Tom', 'Kim', 'Andy'],
            'Age':[18, 16, 19],
            'Height':[1.6, 1.5, 1.7]}
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Height'], index=['1st', '2nd', '3th'])
#         Name  Age  Height
#    1st   Tom   18     1.6
#    2nd   Kim   16     1.5
#    3th  Andy   19     1.7
    s = df['Name']
#    1st     Tom
#    2nd     Kim
#    3th    Andy
#    Name: Name, dtype: object

    #2.2 Getting rows with loc (labels) and iloc (positions).
    #The original used the ix accessor, which was removed in pandas 1.0.
    data = {'Name':['Tom', 'Kim', 'Andy'],
            'Age':[18, 16, 19],
            'Height':[1.6, 1.5, 1.7]}
    df = pd.DataFrame(data, 
                      columns=['Name', 'Age', 'Height'], 
                      index=['1st', '2nd', '3th'])
    s = df.loc['1st'] #single row, selected by row label
#    Name      Tom
#    Age        18
#    Height    1.6
#    Name: 1st, dtype: object
    s = df.iloc[0] #single row, selected by integer position
#    Name      Tom
#    Age        18
#    Height    1.6
#    Name: 1st, dtype: object
    s = df.loc[['3th', '2nd']] #several rows
#         Name  Age  Height
#    3th  Andy   19     1.7
#    2nd   Kim   16     1.5
    s = df.iloc[0:3] #rows selected by a range of integer positions
#          Name  Age  Height
#    1st   Tom   18     1.6
#    2nd   Kim   16     1.5
#    3th  Andy   19     1.7

    #2.3 Getting the value at a given column and row (column first)
    ret = df['Name']['1st']      #Tom
    ret = df['Name'].iloc[0]     #Tom
    ret = df['Age']['1st']       #18
    ret = df['Age'].iloc[0]      #18
    ret = df['Height']['1st']    #1.6
    ret = df['Height'].iloc[0]   #1.6
    
    #2.4 Getting the value at a given row and column (row first)
    ret = df.loc['1st']['Name']  #Tom
    ret = df.iloc[0]['Name']     #Tom
    ret = df.loc['1st']['Age']   #18
    ret = df.iloc[0]['Age']      #18
    ret = df.loc['1st']['Height']#1.6
    ret = df.iloc[0]['Height']   #1.6

    #3. Modifying a DataFrame
    #3.1 Adding a column
    data = {'Name':['Tom', 'Kim', 'Andy'],
            'Age':[18, 16, 19],
            'Height':[1.6, 1.5, 1.7]}
    df = pd.DataFrame(data, 
                      columns=['Name', 'Age', 'Height'], 
                      index=['1st', '2nd', '3th'])
    df['Grade'] = 9 #add a 'Grade' column with the same value, 9, in every row
#         Name  Age  Height  Grade
#    1st   Tom   18     1.6      9
#    2nd   Kim   16     1.5      9
#    3th  Andy   19     1.7      9

    #3.2 Changing the values of a column
    df['Grade'] = [6,7,7]
#         Name  Age  Height  Grade
#    1st   Tom   18     1.6      6
#    2nd   Kim   16     1.5      7
#    3th  Andy   19     1.7      7

    #3.3 Flagging whether Grade is 7
    s = pd.Series([False, True, True], index=['1st', '2nd', '3th'])
    df['HighGrade'] = s #add a 'HighGrade' column by assigning a Series
#         Name  Age  Height  Grade HighGrade
#    1st   Tom   18     1.6      6     False
#    2nd   Kim   16     1.5      7      True
#    3th  Andy   19     1.7      7      True

    #4. Naming the rows and columns of a DataFrame
    data = {'Name':['Tom', 'Kim', 'Andy'],
            'Age':[18, 16, 19],
            'Height':[1.6, 1.5, 1.7]}
    df = pd.DataFrame(data, 
                      columns=['Name', 'Age', 'Height'], 
                      index=['1st', '2nd', '3th'])
    df.columns.name = 'Students'
    df.index.name = 'ID'
#    Students  Name  Age  Height
#    ID                         
#    1st        Tom   18     1.6
#    2nd        Kim   16     1.5
#    3th       Andy   19     1.7
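One detail behind the column assignment in step 3.3 is worth a short sketch: when a Series is assigned as a column, pandas aligns it on the row labels rather than on position. The frame below is hypothetical illustration data, and the 'Flag' column is an invented name.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"Grade": [6, 7, 7]}, index=["1st", "2nd", "3th"])

# A boolean column can be derived from a comparison instead of typed by hand.
df["HighGrade"] = df["Grade"] >= 7

# A Series that covers only some labels is aligned by label;
# rows without a matching label receive NaN.
partial = pd.Series([True], index=["2nd"])
df["Flag"] = partial
```

Assigning a plain list instead of a Series skips alignment, so the list must have exactly one element per row.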

4. DataFrame sorting functions

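def DataFrame_Sort():
    import pandas as pd
    data = {'Name':  {'No.1':'Tom', 'No.2':'Kim', 'No.3':'Andy'}, 
            'Age':   {'No.1':18,    'No.2':16,    'No.3':19}, 
            'Height':{'No.1':1.6,   'No.2':1.5,   'No.3':1.7}}
    #columns is passed explicitly so the result matches the output below
    df = pd.DataFrame(data, columns=['Age', 'Height', 'Name'])
    df.index.name = 'ID'
    df.columns.name = 'StudentInfo'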
#StudentInfo  Age  Height  Name
#ID                            
#No.1          18     1.6   Tom
#No.2          16     1.5   Kim
#No.3          19     1.7  Andy
    
    #Sort by row index, ascending
    ret = df.sort_index(ascending=True)
#StudentInfo  Age  Height  Name
#ID                            
#No.1          18     1.6   Tom
#No.2          16     1.5   Kim
#No.3          19     1.7  Andy

    #Sort by row index, descending
    ret = df.sort_index(ascending=False)
#StudentInfo  Age  Height  Name
#ID                            
#No.3          19     1.7  Andy
#No.2          16     1.5   Kim
#No.1          18     1.6   Tom

    #Sort by the values of a given column, ascending
    ret = df.sort_values(by='Age', ascending=True) #sort by the Age column in ascending order
#StudentInfo  Age  Height  Name
#ID                            
#No.2          16     1.5   Kim
#No.1          18     1.6   Tom
#No.3          19     1.7  Andy
    
    #Sort by the values of a given column, descending
    ret = df.sort_values(by='Age', ascending=False)
#StudentInfo  Age  Height  Name
#ID                            
#No.3          19     1.7  Andy
#No.1          18     1.6   Tom
#No.2          16     1.5   Kim
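As a small extension of the calls above, sort_values also accepts a list of columns with one direction per column, and sort_index can order the columns instead of the rows. The data is hypothetical and chosen so that two rows tie on Age.

```python
import pandas as pd

# Hypothetical sample data with a tie on Age, used only for illustration.
df = pd.DataFrame({"Age": [18, 16, 18], "Height": [1.6, 1.5, 1.7]},
                  index=["No.1", "No.2", "No.3"])

# Age ascending; ties on Age are broken by Height descending.
by_two = df.sort_values(by=["Age", "Height"], ascending=[True, False])

# axis=1 sorts the column labels rather than the row labels.
by_cols = df.sort_index(axis=1, ascending=False)
```

Both calls return new frames; the original df keeps its order unless inplace=True is passed.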

5. DataFrame summary statistics functions

# -*- coding: utf-8 -*-
"""
@author: 蔚蓝的天空Tom
Aim: summary statistics functions of DataFrame

df.count()	Number of non-NaN values
df.describe()	Produces several summary statistics at once
df.min() Minimum
df.max() Maximum
df.idxmax(axis=0, skipna=True) Returns a Series holding the index label of the maximum
df.idxmin(axis=0, skipna=True) Returns a Series holding the index label of the minimum
df.quantile(axis=0)	Computes sample quantiles
df.sum(axis=0, skipna=True) Returns a Series of sums
df.mean(axis=0, skipna=True) Returns a Series of means
df.median(axis=0, skipna=True) Returns a Series of arithmetic medians
df.mad(axis=0, skipna=True) Returns a Series of mean absolute deviations from the mean (removed in pandas 2.0)
df.var(axis=0, skipna=True) Returns a Series of variances
df.std(axis=0, skipna=True) Returns a Series of standard deviations
df.skew(axis=0, skipna=True) Returns the sample skewness (third moment)
df.kurt(axis=0, skipna=True) Returns the sample kurtosis (fourth moment)
df.cumsum(axis=0, skipna=True) Returns the cumulative sum
df.cummin(axis=0, skipna=True) Returns the cumulative minimum
df.cummax(axis=0, skipna=True) Returns the cumulative maximum
df.cumprod(axis=0, skipna=True) Returns the cumulative product
df.diff(axis=0) Returns the first difference
df.pct_change(axis=0) Returns the fractional change between consecutive values
"""

import pandas as pd
from pandas import DataFrame

if __name__=='__main__':
    data = {'Name':['Tom', 'Kim', 'Andy'],
            'Age':[18, 16, 19],
            'Height':[1.6, 1.5, 1.7]}
    ind = ['No.1', 'No.2', 'No.3']
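    #columns is passed explicitly so the result matches the output below
    df = pd.DataFrame(data, index=ind, columns=['Age', 'Height', 'Name'])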
    df.index.name = 'ID'
    df.columns.name = 'StudentInfo'
#StudentInfo  Age  Height  Name
#ID                            
#No.1          18     1.6   Tom
#No.2          16     1.5   Kim
#No.3          19     1.7  Andy

    #df.count(): number of non-NaN values
    cnt = df.count()
#StudentInfo
#Age       3
#Height    3
#Name      3
#dtype: int64

    #df.describe(): several summary statistics at once (count, mean, std, min, max, and so on)
    ret = df.describe() #<class 'pandas.core.frame.DataFrame'> 
#StudentInfo        Age  Height
#count         3.000000    3.00
#mean         17.666667    1.60
#std           1.527525    0.10
#min          16.000000    1.50
#25%          17.000000    1.55
#50%          18.000000    1.60
#75%          18.500000    1.65
#max          19.000000    1.70

    #df.min(): the minimum of each column
    ret = df.min()
#StudentInfo
#Age         16
#Height     1.5
#Name      Andy
#dtype: object
    
    #df.max(): the maximum of each column
    ret = df.max()
#StudentInfo
#Age        19
#Height    1.7
#Name      Tom
#dtype: object
    
    #Build an all-numeric frame for the remaining examples
    data = {'Age':[18,16,19],
            'Height':[1.6, 1.5, 1.7],
            'Math':[60, 70, 100],
            'English':[98, 68, 69],
            'Chinese':[50, 99, 70]}
    ind = ['No.1', 'No.2', 'No.3']
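    #columns is passed explicitly so that the column order, and therefore the
    #axis=1 results below, match the outputs shown
    df = pd.DataFrame(data, index=ind,
                      columns=['Age', 'Chinese', 'English', 'Height', 'Math'])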
    df.index.name = 'ID'
    df.columns.name = 'Student'
#Student  Age  Chinese  English  Height  Math
#ID                                          
#No.1      18       50       98     1.6    60
#No.2      16       99       68     1.5    70
#No.3      19       70       69     1.7   100

    #df.idxmax(axis=0, skipna=True): row label of the maximum in each column
    ret = df.idxmax(axis = 0) #<class 'pandas.core.series.Series'>
#Student
#Age        No.3
#Chinese    No.2
#English    No.1
#Height     No.3
#Math       No.3
#dtype: object
    
    #Column name of the maximum in each row
    ret = df.idxmax(axis = 1) #<class 'pandas.core.series.Series'>
#ID
#No.1    English
#No.2    Chinese
#No.3    Math
#dtype: object
    
    #df.quantile(axis=0): sample quantiles (median, quartiles, and so on)
    ret = df.quantile(axis = 0) #default q=0.5, the median of each column
#Student
#Age        18.0
#Chinese    70.0
#English    69.0
#Height      1.6
#Math       70.0
#dtype: float64
    
    #df.sum(axis=0, skipna=True): returns a Series of sums
    ret = df.sum(axis=0) #sum of each column
#Student
#Age         53.0
#Chinese    219.0
#English    235.0
#Height       4.8
#Math       230.0
#dtype: float64

    ret = df.sum(axis=1) #sum of each row; not meaningful for this sample
#ID
#No.1    227.6
#No.2    254.5
#No.3    259.7
#dtype: float64

    #df.mean(axis=0, skipna=True): returns a Series of means
    ret = df.mean(axis=0) #mean of each column
#Student
#Age        17.666667
#Chinese    73.000000
#English    78.333333
#Height      1.600000
#Math       76.666667
#dtype: float64

    ret = df.mean(axis=1) #mean of each row; not meaningful for this sample
#ID
#No.1    45.52
#No.2    50.90
#No.3    51.94
#dtype: float64

    #df.median(axis=0, skipna=True): returns a Series of arithmetic medians
    ret = df.median(axis=0) #median of each column
#Student
#Age        18.0
#Chinese    70.0
#English    69.0
#Height      1.6
#Math       70.0
#dtype: float64
    ret = df.median(axis=1) #median of each row
#ID
#No.1    50.0
#No.2    68.0
#No.3    69.0
#dtype: float64

    #Mean absolute deviation: the mean of |value - mean|.
    #df.mad() was removed in pandas 2.0, so it is computed by hand here.
#Student  Age  Chinese  English  Height  Math
#ID                                          
#No.1      18       50       98     1.6    60
#No.2      16       99       68     1.5    70
#No.3      19       70       69     1.7   100
    ret = (df - df.mean(axis=0)).abs().mean(axis=0) #column by column
#Student
#Age         1.111111
#Chinese    17.333333
#English    13.111111
#Height      0.066667
#Math       15.555556
#dtype: float64
    ret = df.sub(df.mean(axis=1), axis=0).abs().mean(axis=1) #row by row
#ID
#No.1    28.576
#No.2    33.720
#No.3    33.272
#dtype: float64

    #df.var(axis=0, skipna=True): returns a Series of variances
    ret = df.var(axis=0) #variance of each column
#Student
#Age          2.333333
#Chinese    607.000000
#English    290.333333
#Height       0.010000
#Math       433.333333
#dtype: float64
    ret = df.var(axis=1) #variance of each row
#ID
#No.1    1417.552
#No.2    1657.300
#No.3    1634.018
#dtype: float64
    
    #df.std(axis=0, skipna=True): returns a Series of standard deviations
    ret = df.std(axis=0) #standard deviation of each column
#Student
#Age         1.527525
#Chinese    24.637370
#English    17.039171
#Height      0.100000
#Math       20.816660
#dtype: float64
    ret = df.std(axis=1) #standard deviation of each row
#ID
#No.1    37.650392
#No.2    40.709950
#No.3    40.422989
#dtype: float64

    #df.skew(axis=0, skipna=True): sample skewness (third moment)
    ret = df.skew(axis=0) #skewness of each column
#Student
#Age       -0.935220
#Chinese    0.539824
#English    1.725342
#Height     0.000000
#Math       1.293343
#dtype: float64
    ret = df.skew(axis=1) #skewness of each row
#ID
#No.1    0.328682
#No.2   -0.245853
#No.3   -0.256661
#dtype: float64
 
   
    #df.kurt(axis=0, skipna=True): sample kurtosis (fourth moment)
    ret = df.kurt(axis=0) #kurtosis of each column; NaN here because kurtosis needs at least 4 values
#Student
#Age       NaN
#Chinese   NaN
#English   NaN
#Height    NaN
#Math      NaN
#dtype: float64

    ret = df.kurt(axis=1) #kurtosis of each row
#ID
#No.1   -0.582437
#No.2   -2.079006
#No.3   -1.879115
#dtype: float64
    
    #df.cumsum(axis=0, skipna=True): cumulative sum
    ret = df.cumsum(axis=0) #cumulative sum down each column
#Student   Age  Chinese  English  Height   Math
#ID                                            
#No.1     18.0     50.0     98.0     1.6   60.0
#No.2     34.0    149.0    166.0     3.1  130.0
#No.3     53.0    219.0    235.0     4.8  230.0
    ret = df.cumsum(axis=1) #cumulative sum across each row
#Student   Age  Chinese  English  Height   Math
#ID                                            
#No.1     18.0     68.0    166.0   167.6  227.6
#No.2     16.0    115.0    183.0   184.5  254.5
#No.3     19.0     89.0    158.0   159.7  259.7
    
    #df.cummin(axis=0, skipna=True): cumulative minimum
    ret = df.cummin(axis=0) #cumulative minimum down each column
#Student   Age  Chinese  English  Height  Math
#ID                                           
#No.1     18.0     50.0     98.0     1.6  60.0
#No.2     16.0     50.0     68.0     1.5  60.0
#No.3     16.0     50.0     68.0     1.5  60.0
    ret = df.cummin(axis=1) #cumulative minimum across each row
#Student   Age  Chinese  English  Height  Math
#ID                                           
#No.1     18.0     18.0     18.0     1.6   1.6
#No.2     16.0     16.0     16.0     1.5   1.5
#No.3     19.0     19.0     19.0     1.7   1.7

    #df.cummax(axis=0, skipna=True): cumulative maximum
    ret = df.cummax(axis=0) #cumulative maximum down each column
#Student   Age  Chinese  English  Height   Math
#ID                                            
#No.1     18.0     50.0     98.0     1.6   60.0
#No.2     18.0     99.0     98.0     1.6   70.0
#No.3     19.0     99.0     98.0     1.7  100.0
    ret = df.cummax(axis=1) #cumulative maximum across each row
#Student   Age  Chinese  English  Height   Math
#ID                                            
#No.1     18.0     50.0     98.0    98.0   98.0
#No.2     16.0     99.0     99.0    99.0   99.0
#No.3     19.0     70.0     70.0    70.0  100.0
    
    #df.cumprod(axis=0, skipna=True): cumulative product
    ret = df.cumprod(axis=0) #cumulative product down each column
#Student     Age   Chinese   English  Height      Math
#ID                                                   
#No.1       18.0      50.0      98.0    1.60      60.0
#No.2      288.0    4950.0    6664.0    2.40    4200.0
#No.3     5472.0  346500.0  459816.0    4.08  420000.0
    ret = df.cumprod(axis=1) #cumulative product across each row
#Student   Age  Chinese   English    Height        Math
#ID                                                    
#No.1     18.0    900.0   88200.0  141120.0   8467200.0
#No.2     16.0   1584.0  107712.0  161568.0  11309760.0
#No.3     19.0   1330.0   91770.0  156009.0  15600900.0

    #df.diff(axis=0): first difference
    ret = df.diff(axis=0) #first difference down each column
#Student  Age  Chinese  English  Height  Math
#ID                                          
#No.1     NaN      NaN      NaN     NaN   NaN
#No.2    -2.0     49.0    -30.0    -0.1  10.0
#No.3     3.0    -29.0      1.0     0.2  30.0
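    ret = df.diff(axis=1) #first difference across each row
#<class 'pandas.core.frame.DataFrame'>
#Each cell is the value minus the value in the column to its left. Older pandas
#releases differenced int and float columns separately and printed NaN under Height.
#Student  Age  Chinese  English  Height  Math
#ID                                          
#No.1     NaN     32.0     48.0   -96.4  58.4
#No.2     NaN     83.0    -31.0   -66.5  68.5
#No.3     NaN     51.0     -1.0   -67.3  98.3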
    
    #df.pct_change(axis=0): fractional change between consecutive values
    ret =df.pct_change(axis=0) #fractional change down each column
#Student       Age   Chinese   English    Height      Math
#ID                                                       
#No.1          NaN       NaN       NaN       NaN       NaN
#No.2    -0.111111  0.980000 -0.306122 -0.062500  0.166667
#No.3     0.187500 -0.292929  0.014706  0.133333  0.428571
    ret = df.pct_change(axis=1) #fractional change across each row
#Student  Age   Chinese   English    Height       Math
#ID                                                   
#No.1     NaN  1.777778  0.960000 -0.983673  36.500000
#No.2     NaN  5.187500 -0.313131 -0.977941  45.666667
#No.3     NaN  2.684211 -0.014286 -0.975362  57.823529

6. DataFrame calculation functions

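# -*- coding: utf-8 -*-
"""
@author: 蔚蓝的天空Tom
Aim: examples of DataFrame calculation functions
df.add(df2, fill_value=None, axis=1) Element-wise addition; fill_value stands in for an element missing on one side after alignment
df.sub(df2, fill_value=None, axis=1) Element-wise subtraction; fill_value stands in for an element missing on one side after alignment
df.div(df2, fill_value=None, axis=1) Element-wise division; fill_value stands in for an element missing on one side after alignment
df.mul(df2, fill_value=None, axis=1) Element-wise multiplication; fill_value stands in for an element missing on one side after alignment
df.apply(f, axis=0)	                 Applies f to the 1-D array formed by each column or each row
df.applymap(f)	                      Applies f to every element
df.cumsum(axis=0, skipna=True)	  Cumulative sum; returns a DataFrame of running totals
"""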

import pandas as pd
from pandas import DataFrame

if __name__=='__main__':
    data = {'Math':[2, 4, 6],
            'English':[4, 8, 12]}
    ind = ['No.1', 'No.2', 'No.3']
    #columns is passed explicitly so the result matches the output below
    df1 = pd.DataFrame(data, index=ind, columns=['English', 'Math'])
    df1.index.name = 'ID'
    df1.columns.name = 'Student'
#Student  English  Math
#ID                    
#No.1           4     2
#No.2           8     4
#No.3          12     6

    data = {'Math':[1,2,3],
            'English':[2,4,6]}
    ind = ['No.1', 'No.2', 'No.3']
    df2 = pd.DataFrame(data, index=ind, columns=['English', 'Math'])
    df2.index.name = 'ID'
    df2.columns.name = 'Student'
#Student  English  Math
#ID                    
#No.1           2     1
#No.2           4     2
#No.3           6     3

    #df.add(df2, fill_value=None, axis=1): element-wise addition
    ret = df1.add(df2) #add corresponding elements
#Student  English  Math
#ID                    
#No.1           6     3
#No.2          12     6
#No.3          18     9

    #df.sub(df2, fill_value=None, axis=1): element-wise subtraction
    ret = df1.sub(df2) #subtract corresponding elements
#Student  English  Math
#ID                    
#No.1           2     1
#No.2           4     2
#No.3           6     3

    #df.div(df2, fill_value=None, axis=1): element-wise division
    ret = df1.div(df2) #divide corresponding elements
#Student  English  Math
#ID                    
#No.1         2.0   2.0
#No.2         2.0   2.0
#No.3         2.0   2.0
    
    #df.mul(df2, fill_value=None, axis=1): element-wise multiplication
    ret = df1.mul(df2) #multiply corresponding elements
#Student  English  Math
#ID                    
#No.1           8     2
#No.2          32     8
#No.3          72    18

    #df.apply(f, axis=0): applies f to the 1-D array formed by each column (or each row with axis=1)
#Student  English  Math
#ID                    
#No.1           4     2
#No.2           8     4
#No.3          12     6
    import numpy as np
    ret = df1.apply(np.square) #np.square squares every element (it does not take the square root)
#Student  English  Math
#ID                    
#No.1          16     4
#No.2          64    16
#No.3         144    36

    #df.applymap(f): applies f to every element (renamed to df.map(f) in pandas 2.1)
    ret = df1.applymap(np.square)
#Student  English  Math
#ID                    
#No.1          16     4
#No.2          64    16
#No.3         144    36

    #df.cumsum(axis=0, skipna=True): cumulative sum, returned as a DataFrame
#Student  English  Math
#ID                    
#No.1           4     2
#No.2           8     4
#No.3          12     6
    ret = df1.cumsum(axis=0) #running total down each column
#Student  English  Math
#ID                    
#No.1           4     2
#No.2          12     6
#No.3          24    12
    ret = df1.cumsum(axis=1) #running total across each row
#Student  English  Math
#ID                    
#No.1           4     6
#No.2           8    12
#No.3          12    18

7. Examples of common DataFrame indexing methods

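# -*- coding: utf-8 -*-
"""
@author: 蔚蓝的天空Tom
Aim: examples of DataFrame indexing methods: df[], df.ix[], df.reindex(), df.xs(), df.icol() and so on
Indexing method	Description
df[val]	Selects a single column or a group of columns
df.ix[val]	Selects a single row or a group of rows
df.ix[:,val]	Selects a single column or a subset of columns
df.ix[val1,val2]	Selects rows and columns at the same time
reindex method	Conforms one or more axes to a new index
xs method	Selects a single row or column by label and returns a Series
icol, irow methods	Select a single column or row by integer position and return a Series
get_value, set_value	Get or set a single value by row label and column label
"""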

import pandas as pd
from pandas import DataFrame

if __name__=='__main__':
    data = {'Name':['Tom', 'Kim', 'Andy'],
            'Age':[18, 16, 19],
            'Math':[95, 98, 96]}
    ind = ['No.1', 'No.2', 'No.3']
    df = pd.DataFrame(data, index=ind, columns=['Name', 'Age', 'Math'])
    df.index.name = 'ID'
    df.columns.name = 'Student'
#Student  Name  Age  Math
#ID                      
#No.1      Tom   18    95
#No.2      Kim   16    98
#No.3     Andy   19    96
    
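    #Select columns by position. df[[0]] is a label lookup in current pandas
    #and raises KeyError for this frame, so iloc is used instead.
    ret = df.iloc[:, [0]] #first column of df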
#Student  Name
#ID           
#No.1      Tom
#No.2      Kim
#No.3     Andy

    ret = df.iloc[:, [-1]] #last column of df
#Student  Math
#ID           
#No.1       95
#No.2       98
#No.3       96

    ret = df.iloc[:, [-1, 0]] #last column and first column of df
#Student  Math  Name
#ID                 
#No.1       95   Tom
#No.2       98   Kim
#No.3       96  Andy

    #df.iloc[val] selects a single row or a group of rows by position
    #(ix was removed in pandas 1.0)
    ret = df.iloc[[0]] #first row of df
#Student Name  Age  Math
#ID                     
#No.1     Tom   18    95

    ret = df.iloc[[-1]] #last row of df
#Student  Name  Age  Math
#ID                      
#No.3     Andy   19    96
    
    ret = df.iloc[[-1,0]] #last row and first row of df
#Student  Name  Age  Math
#ID                      
#No.3     Andy   19    96
#No.1      Tom   18    95

    #df.iloc[rows, cols] selects a single column or a subset of a column
    
    ret = df.iloc[0:2, [0]] #rows at positions 0 and 1 of the first column
#Student Name
#ID          
#No.1     Tom
#No.2     Kim
    
    ret = df.iloc[:-1, [0]] #first column without its last element
#Student Name
#ID          
#No.1     Tom
#No.2     Kim

    
    #df.iloc[val1, val2] selects rows and columns at the same time
    ret = df.iloc[[0], [0]] #element at first row, first column
#Student Name 
#ID          
#No.1     Tom
    ret = df.iloc[[0], [1]] #element at first row, second column
#Student  Age
#ID          
#No.1      18
    ret = df.iloc[[1], [0]] #element at second row, first column
#Student Name
#ID          
#No.2     Kim

df.reindex()+df.xs()+df.iloc[] + df.get_value() + df.get_values() + df.set_value()

import pandas as pd
from pandas import DataFrame

if __name__=='__main__':
    data = {'Name':['Tom', 'Kim', 'Andy'],
            'Age':[18, 16, 19],
            'Height':[1.7, 1.5, 1.6]}
    ind = ['No.1', 'No.2', 'No.3']
    df = pd.DataFrame(data, index=ind, columns=['Name', 'Age', 'Height'])
    df.index.name = 'ID'
    df.columns.name = 'Student'
#Student  Name  Age  Height
#ID                        
#No.1      Tom   18     1.7
#No.2      Kim   16     1.5
#No.3     Andy   19     1.6

    #reindex: conforms one or more axes to a new index
    ret = df.reindex(index=['No.3', 'No.2', 'No.1']) #rows in the given label order
#Student  Name  Age  Height
#ID                        
#No.3     Andy   19     1.6
#No.2      Kim   16     1.5
#No.1      Tom   18     1.7
    
    ret = df.reindex(index=['No.3', 'No.2', 'No.1'], columns=['Name', 'Age'])
#Student  Name  Age
#ID                
#No.3     Andy   19
#No.2      Kim   16
#No.1      Tom   18
    
    ret = df.reindex(index=['No.1'], columns=['Name', 'Age'])
#Student Name  Age
#ID               
#No.1     Tom   18

    ret = df.reindex(index=['No.1'], columns=['Name'])
#Student Name
#ID          
#No.1     Tom

    #xs: selects a single row or column by label and returns a Series
    ret = df.xs(key='No.1', axis=0) #row No.1 selected by key; axis=0 is the default
#Student
#Name      Tom
#Age        18
#Height    1.7
#Name: No.1, dtype: object
    ret = df.xs(key='Name', axis=1) #column Name selected by key; axis=1 is required for columns
#ID
#No.1     Tom
#No.2     Kim
#No.3    Andy
#Name: Name, dtype: object
    ret = df.xs(key='Age', axis=1) #column Age selected by key; axis=1 is required for columns
#ID
#No.1    18
#No.2    16
#No.3    19
#Name: Age, dtype: int64

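    #icol and irow have been removed; iloc selects a column or row by integer position and returns a Series
    ret = df.iloc[:,0] #first element of every row, that is, the first column of df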
#ID
#No.1     Tom
#No.2     Kim
#No.3    Andy
#Name: Name, dtype: object
    ret = df.iloc[:,-1] #last element of every row, that is, the last column of df
#ID
#No.1    1.7
#No.2    1.5
#No.3    1.6
#Name: Height, dtype: float64

    ret = df.iloc[:-1, 0]
#ID
#No.1    Tom
#No.2    Kim
#Name: Name, dtype: object
    ret = df.iloc[:1,0] #<class 'pandas.core.series.Series'>
#ID
#No.1    Tom
#Name: Name, dtype: object
    ret = df.iloc[0, 0] #<class 'str'>
#Tom    
    
    #df.irow() is no longer available; iloc selects a row by position
    ret = df.iloc[0] #first row
#Student
#Name      Tom
#Age        18
#Height    1.7
#Name: No.1, dtype: object
    ret = df.iloc[-1] #last row
#Student
#Name      Andy
#Age         19
#Height     1.6
#Name: No.3, dtype: object

    #get_value and set_value were removed in pandas 1.0; df.at reads and
    #writes a single value by row label and column label
    ret = df.at['No.1', 'Name']
#Tom
    ret = df.at['No.1', 'Age']
#18
    ret = df.to_numpy() #replaces the removed get_values()
#[['Tom' 18 1.7]
# ['Kim' 16 1.5]
# ['Andy' 19 1.6]]
    
    #df.at[index, col] = value sets the element at [index, col] to value
    df.at['No.1', 'Name'] = 'John'
    print(df.to_numpy())
#[['John' 18 1.7]
# ['Kim' 16 1.5]
# ['Andy' 19 1.6]]

