我们知道Pandas中最常用的两种数据类型就是Series 和DataFrame,因此主要学习者两种数据类型的缺失数据的处理方法

拿到一份数据,首先查看数据完整性,了解缺失信息;
然后针对缺失的数据,是要进行填充?还是删除?还是插值?选择合适的方式对拿到的数据进行数据预处理。

1. 了解缺失信息:主要就是用isna和notna方法,对于isna方法,如果为缺失值,则返回True,如果不为缺失值,则返回False;notna方法则反之

import numpy as np
import pandas as pd
#首选读取一份数据
df = pd.read_csv('E:/data/table_missing.csv')
#这样子读出来的数据结构是DataFrame
df.head()



School

Class

ID

Gender

Address

Height

Weight

Math

Physics

0

S_1

C_1

NaN

M

street_1

173

NaN

34.0

A+

1

S_1

C_1

NaN

F

street_2

192

NaN

32.5

B+

2

S_1

C_1

1103.0

M

street_2

186

NaN

87.2

B+

3

S_1

NaN

NaN

F

street_2

167

81.0

80.4

NaN

4

S_1

C_1

1105.0

NaN

street_4

159

64.0

84.8

A-

#然后通过isna方法或者notna方法来查看缺失值
df['Physics'].isna().head()
0    False
1    False
2    False
3     True
4    False
Name: Physics, dtype: bool

有时候对于一份数据,我们并不关心某一个缺失值,而是关心某个需要分析的属性值的数据完整性。如果该属性数据缺失较少,
可以考虑用合适的插值方法来对数据进行拟合(模拟);如果数据缺失较多,或许该数据数据就不适合应用到我们的分析当中。
可以利用如下方法来了解拿到手的数据的一个完整性情况:

#计算每一列有多少个缺失值
df.isna().sum()
School      0
Class       4
ID          6
Gender      7
Address     0
Height      0
Weight     13
Math        5
Physics     4
dtype: int64
#也可以通过info函数来查看每一列的数据缺失情况
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    31 non-null     object 
 2   ID       29 non-null     float64
 3   Gender   28 non-null     object 
 4   Address  35 non-null     object 
 5   Height   35 non-null     int64  
 6   Weight   22 non-null     float64
 7   Math     30 non-null     float64
 8   Physics  31 non-null     object 
dtypes: float64(3), int64(1), object(5)
memory usage: 2.6+ KB

2.处理缺失值

2.1 填充

主要用到的方法是:DataFrame.fillna(value = None,method = None,axis = None, inplace = False,limit=None, downcast=None, **kwargs)
其中,
value是要用去填充缺失值的值;

# 1.直接用missing去填充所有的缺失值
df.fillna('missing').head()



School

Class

ID

Gender

Address

Height

Weight

Math

Physics

0

S_1

C_1

1103.0

M

street_1

173

81.0

34.0

A+

1

S_1

C_1

1103.0

F

street_2

192

81.0

32.5

B+

2

S_1

C_1

1103.0

M

street_2

186

81.0

87.2

B+

3

S_1

C_1

1105.0

F

street_2

167

81.0

80.4

A-

4

S_1

C_1

1105.0

M

street_4

159

64.0

84.8

A-

# 2.用缺失值上面的值替换缺失值
df.fillna(method= 'ffill').head()



School

Class

ID

Gender

Address

Height

Weight

Math

Physics

0

S_1

C_1

NaN

M

street_1

173

NaN

34.0

A+

1

S_1

C_1

NaN

F

street_2

192

NaN

32.5

B+

2

S_1

C_1

1103.0

M

street_2

186

NaN

87.2

B+

3

S_1

C_1

1103.0

F

street_2

167

81.0

80.4

B+

4

S_1

C_1

1105.0

F

street_4

159

64.0

84.8

A-

# 3.用缺失值下面的值替换缺失值
df.fillna(method='backfill').head()



School

Class

ID

Gender

Address

Height

Weight

Math

Physics

0

S_1

C_1

1103.0

M

street_1

173

81.0

34.0

A+

1

S_1

C_1

1103.0

F

street_2

192

81.0

32.5

B+

2

S_1

C_1

1103.0

M

street_2

186

81.0

87.2

B+

3

S_1

C_1

1105.0

F

street_2

167

81.0

80.4

A-

4

S_1

C_1

1105.0

M

street_4

159

64.0

84.8

A-

2.2 剔除

主要用的是函数:DataFrame.dropna(axis=0, how=‘any’, thresh=None, subset=None, inplace=False)
删除含有空值的行或者列

# 1.删除行,只要哪行有空值就删掉哪行;
df.dropna().head()



School

Class

ID

Gender

Address

Height

Weight

Math

Physics

5

S_1

C_2

1201.0

M

street_5

159

68.0

97.0

A-

6

S_1

C_2

1202.0

F

street_4

176

94.0

63.5

B-

12

S_1

C_3

1303.0

M

street_7

188

82.0

49.7

B

17

S_2

C_1

2103.0

M

street_4

157

61.0

52.5

B-

21

S_2

C_2

2202.0

F

street_7

194

77.0

68.5

B+

# 2. 删除列,只要哪一列有空值,就删掉哪一列
df.dropna(axis= 1).head()



School

Address

Height

0

S_1

street_1

173

1

S_1

street_2

192

2

S_1

street_2

186

3

S_1

street_2

167

4

S_1

street_4

159

# 3.某一行所有值全为缺失值才删除
df.dropna(how= 'all').head()



School

Class

ID

Gender

Address

Height

Weight

Math

Physics

0

S_1

C_1

NaN

M

street_1

173

NaN

34.0

A+

1

S_1

C_1

NaN

F

street_2

192

NaN

32.5

B+

2

S_1

C_1

1103.0

M

street_2

186

NaN

87.2

B+

3

S_1

NaN

NaN

F

street_2

167

81.0

80.4

NaN

4

S_1

C_1

1105.0

NaN

street_4

159

64.0

84.8

A-

# 4.某一行至少出现过2个缺失值才删除
df.dropna(thresh=2).head()



School

Class

ID

Gender

Address

Height

Weight

Math

Physics

0

S_1

C_1

NaN

M

street_1

173

NaN

34.0

A+

1

S_1

C_1

NaN

F

street_2

192

NaN

32.5

B+

2

S_1

C_1

1103.0

M

street_2

186

NaN

87.2

B+

3

S_1

NaN

NaN

F

street_2

167

81.0

80.4

NaN

4

S_1

C_1

1105.0

NaN

street_4

159

64.0

84.8

A-

2.3插值
2.3.1 线性插值
s = pd.Series([1,10,15,-5,-2,np.nan,np.nan,28])
s.plot()

缺失值处理 非正态 EM_Math

s.interpolate().plot()

缺失值处理 非正态 EM_数据_02