我们知道Pandas中最常用的两种数据类型就是Series 和DataFrame,因此主要学习者两种数据类型的缺失数据的处理方法
拿到一份数据,首先查看数据完整性,了解缺失信息;
然后针对缺失的数据,是要进行填充?还是删除?还是插值?选择合适的方式对拿到的数据进行数据预处理。
1. 了解缺失信息:主要就是用isna和notna方法,对于isna方法,如果为缺失值,则返回True,如果不为缺失值,则返回False;notna方法则反之
import numpy as np
import pandas as pd
#首选读取一份数据
df = pd.read_csv('E:/data/table_missing.csv')
#这样子读出来的数据结构是DataFrame
df.head()
School | Class | ID | Gender | Address | Height | Weight | Math | Physics | |
0 | S_1 | C_1 | NaN | M | street_1 | 173 | NaN | 34.0 | A+ |
1 | S_1 | C_1 | NaN | F | street_2 | 192 | NaN | 32.5 | B+ |
2 | S_1 | C_1 | 1103.0 | M | street_2 | 186 | NaN | 87.2 | B+ |
3 | S_1 | NaN | NaN | F | street_2 | 167 | 81.0 | 80.4 | NaN |
4 | S_1 | C_1 | 1105.0 | NaN | street_4 | 159 | 64.0 | 84.8 | A- |
#然后通过isna方法或者notna方法来查看缺失值
df['Physics'].isna().head()
0 False
1 False
2 False
3 True
4 False
Name: Physics, dtype: bool
有时候对于一份数据,我们并不关心某一个缺失值,而是关心某个需要分析的属性值的数据完整性。如果该属性数据缺失较少,
可以考虑用合适的插值方法来对数据进行拟合(模拟);如果数据缺失较多,或许该数据数据就不适合应用到我们的分析当中。
可以利用如下方法来了解拿到手的数据的一个完整性情况:
#计算每一列有多少个缺失值
df.isna().sum()
School 0
Class 4
ID 6
Gender 7
Address 0
Height 0
Weight 13
Math 5
Physics 4
dtype: int64
#也可以通过info函数来查看每一列的数据缺失情况
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 School 35 non-null object
1 Class 31 non-null object
2 ID 29 non-null float64
3 Gender 28 non-null object
4 Address 35 non-null object
5 Height 35 non-null int64
6 Weight 22 non-null float64
7 Math 30 non-null float64
8 Physics 31 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 2.6+ KB
2.处理缺失值
2.1 填充
主要用到的方法是:DataFrame.fillna(value = None,method = None,axis = None, inplace = False,limit=None, downcast=None, **kwargs)
其中,
value是要用去填充缺失值的值;
# 1.直接用missing去填充所有的缺失值
df.fillna('missing').head()
School | Class | ID | Gender | Address | Height | Weight | Math | Physics | |
0 | S_1 | C_1 | 1103.0 | M | street_1 | 173 | 81.0 | 34.0 | A+ |
1 | S_1 | C_1 | 1103.0 | F | street_2 | 192 | 81.0 | 32.5 | B+ |
2 | S_1 | C_1 | 1103.0 | M | street_2 | 186 | 81.0 | 87.2 | B+ |
3 | S_1 | C_1 | 1105.0 | F | street_2 | 167 | 81.0 | 80.4 | A- |
4 | S_1 | C_1 | 1105.0 | M | street_4 | 159 | 64.0 | 84.8 | A- |
# 2.用缺失值上面的值替换缺失值
df.fillna(method= 'ffill').head()
School | Class | ID | Gender | Address | Height | Weight | Math | Physics | |
0 | S_1 | C_1 | NaN | M | street_1 | 173 | NaN | 34.0 | A+ |
1 | S_1 | C_1 | NaN | F | street_2 | 192 | NaN | 32.5 | B+ |
2 | S_1 | C_1 | 1103.0 | M | street_2 | 186 | NaN | 87.2 | B+ |
3 | S_1 | C_1 | 1103.0 | F | street_2 | 167 | 81.0 | 80.4 | B+ |
4 | S_1 | C_1 | 1105.0 | F | street_4 | 159 | 64.0 | 84.8 | A- |
# 3.用缺失值下面的值替换缺失值
df.fillna(method='backfill').head()
School | Class | ID | Gender | Address | Height | Weight | Math | Physics | |
0 | S_1 | C_1 | 1103.0 | M | street_1 | 173 | 81.0 | 34.0 | A+ |
1 | S_1 | C_1 | 1103.0 | F | street_2 | 192 | 81.0 | 32.5 | B+ |
2 | S_1 | C_1 | 1103.0 | M | street_2 | 186 | 81.0 | 87.2 | B+ |
3 | S_1 | C_1 | 1105.0 | F | street_2 | 167 | 81.0 | 80.4 | A- |
4 | S_1 | C_1 | 1105.0 | M | street_4 | 159 | 64.0 | 84.8 | A- |
2.2 剔除
主要用的是函数:DataFrame.dropna(axis=0, how=‘any’, thresh=None, subset=None, inplace=False)
删除含有空值的行或者列
# 1.删除行,只要哪行有空值就删掉哪行;
df.dropna().head()
School | Class | ID | Gender | Address | Height | Weight | Math | Physics | |
5 | S_1 | C_2 | 1201.0 | M | street_5 | 159 | 68.0 | 97.0 | A- |
6 | S_1 | C_2 | 1202.0 | F | street_4 | 176 | 94.0 | 63.5 | B- |
12 | S_1 | C_3 | 1303.0 | M | street_7 | 188 | 82.0 | 49.7 | B |
17 | S_2 | C_1 | 2103.0 | M | street_4 | 157 | 61.0 | 52.5 | B- |
21 | S_2 | C_2 | 2202.0 | F | street_7 | 194 | 77.0 | 68.5 | B+ |
# 2. 删除列,只要哪一列有空值,就删掉哪一列
df.dropna(axis= 1).head()
School | Address | Height | |
0 | S_1 | street_1 | 173 |
1 | S_1 | street_2 | 192 |
2 | S_1 | street_2 | 186 |
3 | S_1 | street_2 | 167 |
4 | S_1 | street_4 | 159 |
# 3.某一行所有值全为缺失值才删除
df.dropna(how= 'all').head()
School | Class | ID | Gender | Address | Height | Weight | Math | Physics | |
0 | S_1 | C_1 | NaN | M | street_1 | 173 | NaN | 34.0 | A+ |
1 | S_1 | C_1 | NaN | F | street_2 | 192 | NaN | 32.5 | B+ |
2 | S_1 | C_1 | 1103.0 | M | street_2 | 186 | NaN | 87.2 | B+ |
3 | S_1 | NaN | NaN | F | street_2 | 167 | 81.0 | 80.4 | NaN |
4 | S_1 | C_1 | 1105.0 | NaN | street_4 | 159 | 64.0 | 84.8 | A- |
# 4.某一行至少出现过2个缺失值才删除
df.dropna(thresh=2).head()
School | Class | ID | Gender | Address | Height | Weight | Math | Physics | |
0 | S_1 | C_1 | NaN | M | street_1 | 173 | NaN | 34.0 | A+ |
1 | S_1 | C_1 | NaN | F | street_2 | 192 | NaN | 32.5 | B+ |
2 | S_1 | C_1 | 1103.0 | M | street_2 | 186 | NaN | 87.2 | B+ |
3 | S_1 | NaN | NaN | F | street_2 | 167 | 81.0 | 80.4 | NaN |
4 | S_1 | C_1 | 1105.0 | NaN | street_4 | 159 | 64.0 | 84.8 | A- |
2.3插值
2.3.1 线性插值
s = pd.Series([1,10,15,-5,-2,np.nan,np.nan,28])
s.plot()
s.interpolate().plot()