class sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

Main parameters:

1. missing_values : integer or "NaN", optional (default="NaN")
   The placeholder for missing values. It can be an integer or "NaN"; missing values encoded as numpy.nan are written as the string "NaN". Defaults to "NaN".
2. strategy : string, optional (default="mean")
   The replacement strategy, given as a string. Defaults to "mean".
   ① "mean": replace missing values using the mean along the axis.
   ② "median": replace missing values using the median along the axis.
   ③ "most_frequent": replace missing values using the most frequent value (the mode) along the axis.

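3. axis : the axis along which to impute. The default axis=0 imputes along columns; axis=1 imputes along rows.
4. copy : if True, the original data set is not modified; if False, imputation is done in place where possible. In the following cases a copy is made even when copy=False:
   ① X is not an array of floating-point values
   ② X is sparse and missing_values=0
   ③ axis=0 and X is a CSR matrix
   ④ axis=1 and X is a CSC matrix
5. statistics_ attribute : with axis=0, the array of fill values, one per feature; with axis=1, accessing it raises an error because the attribute does not exist.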
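To make the three strategies concrete, here is a minimal pure-Python sketch of what each one computes for a single column. The helper `impute_column` is hypothetical and is not part of scikit-learn; it only illustrates the arithmetic, and scikit-learn may break ties for "most_frequent" differently.

```python
import math
from collections import Counter
from statistics import mean, median


def impute_column(values, strategy="mean"):
    """Fill NaN entries of one column (hypothetical helper, not a scikit-learn API)."""
    # Statistics are computed from the observed (non-NaN) values only
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        # Most common observed value; ties go to the value seen first in this sketch
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if math.isnan(v) else v for v in values]


column = [1.0, float("nan"), 2.0, 2.0, 7.0]
print(impute_column(column, "mean"))           # [1.0, 3.0, 2.0, 2.0, 7.0]
print(impute_column(column, "median"))         # [1.0, 2.0, 2.0, 2.0, 7.0]
print(impute_column(column, "most_frequent"))  # [1.0, 2.0, 2.0, 2.0, 7.0]
```

The observed values are 1, 2, 2 and 7, so the mean is 3.0 while both the median and the mode are 2.0.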

Note:
Imputer only accepts numeric data. A NumPy array can be passed directly; if a DataFrame is passed, every column in it must be numeric.

1. When there are only a few numeric columns, select those columns individually

import pandas as pd
import numpy as np
df = pd.DataFrame([["XXL", 8, "black", "class 1", 22],
                   ["L", np.nan, "gray", "class 2", 20],
                   ["XL", 10, "blue", "class 2", 19],
                   ["M", np.nan, "orange", "class 1", 17],
                   ["M", 11, "green", "class 3", np.nan],
                   ["M", 7, "red", "class 1", 22]])
df
     0     1       2        3     4
0  XXL   8.0   black  class 1  22.0
1    L   NaN    gray  class 2  20.0
2   XL  10.0    blue  class 2  19.0
3    M   NaN  orange  class 1  17.0
4    M  11.0   green  class 3   NaN
5    M   7.0     red  class 1  22.0
df.columns = ["size", "price", "color", "class", "boh"]
print(df)
  size  price   color    class   boh
0  XXL    8.0   black  class 1  22.0
1    L    NaN    gray  class 2  20.0
2   XL   10.0    blue  class 2  19.0
3    M    NaN  orange  class 1  17.0
4    M   11.0   green  class 3   NaN
5    M    7.0     red  class 1  22.0
from sklearn.preprocessing import Imputer

# 1. Create the Imputer
imp = Imputer(missing_values="NaN", strategy="mean", axis=0)
# Impute only the price column first. Note the double brackets: df[["price"]] returns a DataFrame (2-D), not a Series
# 2. Call fit_transform() to fill in the missing values
df["price"] = imp.fit_transform(df[["price"]])
df
  size  price   color    class   boh
0  XXL    8.0   black  class 1  22.0
1    L    9.0    gray  class 2  20.0
2   XL   10.0    blue  class 2  19.0
3    M    9.0  orange  class 1  17.0
4    M   11.0   green  class 3   NaN
5    M    7.0     red  class 1  22.0
df[["price"]]
   price
0    8.0
1    9.0
2   10.0
3    9.0
4   11.0
5    7.0
# Impute the price and boh columns in one call
df[['price', 'boh']] = imp.fit_transform(df[['price', 'boh']])
df
  size  price   color    class   boh
0  XXL    8.0   black  class 1  22.0
1    L    9.0    gray  class 2  20.0
2   XL   10.0    blue  class 2  19.0
3    M    9.0  orange  class 1  17.0
4    M   11.0   green  class 3  20.0
5    M    7.0     red  class 1  22.0

2. When there are many numeric columns and only a few text or categorical attributes, drop the text attributes first, impute, and then merge them back

from sklearn.preprocessing import Imputer

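# 1. Create the Imputer
imputer = Imputer(strategy="median")
# housing is assumed to be an existing DataFrame; it has only one text attribute, so drop it first
housing_num = housing.drop("ocean_proximity", axis=1)
# 2. Call fit_transform()
X = imputer.fit_transform(housing_num)
# The result is a NumPy array; convert it back to a DataFrame.
# Reusing the original index keeps the rows aligned when the text column is added back.
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

# Add the text attribute back
housing_tr['ocean_proximity'] = housing["ocean_proximity"]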

housing_tr[:2]
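The drop, impute, and merge steps can also be sketched without naming the text column explicitly. The example below is only an illustration under several assumptions: it uses a made-up toy DataFrame instead of the housing data, it selects numeric columns with pandas' `select_dtypes`, and it uses `sklearn.impute.SimpleImputer`, the class that replaced `Imputer` in newer scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the housing data; column names and values are made up
toy = pd.DataFrame(
    {
        "rooms": [3.0, np.nan, 5.0, 4.0],
        "price": [100.0, 200.0, np.nan, 400.0],
        "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
    },
    index=[10, 11, 12, 13],  # deliberately not 0..n-1
)

# Pick out the numeric columns instead of dropping text columns by name
num_cols = toy.select_dtypes(include="number").columns
imputer = SimpleImputer(strategy="median")

filled = toy.copy()
# Writing the array back into the selected columns keeps the index and the text column
filled[num_cols] = imputer.fit_transform(toy[num_cols])
print(filled)
```

Because the imputed values are written back into a copy of the original frame, the text column never has to be re-attached and a non-default index cannot cause misaligned rows.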

3. Example 1:

import numpy as np
from sklearn.preprocessing import Imputer
train_X = np.array([[1, 2], [np.nan, 3], [7, 6]])
imp = Imputer(missing_values=np.nan , strategy='mean', axis=0)
imp.fit(train_X)
Imputer(axis=0, copy=True, missing_values=nan, strategy='mean', verbose=0)
imp.statistics_
array([ 4.        ,  3.66666667])
test_X = np.array([[np.nan, 2], [6, np.nan], [7, 6]])
imp.transform(test_X)
array([[ 4.        ,  2.        ],
       [ 6.        ,  3.66666667],
       [ 7.        ,  6.        ]])
# fit_transform refits the imputer on test_X, so statistics_ is recomputed from test_X
imp.fit_transform(test_X)
array([[ 6.5,  2. ],
       [ 6. ,  4. ],
       [ 7. ,  6. ]])
imp.statistics_
array([ 6.5,  4. ])
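`Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22. On a newer installation, the same fit-on-train, transform-on-test workflow can be sketched with `sklearn.impute.SimpleImputer`, which has no `axis` parameter and always imputes column by column. This assumes a scikit-learn version that ships `sklearn.impute`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train_X = np.array([[1, 2], [np.nan, 3], [7, 6]])
test_X = np.array([[np.nan, 2], [6, np.nan], [7, 6]])

# Learn the column means from the training data only
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(train_X)
print(imp.statistics_)

# Fill the test data with the means learned from train_X
filled_test = imp.transform(test_X)
print(filled_test)
```

Calling `transform` rather than `fit_transform` on the test data keeps the fill values tied to the training set, which is usually what is wanted when the imputer is part of a model pipeline.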

4. Example 2:

import pandas as pd
from io import StringIO
csv_data = '''A,B,C,D
1,2,3,4
5,6,,8
0,11,12,'''

csv_data 
'A,B,C,D\n1,2,3,4\n5,6,,8\n0,11,12,'
df = pd.read_csv(StringIO(csv_data))
print(df)
# Count the missing values in each column
print(df.isnull().sum())
print(df.values)
   A   B     C    D
0  1   2   3.0  4.0
1  5   6   NaN  8.0
2  0  11  12.0  NaN
A    0
B    0
C    1
D    1
dtype: int64
[[  1.   2.   3.   4.]
 [  5.   6.  nan   8.]
 [  0.  11.  12.  nan]]
# Drop the rows that contain missing values (dropna returns a new DataFrame; df itself is unchanged)
print(df.dropna())
print('after:\n', df)
   A  B    C    D
0  1  2  3.0  4.0
after:
    A   B     C    D
0  1   2   3.0  4.0
1  5   6   NaN  8.0
2  0  11  12.0  NaN
from sklearn.preprocessing import Imputer

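# axis=0: impute along columns; axis=1: impute along rows
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr.fit(df.values)  # fit computes the fill value (the column mean) for each column
imputed_data = imr.transform(df.values)  # transform fills the missing values with those means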
print(imputed_data)
[[  1.    2.    3.    4. ]
 [  5.    6.    7.5   8. ]
 [  0.   11.   12.    6. ]]
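For a quick check of the result above, the same column-mean fill can be reproduced with pandas alone. This is a different technique from `Imputer` (plain `DataFrame.fillna` with the column means), shown only as a cross-check on the same CSV data.

```python
from io import StringIO

import pandas as pd

csv_data = "A,B,C,D\n1,2,3,4\n5,6,,8\n0,11,12,"
df = pd.read_csv(StringIO(csv_data))

# df.mean() skips NaN by default, so each column mean uses only the observed values
filled = df.fillna(df.mean())
print(filled.values)
```

Unlike a fitted imputer, this computes the means from the same data it fills, so it does not carry fill values over from a training set to new data.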

5. References:

https://blog.csdn.net/kancy110/article/details/75041923
https://blog.csdn.net/dss_dssssd/article/details/82831240