class sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)
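Main parameters:
1. missing_values: integer or "NaN", optional (default="NaN")
The placeholder for missing values. It can be an integer or NaN; missing values encoded as numpy.nan are specified with the string "NaN". The default is "NaN".
2. strategy: string, optional (default="mean")
The imputation strategy, given as a string. The default, "mean", replaces missing values with the mean.
① "mean": replace missing values using the mean along the axis.
② "median": replace missing values using the median along the axis.
③ "most_frequent": replace missing values using the most frequent value (the mode) along the axis.
3. axis: the axis along which to impute. The default axis=0 imputes along columns; axis=1 imputes along rows.
4. copy: if True, the original data is left unmodified and a copy is returned. If False, imputation is done in place where possible. Even with copy=False, a new copy is made in the following cases:
① X is not an array of floating-point values
② X is sparse and missing_values=0
③ axis=0 and X is a CSR matrix
④ axis=1 and X is a CSC matrix
5. statistics_ attribute: with axis=0, an array holding the fill value for each feature. With axis=1, accessing it raises an error because the attribute does not exist.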
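To make the three strategies concrete, here is a minimal NumPy sketch of the column-wise (axis=0) statistics and of the fill step. The helper names `column_mode` and `fill` are hypothetical and used only for illustration, and breaking ties by the smallest value is a choice made in this sketch; it is not a reproduction of Imputer's internals.

```python
import numpy as np
from collections import Counter

X = np.array([[1.0, 2.0],
              [np.nan, 2.0],
              [7.0, 6.0],
              [1.0, np.nan]])

# Column-wise (axis=0) statistics that skip NaN entries.
col_mean = np.nanmean(X, axis=0)      # strategy="mean"
col_median = np.nanmedian(X, axis=0)  # strategy="median"

def column_mode(col):
    """Most frequent non-NaN value of a 1-D array (hypothetical helper).

    Ties are broken by taking the smallest value; this is a choice of the sketch.
    """
    values = col[~np.isnan(col)]
    counts = Counter(values.tolist())
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

# strategy="most_frequent"
col_mode = np.array([column_mode(X[:, j]) for j in range(X.shape[1])])

def fill(X, fill_values):
    """Return a copy of X with each NaN replaced by its column's fill value."""
    out = X.copy()
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = fill_values[cols]
    return out

X_mean = fill(X, col_mean)
print(col_mean, col_median, col_mode)
print(X_mean)
```

Because `fill` works on a copy, `X` itself still contains its NaN entries afterwards, which mirrors the behaviour described for copy=True.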
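Note:
Imputer requires 2-D numeric input (a NumPy array, a sparse matrix, or a DataFrame). When a DataFrame is passed, all of its columns must be numeric.
1. If only a few columns are numeric, select the numeric columns and impute them separately.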
import pandas as pd
import numpy as np
df = pd.DataFrame([["XXL", 8, "black", "class 1", 22],
["L", np.nan, "gray", "class 2", 20],
["XL", 10, "blue", "class 2", 19],
["M", np.nan, "orange", "class 1", 17],
["M", 11, "green", "class 3", np.nan],
["M", 7, "red", "class 1", 22]])
df
     0     1       2        3     4
0  XXL   8.0   black  class 1  22.0
1    L   NaN    gray  class 2  20.0
2   XL  10.0    blue  class 2  19.0
3    M   NaN  orange  class 1  17.0
4    M  11.0   green  class 3   NaN
5    M   7.0     red  class 1  22.0
df.columns = ["size", "price", "color", "class", "boh"]
print(df)
  size  price   color    class   boh
0  XXL    8.0   black  class 1  22.0
1    L    NaN    gray  class 2  20.0
2   XL   10.0    blue  class 2  19.0
3    M    NaN  orange  class 1  17.0
4    M   11.0   green  class 3   NaN
5    M    7.0     red  class 1  22.0
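# Imputer was removed in scikit-learn 0.22; this import requires an older release.
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values="NaN", strategy="mean", axis=0)
# fit_transform returns a 2-D array of shape (n, 1); flatten it before assigning to one column
df["price"] = imp.fit_transform(df[["price"]]).ravel()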
df
  size  price   color    class   boh
0  XXL    8.0   black  class 1  22.0
1    L    9.0    gray  class 2  20.0
2   XL   10.0    blue  class 2  19.0
3    M    9.0  orange  class 1  17.0
4    M   11.0   green  class 3   NaN
5    M    7.0     red  class 1  22.0
df[["price"]]
   price
0    8.0
1    9.0
2   10.0
3    9.0
4   11.0
5    7.0
df[['price', 'boh']] = imp.fit_transform(df[['price', 'boh']])
df
  size  price   color    class   boh
0  XXL    8.0   black  class 1  22.0
1    L    9.0    gray  class 2  20.0
2   XL   10.0    blue  class 2  19.0
3    M    9.0  orange  class 1  17.0
4    M   11.0   green  class 3  20.0
5    M    7.0     red  class 1  22.0
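Imputer was deprecated in scikit-learn 0.20 and removed in 0.22, so the import above fails on current releases. The sketch below repeats the same column-wise mean imputation with sklearn.impute.SimpleImputer, assuming a recent scikit-learn is installed. SimpleImputer has no axis parameter and always imputes along columns; the `num_cols` variable is introduced here only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    [["XXL", 8, "black", "class 1", 22],
     ["L", np.nan, "gray", "class 2", 20],
     ["XL", 10, "blue", "class 2", 19],
     ["M", np.nan, "orange", "class 1", 17],
     ["M", 11, "green", "class 3", np.nan],
     ["M", 7, "red", "class 1", 22]],
    columns=["size", "price", "color", "class", "boh"],
)

# Missing values are given as np.nan rather than the string "NaN".
imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# Impute only the numeric columns; the text columns are left untouched.
num_cols = ["price", "boh"]
df[num_cols] = imp.fit_transform(df[num_cols])

print(imp.statistics_)  # per-column fill values learned during fit
print(df)
```

For row-wise imputation (the old axis=1), one option is to transpose the numeric block before and after the call, at the cost of treating each row as a feature.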
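2. If most columns are numeric and only a few are text or categorical attributes, drop the text attributes first, impute the rest, and then merge the text attributes back.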
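import pandas as pd
from sklearn.preprocessing import Imputer
# housing is assumed to be an existing DataFrame whose only non-numeric column is "ocean_proximity"
imputer = Imputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)
X = imputer.fit_transform(housing_num)
# keep the original index so the text column re-aligns row by row
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
housing_tr["ocean_proximity"] = housing["ocean_proximity"]
housing_tr[:2]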
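When there are several text columns, dropping them by name gets tedious. The sketch below selects the numeric columns with `select_dtypes` instead and joins the remaining columns back afterwards. The small `housing` frame and its values are made up for illustration (the real dataset is not part of this post), and SimpleImputer stands in for Imputer on the assumption that a recent scikit-learn is installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the housing data, with a non-default index.
housing = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, np.nan, 1274.0],
        "median_income": [8.3, np.nan, 7.2, 5.6],
        "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
    },
    index=[10, 11, 12, 13],
)

# Pick every numeric column automatically instead of dropping text columns by name.
housing_num = housing.select_dtypes(include="number")

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(housing_num)

# Rebuild a DataFrame with the original column names and index,
# then join the non-numeric columns back on that index.
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
housing_tr = housing_tr.join(housing.drop(columns=housing_num.columns))

print(housing_tr)
```

Passing `index=housing_num.index` matters here: without it the rebuilt frame would get a fresh 0..n-1 index and the join would not line up with rows 10 to 13.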
3. Example 1:
import numpy as np
from sklearn.preprocessing import Imputer
train_X = np.array([[1, 2], [np.nan, 3], [7, 6]])
imp = Imputer(missing_values=np.nan , strategy='mean', axis=0)
imp.fit(train_X)
Imputer(axis=0, copy=True, missing_values=nan, strategy='mean', verbose=0)
imp.statistics_
array([ 4. , 3.66666667])
test_X = np.array([[np.nan, 2], [6, np.nan], [7, 6]])
imp.transform(test_X)
array([[ 4. , 2. ],
[ 6. , 3.66666667],
[ 7. , 6. ]])
imp.fit_transform(test_X)
array([[ 6.5, 2. ],
[ 6. , 4. ],
[ 7. , 6. ]])
imp.statistics_
array([ 6.5, 4. ])
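The session above shows that transform reuses the statistics learned from the training data, while fit_transform on the test data recomputes them and overwrites statistics_. A minimal NumPy sketch of that fit/transform split follows; `MeanImputerSketch` is a hypothetical class written for illustration and is not how scikit-learn implements it.

```python
import numpy as np

class MeanImputerSketch:
    """Hypothetical, minimal column-wise mean imputer (illustration only)."""

    def fit(self, X):
        # Learn one fill value per column, ignoring NaN entries.
        X = np.asarray(X, dtype=float)
        self.statistics_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        # np.array copies its input, so the caller's array is not modified.
        X = np.array(X, dtype=float)
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.statistics_[cols]
        return X

    def fit_transform(self, X):
        # Refits on X, replacing any previously learned statistics.
        return self.fit(X).transform(X)

train_X = np.array([[1, 2], [np.nan, 3], [7, 6]])
test_X = np.array([[np.nan, 2], [6, np.nan], [7, 6]])

imp = MeanImputerSketch().fit(train_X)
train_stats = imp.statistics_.copy()   # means of the training columns
test_filled = imp.transform(test_X)    # test data filled with training means

refit_filled = imp.fit_transform(test_X)  # statistics now come from test_X
refit_stats = imp.statistics_.copy()

print(train_stats, refit_stats)
```

In a typical train/test workflow the first pattern (fit on the training set, transform the test set) is the one to use, since it keeps test-set information out of the fitted statistics.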
4. Example 2:
import pandas as pd
from io import StringIO
csv_data = '''A,B,C,D
1,2,3,4
5,6,,8
0,11,12,'''
csv_data
'A,B,C,D\n1,2,3,4\n5,6,,8\n0,11,12,'
df = pd.read_csv(StringIO(csv_data))
print(df)
print(df.isnull().sum())
print(df.values)
A B C D
0 1 2 3.0 4.0
1 5 6 NaN 8.0
2 0 11 12.0 NaN
A 0
B 0
C 1
D 1
dtype: int64
[[ 1. 2. 3. 4.]
[ 5. 6. nan 8.]
[ 0. 11. 12. nan]]
print(df.dropna())
print('after:\n', df)
A B C D
0 1 2 3.0 4.0
after:
A B C D
0 1 2 3.0 4.0
1 5 6 NaN 8.0
2 0 11 12.0 NaN
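As the output above shows, dropna returns a new DataFrame and leaves df unchanged. The sketch below, using the same CSV data, illustrates a few other dropna options that control which rows or columns are removed; the variable names are introduced here only for illustration.

```python
import pandas as pd
from io import StringIO

csv_data = "A,B,C,D\n1,2,3,4\n5,6,,8\n0,11,12,"
df = pd.read_csv(StringIO(csv_data))

rows_dropped = df.dropna()                # drop rows that contain any NaN
cols_dropped = df.dropna(axis=1)          # drop columns that contain any NaN
subset_dropped = df.dropna(subset=["C"])  # only look for NaN in column C
thresh_kept = df.dropna(thresh=4)         # keep rows with at least 4 non-NaN values

print(rows_dropped)
print(cols_dropped)
print(subset_dropped)
print(thresh_kept)
```

Dropping is simple but discards data: here two of the three rows disappear, which is why imputing the missing values is often preferable.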
from sklearn.preprocessing import Imputer
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr.fit(df)
imputed_data = imr.transform(df.values)
print(imputed_data)
[[ 1. 2. 3. 4. ]
[ 5. 6. 7.5 8. ]
[ 0. 11. 12. 6. ]]
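For a quick column-mean fill without scikit-learn, pandas alone can produce the same numbers. This is a sketch of an alternative, not the approach used above: it recomputes the means on whatever frame it is given and does not store them for reuse on new data.

```python
import pandas as pd
from io import StringIO

csv_data = "A,B,C,D\n1,2,3,4\n5,6,,8\n0,11,12,"
df = pd.read_csv(StringIO(csv_data))

# df.mean() skips NaN by default, so it yields one fill value per column;
# fillna then matches those values to columns by name.
filled = df.fillna(df.mean())

print(filled)
```

The filled values in columns C and D (7.5 and 6.0) match the imputed_data printed above.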
5. References:
https://blog.csdn.net/kancy110/article/details/75041923
https://blog.csdn.net/dss_dssssd/article/details/82831240