1. A Quick Look at the Data Structure

import numpy as np
import pandas as pd
csv_path = "./datasets/housing/housing.csv"
housing = pd.read_csv(csv_path)
housing.head()

# Get a brief description of the dataset

housing.info()

# Output

RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)

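memory usage: 1.6+ MB

Note that the total_bedrooms attribute has only 20,433 non-null values, which means 207 districts are missing this feature. We will need to account for this later.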
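As a rough illustration of how such gaps can be counted directly, the sketch below uses a small made-up DataFrame (the `toy` frame and its values are assumptions, not the real housing data). On the real dataset the same calls would be expected to report 20640 - 20433 = 207 missing values for total_bedrooms.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Number of missing values in each column.
missing_counts = toy.isnull().sum()

# Fraction of rows with a missing value in each column.
missing_ratio = toy.isnull().mean()

print(missing_counts)
print(missing_ratio)
```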

# See which categories exist and how many districts fall into each

housing["ocean_proximity"].value_counts()

# Output

<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5

Name: ocean_proximity, dtype: int64

Next, plot a histogram of each numerical attribute for a more intuitive view of how the data is distributed.

%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins = 50 ,figsize = (20,15))

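plt.show()

The median income attribute does not appear to be expressed in US dollars (USD); it may have been preprocessed, which should be confirmed.

The median housing age and the median house value appear to be capped; this should also be confirmed.

The attributes have very different scales.

Many of the histograms are heavy-tailed: they extend much farther to the right of the median than to the left. This can make it harder for some machine learning algorithms to detect patterns. Later we will try transforming these attributes so that their distributions are closer to bell-shaped.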
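One common way to pull in a heavy right tail is a logarithmic transform. The sketch below is only an illustration on synthetic log-normal data (the sample, the seed, and the choice of `log1p` are assumptions; the notes do not say which transform is used later). It compares the skewness before and after the transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed sample standing in for a heavy-tailed attribute.
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

# log1p computes log(1 + x), so it is also defined for zero values.
transformed = np.log1p(values)

# Skewness close to 0 indicates a more symmetric, bell-like shape.
print("skew before:", values.skew())
print("skew after: ", transformed.skew())
```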

2. Visualizing the Data

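housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
             s=housing['population']/100, label='population',
             c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True)

plt.legend()

The radius of each circle represents the district's population, and the color represents the price.

From this we can see that housing prices are closely related to location (for example, proximity to the ocean) and to population density.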

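3. Looking for Correlations

# Pearson correlation coefficient
# numeric_only=True skips the text column ocean_proximity
# (required on pandas 2.x; the argument is available from pandas 1.5)

corr_matrix = housing.corr(numeric_only=True)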

corr_matrix['median_house_value'].sort_values(ascending = False)

# Output

median_house_value 1.000000
median_income 0.687160
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population -0.026920
longitude -0.047432
latitude -0.142724

Name: median_house_value, dtype: float64

Visualizing the correlations

from pandas.plotting import scatter_matrix

attributes = ['median_house_value' , 'median_income' , 'total_rooms' , 'housing_median_age']

scatter_matrix(housing[attributes], figsize=(12, 8))

The attribute with the most potential for predicting the median house value is the median income.

housing.plot(kind='scatter', x='median_income', y='median_house_value', alpha=0.1)

The correlation between the two is indeed strong: the upward trend is clearly visible and the points are not too dispersed.
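To make the Pearson coefficient used in this section concrete, here is a minimal sketch that computes it by hand and checks it against NumPy. The helper name `pearson_r` and the five income/value pairs are made up for illustration. Keep in mind that the coefficient only captures linear relationships.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up values, loosely shaped like income vs. house value.
income = [1.5, 2.0, 3.5, 5.0, 8.0]
value = [90_000, 120_000, 180_000, 260_000, 400_000]

r = pearson_r(income, value)
print(r)

# Cross-check against NumPy's implementation.
print(np.corrcoef(income, value)[0, 1])
```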

Next, try out different attribute combinations.

housing['rooms_per_household'] = housing['total_rooms']/housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms']/housing['total_rooms']
housing['population_per_household'] = housing['population']/housing['households']
corr_matrix = housing.corr(numeric_only=True)  # skip the text column ocean_proximity
corr_matrix['median_house_value'].sort_values(ascending = False)

# Output

median_house_value 1.000000
median_income 0.687160
rooms_per_household 0.146285
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population_per_household -0.021985
population -0.026920
longitude -0.047432
latitude -0.142724
bedrooms_per_room -0.259984
Name: median_house_value, dtype: float64

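The new bedrooms_per_room attribute is much more strongly correlated with the median house value than either the total number of rooms or the total number of bedrooms. Evidently, houses with a lower bedroom-to-room ratio tend to be more expensive. Likewise, the number of rooms per household is more informative than the total number of rooms: the larger the house, the higher the price.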

4. Data Cleaning

Fill missing values with the median: the proportion of missing values is small.
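The notes do not show the imputation code itself, so the following is only a minimal sketch using plain pandas on a made-up column (the `toy` frame is an assumption). A reasonable practice is to compute the median on the training data and reuse that same value for the test set.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries.
toy = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 200.0, np.nan]})

# median() ignores NaN by default.
median = toy["total_bedrooms"].median()

# Replace the missing entries with the median.
filled = toy["total_bedrooms"].fillna(median)

print(median)
print(filled.tolist())
```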

Text and categorical attributes: encode with LabelBinarizer

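from sklearn.preprocessing import LabelBinarizer
housing_cat = housing['ocean_proximity']  # assumption: the categorical column, not defined in these notes
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot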
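To show what this encoding produces, here is a small sketch on a hand-written list of categories (the list and the name `toy_encoder` are made up). With three or more classes, LabelBinarizer returns one column per class, ordered as in `classes_`; with exactly two classes it returns a single column instead.

```python
from sklearn.preprocessing import LabelBinarizer

# Made-up sample of ocean_proximity values.
categories = ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]

toy_encoder = LabelBinarizer()
one_hot = toy_encoder.fit_transform(categories)

# Classes are sorted alphabetically; each row has a single 1 in its class column.
print(toy_encoder.classes_)
print(one_hot)
```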

5. Transformation Pipelines

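from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed in scikit-learn 0.22
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# housing_num (the numeric columns), DataFrameSelector and CombinedAttributesAdder
# are assumed to be defined beforehand; these notes do not define them.
num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    # on scikit-learn >= 1.2 this argument is named sparse_output
    ('one_hot_encoder', OneHotEncoder(sparse=False)),
])
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])
housing_prepared = full_pipeline.fit_transform(housing)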
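The pipeline above relies on two custom transformers that these notes never define. The sketch below is one plausible way to write them, not necessarily the author's version: the column positions (3 to 6) are an assumption that holds only if the numeric columns keep the order shown in the info() output, and the order of the added ratios is chosen to match the extra_attribs list used later.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(TransformerMixin, BaseEstimator):
    """Select the named DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        return X[self.attribute_names].values

class CombinedAttributesAdder(TransformerMixin, BaseEstimator):
    """Append three ratio features; the column positions are assumptions."""
    def __init__(self, rooms_ix=3, bedrooms_ix=4, population_ix=5, households_ix=6):
        self.rooms_ix = rooms_ix
        self.bedrooms_ix = bedrooms_ix
        self.population_ix = population_ix
        self.households_ix = households_ix
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        rooms_per_hhold = X[:, self.rooms_ix] / X[:, self.households_ix]
        pop_per_hhold = X[:, self.population_ix] / X[:, self.households_ix]
        bedrooms_per_room = X[:, self.bedrooms_ix] / X[:, self.rooms_ix]
        return np.c_[X, rooms_per_hhold, pop_per_hhold, bedrooms_per_room]

# Usage on a one-row toy frame (made-up values).
toy = pd.DataFrame(
    [[-122.0, 37.0, 30.0, 10.0, 2.0, 20.0, 5.0, 3.5]],
    columns=["longitude", "latitude", "housing_median_age", "total_rooms",
             "total_bedrooms", "population", "households", "median_income"],
)
selected = DataFrameSelector(list(toy)).transform(toy)
out = CombinedAttributesAdder().transform(selected)
print(out.shape)
print(out[0, -3:])
```

Passing the column positions as constructor arguments keeps the sketch usable when the column order differs.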

6. Training the Model

Use cross-validation to select the best model.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
param_grid = [{'n_estimators':[3,10,30] , 'max_features':[2,4,6,8]} ,
{'bootstrap':[False] ,'n_estimators':[3,10] , 'max_features':[2,3,4]} ]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg , param_grid ,cv = 5 , scoring = 'neg_mean_squared_error' )
grid_search.fit(housing_prepared , housing_labels)

Inspecting feature importances

feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ['rooms_per_hhold' , 'pop_per_hhold' ,'bedrooms_per_room']
cat_one_hot_attribs = list(encoder.classes_)  # encoder is the LabelBinarizer fitted earlier
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances ,attributes) , reverse = True)

# Output

[(0.291635668838594, 'median_income'),
(0.17489109146974177, 'INLAND'),
(0.14439182119379482, 'bedrooms_per_room'),
(0.074869901101824, 'longitude'),
(0.07357815468357778, 'latitude'),
(0.06103587007543961, 'pop_per_hhold'),
(0.05051236255812475, 'rooms_per_hhold'),
(0.03817873258218327, 'housing_median_age'),
(0.017068304209294654, 'total_rooms'),
(0.016648949734042177, 'population'),
(0.016648735677349303, '<1H OCEAN'),
(0.01615807233330034, 'households'),
(0.01612266013911572, 'total_bedrooms'),
(0.005380418728999661, 'NEAR OCEAN'),
(0.0027898326726339853, 'NEAR BAY'),
(8.942400198404977e-05, 'ISLAND')]

7. Evaluating the System on the Test Set

from sklearn.metrics import mean_squared_error

# strat_test_set is the held-out test set; its creation is not shown in these notes
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop('median_house_value' , axis = 1)
y_test = strat_test_set['median_house_value'].copy()
X_test_prepared = full_pipeline.transform(X_test)  # transform only; never refit on test data
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test , final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

# Output

48058.86061404973
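For reference, the RMSE reported above is the square root of the mean squared prediction error, so it is expressed in the same unit as the house values. The sketch below works through the arithmetic on three made-up predictions. It also shows how a score from the 'neg_mean_squared_error' scorer used in the grid search (which is the negated MSE) converts to an RMSE.

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([200_000.0, 150_000.0, 300_000.0])
y_pred = np.array([210_000.0, 140_000.0, 330_000.0])

# Mean of the squared errors, then its square root.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(rmse)

# A 'neg_mean_squared_error' score is -MSE, so negate it before the square root.
neg_mse_score = -mse
rmse_from_score = np.sqrt(-neg_mse_score)
print(rmse_from_score)
```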