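Analysis goals:

  1. Using the cleaned housing data, identify the feature variables that have a significant effect on price.
  2. With those features fixed, build a price prediction model for Shenzhen and use it to simulate a hypothetical scenario.

Data preprocessing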

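import os

import pandas as pd

# Raw string so the backslashes in the Windows path are not treated as escapes
file_path = r"D:\Python数据分析与挖掘实战\深圳二手房价分析\data"
# List every file in the file_path directory
file_name = os.listdir(file_path)
# Read each Excel file into a list, then concatenate once
# (DataFrame.append was removed in pandas 2.0)
lis = []
for i in file_name:
    file = pd.read_excel(os.path.join(file_path, i))
    lis.append(file)
df = pd.concat(lis)
# Rename the first column
df = df.rename(columns={'Unnamed: 0': 'house_id'})
# Summary statistics
print(df.describe())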

out:

           house_id       roomnum  ...        subway     per_price
count  1.851400e+04  18514.000000  ...  18514.000000  18514.000000
mean   2.280900e+08      2.873339  ...      0.504159      6.118192
std    3.031648e+06      1.040839  ...      0.499996      3.050218
min    4.217338e+06      1.000000  ...      0.000000      1.010100
25%    2.276957e+08      2.000000  ...      0.000000      4.052600
50%    2.284619e+08      3.000000  ...      1.000000      5.246300
75%    2.288307e+08      3.000000  ...      1.000000      7.357400
max    2.289965e+08      9.000000  ...      1.000000     26.396800

[8 rows x 8 columns]


# Inspect the columns and check for missing values. There are 10 fields in total; house_id carries no information for the analysis, so drop it.

df=df.drop(columns='house_id')

print(df.info())

out:

 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   district   18514 non-null  object 
 1   roomnum    18514 non-null  int64  
 2   hall       18514 non-null  int64  
 3   AREA       18514 non-null  float64
 4   C_floor    18514 non-null  object 
 5   floor_num  18514 non-null  int64  
 6   school     18514 non-null  int64  
 7   subway     18514 non-null  int64  
 8   per_price  18514 non-null  float64
dtypes: float64(2), int64(5), object(2)


# Only the price per square metre is present, so add total_price (area times unit price) as an extra dimension for the analysis
df['total_price']=df['AREA']*df['per_price']
print(df['total_price'])

out:

0       632.002890
1       879.995700
2       110.000800
3        93.990400
4       395.998200
           ...    
1487    116.000040
1488    119.999383
1489    145.001298
1490    128.999772
1491     80.999928
Name: total_price, Length: 18514, dtype: float64


# Check for duplicate rows
print(df.duplicated().sum())

out:

0

# Map the pinyin district codes to their Chinese names
area_map = {'baoan': '宝安', 'dapengxinqu': '大鹏新区', 'futian': '福田', 'guangming': '光明',
            'longhua': '龙华', 'luohu': '罗湖', 'nanshan': '南山', 'pingshan': '坪山',
            'yantian': '盐田', 'longgang': '龙岗'}
df['district'] = df['district'].apply(lambda x: area_map[x])

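Feature analysis

1. district

[Figure: unit price, total price, and listing count by district, with box plot]

The figure shows:

  1. Nanshan has the highest average unit price and Dapeng New District the lowest.
  2. Nanshan also has the highest average total price; Pingshan has the lowest.
  3. There are 18514 second-hand listings in total. Luohu has the most, close to 18%.
  4. In the box plot the centre of the box shifts clearly from district to district, which indicates that price depends on district.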
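
The per-district figures described above can be computed with a single groupby. This is a minimal sketch on a tiny made-up frame: the rows in `toy` are hypothetical and not taken from the dataset, and only the column names `district`, `per_price`, and `total_price` come from the code earlier in this post.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real data has 18514 listings.
toy = pd.DataFrame({
    "district": ["南山", "南山", "罗湖", "罗湖", "罗湖", "坪山"],
    "per_price": [10.0, 12.0, 6.0, 5.0, 7.0, 3.0],
    "total_price": [900.0, 1100.0, 400.0, 350.0, 450.0, 200.0],
})

# One row per district: mean unit price, mean total price, number of listings.
summary = toy.groupby("district").agg(
    mean_per_price=("per_price", "mean"),
    mean_total_price=("total_price", "mean"),
    listings=("per_price", "size"),
)
# Share of all listings that falls in each district, as a percentage.
summary["share_pct"] = 100 * summary["listings"] / len(toy)
print(summary.sort_values("mean_per_price", ascending=False))
```

Sorting by `mean_per_price` puts the most expensive district first, which is the comparison the first finding relies on.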

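2. roomnum

[Figure: average unit price by number of rooms]

The figure shows:

  1. Homes with 6 rooms have the highest average unit price.
  2. The number of bedrooms has no obvious effect on the average unit price.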

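3. hall

[Figure: average unit price by number of halls]

The figure shows:

  1. Homes with 3 halls have the highest average unit price.
  2. The number of halls has some effect on the average unit price.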

4. C_floor

[Figure: average unit price by floor level (C_floor)]

The figure shows:

  1. Floor level has little effect on the average price.

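5. school and subway

[Figure: average unit price by school and subway indicators]

The figure shows:

  1. Second-hand homes near a subway station have a clearly higher average price than those that are not.
  2. A nearby school affects price less than a subway station does.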
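
One way to put a number on "subway matters more than school" is to compare the gap in mean unit price between the 1 and 0 groups of each indicator. The sketch below uses invented rows, and `mean_gap` is a hypothetical helper written for this illustration; only the column names `subway`, `school`, and `per_price` come from the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "subway": [1, 1, 0, 0, 1, 0],
    "school": [1, 0, 1, 0, 1, 0],
    "per_price": [8.0, 7.0, 5.0, 4.0, 9.0, 6.0],
})

def mean_gap(frame, flag):
    """Mean per_price where flag == 1 minus mean per_price where flag == 0."""
    means = frame.groupby(flag)["per_price"].mean()
    return means.loc[1] - means.loc[0]

subway_gap = mean_gap(toy, "subway")
school_gap = mean_gap(toy, "school")
print("subway gap:", subway_gap)
print("school gap:", school_gap)
```

A raw difference in means ignores the other features, so it is only a first look and not a substitute for the model fitted later.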

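6. AREA

[Figure: distribution of listings by area]

import matplotlib.pyplot as plt

plt.scatter(df.AREA, df.per_price, marker='x', color='b', alpha=0.5)
plt.title('AREA vs. price per square metre (per_price)')
plt.ylabel("Price per square metre")
plt.xlabel("Area (square metres)")
plt.show()

[Figure: scatter plot of AREA against per_price]

The figures show:

  1. Most listings are smaller units of 68-102 square metres, which make up almost half of the total.
  2. The unit price fluctuates visibly as area changes, so area has some effect on unit price.
  3. Small units are clearly more popular than large ones.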
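
The share of listings in the 68-102 square metre band can be checked by binning `AREA`. This is a sketch under assumptions: the areas are made up, and the bin edges other than 68 and 102 are arbitrary choices for the example.

```python
import pandas as pd

# Hypothetical areas in square metres, for illustration only.
areas = pd.Series([45, 70, 75, 88, 95, 101, 120, 150, 68, 102], name="AREA")

# Assumed bin edges; with right=False each bin includes its lower edge
# and excludes its upper edge, so 68 falls in "68-102" and 102 does not.
bins = [0, 68, 102, 136, float("inf")]
labels = ["<68", "68-102", "102-136", ">136"]
area_band = pd.cut(areas, bins=bins, labels=labels, right=False)

# Fraction of listings in each band, in band order.
share = area_band.value_counts(normalize=True).sort_index()
print(share)
```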

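7. floor_num

[Figure: average unit price by floor_num]

The figure shows:

  1. The average unit price fluctuates considerably as the floor count changes, so it affects unit price.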

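Predicting prices with machine learning

The analysis above suggests that seven features affect price: district, number of rooms, school, number of floors, proximity to a subway station, area, and number of halls. These features are therefore used as inputs to the machine learning algorithms, which are fitted on training data and then output a predicted price.

First, one-hot encoding converts the categorical variables (district, number of rooms, number of halls) into numeric columns. school and subway are already 0/1 indicators and need no conversion, and the continuous variables are numeric as they stand.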
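
To make the encoding step concrete, here is a small sketch of one-hot encoding with `pd.get_dummies` on a made-up three-row frame. The rows are hypothetical; the column names match the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "district": ["南山", "福田", "南山"],
    "roomnum": [3, 2, 3],
    "AREA": [80.0, 60.0, 95.0],
})

# Each listed column is replaced by one 0/1 column per distinct value,
# named <column>_<value>; columns not listed (AREA) pass through unchanged.
encoded = pd.get_dummies(toy, columns=["district", "roomnum"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Passing `columns=` encodes several variables in one call. The code in this post encodes each column separately and concatenates the results, which yields the same dummy columns in a different order.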

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# One-hot encode the categorical columns; school and subway are already 0/1
Roomnum = pd.get_dummies(df['roomnum'], prefix='roomnum')
District = pd.get_dummies(df['district'], prefix='district')
Hall = pd.get_dummies(df['hall'], prefix='hall')

data_new = pd.concat([Roomnum, District, Hall, df], axis=1)
data_new = data_new.drop(columns=['district', 'hall', 'roomnum', 'C_floor', 'total_price'])

# Separate the features from the label
x = data_new.loc[:, data_new.columns != "per_price"]
fea_imp = list(x.columns)
y = data_new.loc[:, 'per_price']
# Hold out a random 30% as the test set and train on the rest
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=10, test_size=0.3)
# print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
# reshape(-1, 1) gives a single column with as many rows as needed
y_train = y_train.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)

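# Standardize the data
ss_x = StandardScaler()
ss_y = StandardScaler()
# fit_transform combines fit and transform: it learns the scaling parameters and applies them.
# transform only applies parameters that were already learned, so the test set
# is scaled with the training set's mean and standard deviation.
x_train = ss_x.fit_transform(x_train)
x_test = ss_x.transform(x_test)
# Keep the training mean and standard deviation of y for undoing the scaling later
mean_y = np.mean(y_train)
s_y = np.std(y_train)
y_train = ss_y.fit_transform(y_train)
y_test = ss_y.transform(y_test)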


# Linear regression
lr = LinearRegression()
# Support vector regression
svr = SVR(kernel="rbf")
# The number of boosting rounds is set by num_round below;
# xgb.train does not read an n_estimators entry from param
param = {'max_depth': 3,
         'learning_rate': 0.1,
         # 'reg:linear' is the deprecated name of this objective;
         # this default differs from XGBClassifier's
         'objective': 'reg:squarederror',
         'booster': 'gbtree',
         'gamma': 0,
         'min_child_weight': 1,
         'subsample': 1,
         'colsample_bytree': 1,
         'reg_alpha': 0,
         'reg_lambda': 1,
         'random_state': 2}
dtrain = xgb.DMatrix(x_train, label=y_train, feature_names=list(fea_imp))
dtest = xgb.DMatrix(x_test, label=y_test, feature_names=list(fea_imp))
num_round = 100
watchlist = [(dtrain, 'train'), (dtest, 'test')]

# Fit on the training split only; the test split is kept for evaluation
lr.fit(x_train, y_train)
svr.fit(x_train, y_train.ravel())
xg = xgb.train(param, dtrain, num_round, evals=watchlist, early_stopping_rounds=10)

out:


[0]	train-rmse:1.04640	test-rmse:1.04475
[1]	train-rmse:0.98451	test-rmse:0.98363
[2]	train-rmse:0.93073	test-rmse:0.93074
[3]	train-rmse:0.88439	test-rmse:0.88517
[4]	train-rmse:0.84244	test-rmse:0.84438
[5]	train-rmse:0.80710	test-rmse:0.81022
[6]	train-rmse:0.77707	test-rmse:0.78133
[7]	train-rmse:0.75142	test-rmse:0.75731
[8]	train-rmse:0.72903	test-rmse:0.73546
[9]	train-rmse:0.71017	test-rmse:0.71765
[10]	train-rmse:0.69146	test-rmse:0.69960
[11]	train-rmse:0.67692	test-rmse:0.68635
[12]	train-rmse:0.66298	test-rmse:0.67298
[13]	train-rmse:0.65210	test-rmse:0.66320
[14]	train-rmse:0.64173	test-rmse:0.65345
[15]	train-rmse:0.63378	test-rmse:0.64664
[16]	train-rmse:0.62579	test-rmse:0.63936
[17]	train-rmse:0.61993	test-rmse:0.63410
[18]	train-rmse:0.61405	test-rmse:0.62849
[19]	train-rmse:0.60906	test-rmse:0.62385
[20]	train-rmse:0.60482	test-rmse:0.62023
[21]	train-rmse:0.60097	test-rmse:0.61680
[22]	train-rmse:0.59790	test-rmse:0.61411
[23]	train-rmse:0.59470	test-rmse:0.61117
[24]	train-rmse:0.59193	test-rmse:0.60857
[25]	train-rmse:0.58939	test-rmse:0.60670
[26]	train-rmse:0.58688	test-rmse:0.60459
[27]	train-rmse:0.58516	test-rmse:0.60302
[28]	train-rmse:0.58327	test-rmse:0.60164
[29]	train-rmse:0.58163	test-rmse:0.60073
[30]	train-rmse:0.58036	test-rmse:0.59952
[31]	train-rmse:0.57869	test-rmse:0.59794
[32]	train-rmse:0.57696	test-rmse:0.59644
[33]	train-rmse:0.57596	test-rmse:0.59559
[34]	train-rmse:0.57463	test-rmse:0.59455
[35]	train-rmse:0.57330	test-rmse:0.59346
[36]	train-rmse:0.57212	test-rmse:0.59236
[37]	train-rmse:0.57119	test-rmse:0.59174
[38]	train-rmse:0.57041	test-rmse:0.59106
[39]	train-rmse:0.56929	test-rmse:0.59012
[40]	train-rmse:0.56834	test-rmse:0.58927
[41]	train-rmse:0.56733	test-rmse:0.58862
[42]	train-rmse:0.56628	test-rmse:0.58777
[43]	train-rmse:0.56491	test-rmse:0.58683
[44]	train-rmse:0.56391	test-rmse:0.58618
[45]	train-rmse:0.56332	test-rmse:0.58569
[46]	train-rmse:0.56253	test-rmse:0.58515
[47]	train-rmse:0.56154	test-rmse:0.58434
[48]	train-rmse:0.56082	test-rmse:0.58371
[49]	train-rmse:0.56017	test-rmse:0.58302
[50]	train-rmse:0.55971	test-rmse:0.58265
[51]	train-rmse:0.55917	test-rmse:0.58233
[52]	train-rmse:0.55806	test-rmse:0.58155
[53]	train-rmse:0.55745	test-rmse:0.58125
[54]	train-rmse:0.55672	test-rmse:0.58080
[55]	train-rmse:0.55569	test-rmse:0.57971
[56]	train-rmse:0.55514	test-rmse:0.57927
[57]	train-rmse:0.55455	test-rmse:0.57894
[58]	train-rmse:0.55408	test-rmse:0.57853
[59]	train-rmse:0.55361	test-rmse:0.57818
[60]	train-rmse:0.55305	test-rmse:0.57795
[61]	train-rmse:0.55269	test-rmse:0.57773
[62]	train-rmse:0.55180	test-rmse:0.57714
[63]	train-rmse:0.55151	test-rmse:0.57698
[64]	train-rmse:0.55115	test-rmse:0.57673
[65]	train-rmse:0.55049	test-rmse:0.57638
[66]	train-rmse:0.54969	test-rmse:0.57585
[67]	train-rmse:0.54928	test-rmse:0.57555
[68]	train-rmse:0.54904	test-rmse:0.57539
[69]	train-rmse:0.54829	test-rmse:0.57457
[70]	train-rmse:0.54804	test-rmse:0.57442
[71]	train-rmse:0.54737	test-rmse:0.57405
[72]	train-rmse:0.54685	test-rmse:0.57380
[73]	train-rmse:0.54622	test-rmse:0.57343
[74]	train-rmse:0.54584	test-rmse:0.57330
[75]	train-rmse:0.54572	test-rmse:0.57320
[76]	train-rmse:0.54557	test-rmse:0.57312
[77]	train-rmse:0.54502	test-rmse:0.57257
[78]	train-rmse:0.54446	test-rmse:0.57215
[79]	train-rmse:0.54392	test-rmse:0.57191
[80]	train-rmse:0.54342	test-rmse:0.57153
[81]	train-rmse:0.54309	test-rmse:0.57132
[82]	train-rmse:0.54299	test-rmse:0.57130
[83]	train-rmse:0.54251	test-rmse:0.57103
[84]	train-rmse:0.54239	test-rmse:0.57095
[85]	train-rmse:0.54197	test-rmse:0.57077
[86]	train-rmse:0.54146	test-rmse:0.57042
[87]	train-rmse:0.54137	test-rmse:0.57035
[88]	train-rmse:0.54091	test-rmse:0.57010
[89]	train-rmse:0.54067	test-rmse:0.56994
[90]	train-rmse:0.54059	test-rmse:0.56993
[91]	train-rmse:0.54031	test-rmse:0.56971
[92]	train-rmse:0.54009	test-rmse:0.56960
[93]	train-rmse:0.53972	test-rmse:0.56913
[94]	train-rmse:0.53932	test-rmse:0.56885
[95]	train-rmse:0.53903	test-rmse:0.56882
[96]	train-rmse:0.53881	test-rmse:0.56873
[97]	train-rmse:0.53849	test-rmse:0.56849
[98]	train-rmse:0.53814	test-rmse:0.56835
[99]	train-rmse:0.53806	test-rmse:0.56828



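# r2_score expects (y_true, y_pred), in that order
print('Linear regression r2-score:', r2_score(y_test, lr.predict(x_test)))
print('SVR r2-score:', r2_score(y_test, svr.predict(x_test)))
print('xgboost r2-score:', r2_score(y_test, xg.predict(dtest)))

out:
(recorded from the original run, which fitted lr and svr on the test split and passed the predictions as the first argument of r2_score; rerun to obtain scores for the corrected code)
Linear regression r2-score: 0.3927518839179279
SVR r2-score: 0.47270739927209093
xgboost r2-score: 0.5097990013836289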
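
The order of the arguments to `r2_score` matters because the total sum of squares is computed from the true values. The sketch below uses a hand-written `r2` helper (a stand-in written for this example, not scikit-learn's implementation) and made-up numbers to show that swapping the arguments changes the score.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread of the true values
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 2.0, 3.0, 3.0])

correct = r2(y_true, y_pred)   # true values first
swapped = r2(y_pred, y_true)   # arguments reversed
print(correct, swapped)
```

Here the residuals are identical in both calls, but the predictions vary less than the true values, so the reversed call divides by a smaller total and reports a much worse score.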


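# A Booster returned by xgb.train has no feature_importances_ attribute
# (that belongs to the scikit-learn wrapper); get_fscore() returns how many
# times each feature is used in a split
im = pd.DataFrame({'importance': list(xg.get_fscore().values()), 'var': list(xg.get_fscore().keys())})
im = im.sort_values(by='importance', ascending=False)
print(im.head(10))


out: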


    importance            var
22       201.0           AREA
23       168.0      floor_num
24        36.0         school
9         34.0    district_南山
14        29.0    district_福田
25        28.0         subway
12        25.0    district_宝安
17        25.0    district_龙岗
10        20.0    district_坪山
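11        15.0  district_大鹏新区

# Plot the ten most important features, ranked by gain
# (the table above ranks by split count, so the order can differ)
xgb.plot_importance(xg, max_num_features=10, importance_type='gain')
plt.show()
# (3) Hypothetical scenario: predict for a new observation, x_new1
'''
Estimate the approximate cost of a home that is:
1. in Nanshan
2. has 3 rooms
3. has an area of about 80 square metres
4. near a subway station
5. in a school district
'''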
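# Borrow one already-encoded row per category so every dummy column is present.
# hall_3 and floor_num = 3 are additional assumptions not listed above.
room = Roomnum.loc[Roomnum['roomnum_3'] == 1].head(1).reset_index(drop=True)
dis = District.loc[District['district_南山'] == 1].head(1).reset_index(drop=True)
hal = Hall.loc[Hall['hall_3'] == 1].head(1).reset_index(drop=True)
x_new1 = pd.concat([room, dis, hal], axis=1)
# Append the remaining features in the same column order as the training data
x_new1['AREA'] = 80
x_new1['floor_num'] = 3
x_new1['school'] = 1
x_new1['subway'] = 1

# Scale with the scaler that was fitted on the training features
x_new1_scale = ss_x.transform(x_new1)
dtt = xgb.DMatrix(x_new1_scale, feature_names=list(fea_imp))
p = xg.predict(dtt)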

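# Undo the standardization of y with the scaler that was fitted on y_train
per_price = ss_y.inverse_transform(p.reshape(-1, 1)).ravel()
print("Price per square metre:", per_price)
print("Total price:", per_price * 80)


out:
(recorded from the original run, which rescaled with the variance of y_train rather than its standard deviation, so these values are overstated; rerun to obtain the corrected figures)

Price per square metre: [15.402145]
Total price: [1232.1716]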
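
Standardization computes z = (y - mean) / std, so undoing it needs the standard deviation: y = z * std + mean. The sketch below checks this on made-up numbers and shows that multiplying by the variance does not recover the original values. The array `y` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical target values for illustration only.
y = np.array([[2.0], [4.0], [6.0], [8.0]])

scaler = StandardScaler()
z = scaler.fit_transform(y)  # z = (y - mean) / std

by_std = z * np.std(y) + np.mean(y)     # correct manual inverse
by_var = z * np.var(y) + np.mean(y)     # wrong: uses the variance
restored = scaler.inverse_transform(z)  # the scaler's own inverse

print(by_std.ravel())
print(by_var.ravel())
print(restored.ravel())
```

Calling `inverse_transform` on the fitted scaler avoids keeping the mean and standard deviation in separate variables.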