Data preprocessing:
Reading the data:
import pandas as pd
data = pd.read_csv(r'C:\Users\Administrator\Desktop\insurance.csv', encoding='utf-8')
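Filtering the data:
# Remove noisy points (outliers) with a separate charges cap for each age band
data_1 = data.query('age<=40 & charges<=10000')  # age 40 or under, charges at most 10000
data_2 = data.query('age>40 & age<=50 & charges<=12000')  # age 41 to 50, charges at most 12000
data_3 = data.query('age>50 & charges<=17500')  # age over 50, charges at most 17500
# Merge the pieces. axis=0 stacks rows (aligned on column names); axis=1 places frames side by side (aligned on the row index)
new_data = pd.concat([data_1, data_2, data_3], axis=0)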
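The band-by-band filter above can be tried without the CSV file. The following is a minimal sketch on a made-up six-row table (the values are hypothetical, not taken from insurance.csv). It also passes `ignore_index=True`, which is not in the notes; it is added here only so the merged frame gets a fresh 0..n-1 index.

```python
import pandas as pd

# Hypothetical toy data standing in for insurance.csv (values are made up).
data = pd.DataFrame({
    "age":     [25, 38, 45, 48, 55, 60],
    "charges": [3000.0, 15000.0, 9000.0, 20000.0, 12000.0, 30000.0],
})

# The same three age-band filters as in the notes.
data_1 = data.query("age <= 40 and charges <= 10000")
data_2 = data.query("age > 40 and age <= 50 and charges <= 12000")
data_3 = data.query("age > 50 and charges <= 17500")

# Stack the pieces row-wise; ignore_index=True renumbers the rows from 0.
new_data = pd.concat([data_1, data_2, data_3], axis=0, ignore_index=True)
print(new_data)
```

One row per age band survives here, because the other row in each band exceeds that band's cap.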
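Filtering rows by content:
# Keep every row whose column labelled 4 equals the given class name.
# With integer column labels starting at 0 (e.g. a file read with header=None), label 4 is the fifth column.
x_1 = data[data[4] == 'Iris-setosa'].values
x_2 = data[data[4] == 'Iris-versicolor'].values
x_3 = data[data[4] == 'Iris-virginica'].values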
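Indexing with `data[4]` only works when the column labels are integers, which is what pandas produces when a file is read with `header=None`. A small sketch of that situation follows, assuming a three-row inline sample in the layout of the Iris data file instead of a real file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row sample with no header row.
csv_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# header=None makes pandas label the columns 0, 1, 2, 3, 4.
data = pd.read_csv(io.StringIO(csv_text), header=None)

mask = data[4] == "Iris-setosa"   # boolean Series, one entry per row
x_1 = data[mask].values           # NumPy array holding only the matching rows
print(x_1.shape)
```

If the file does have a header row, the same filter would use the column name instead of the integer label.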
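Selecting specific columns:
# Option 1: use only the first column as the feature
X = new_data.iloc[:, 0:1].values  # the slice 0:1 keeps X two-dimensional
y = new_data['charges'].values
# Option 2: use every column except charges as features
y = data['charges'].values
data_1 = data.drop(['charges'], axis=1)  # drop the charges column; axis=0 drops rows by label, axis=1 drops columns
X = data_1.values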
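The shapes of `X` and `y` matter later, because scikit-learn estimators expect a two-dimensional feature array. The sketch below, on a made-up three-row frame (the `bmi` column and all values are assumptions for illustration), shows what each selection style returns.

```python
import pandas as pd

# Hypothetical small frame; column names other than 'charges' are illustrative.
new_data = pd.DataFrame({
    "age":     [25, 45, 55],
    "bmi":     [22.0, 27.5, 31.0],
    "charges": [3000.0, 9000.0, 12000.0],
})

X_single = new_data.iloc[:, 0:1].values            # slice keeps 2-D: shape (3, 1)
X_flat = new_data.iloc[:, 0].values                # integer index gives 1-D: shape (3,)
X_all = new_data.drop(["charges"], axis=1).values  # all features: shape (3, 2)
y = new_data["charges"].values                     # target: shape (3,)

print(X_single.shape, X_flat.shape, X_all.shape, y.shape)
```

Note that `drop` returns a new frame, so `new_data` still contains the `charges` column afterwards.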
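Feature scaling:
import numpy as np
from sklearn.preprocessing import StandardScaler
# x_train, x_test and y_train are assumed to come from an earlier train/test split
sc_x = StandardScaler()  # standardise to zero mean and unit variance
x_train = sc_x.fit_transform(x_train)  # fit on the training set, then transform it
x_test = sc_x.transform(x_test)  # reuse the training-set statistics; do not refit on the test set
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))  # the scaler needs 2-D input; flatten back to 1-D afterwards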
··························· Model section ·························
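y_pred = regressor.predict(x_test)  # predict on the scaled test features
# Convert y_pred back to the original scale. inverse_transform expects a 2-D array, so reshape first and flatten afterwards
y_pred = np.ravel(sc_y.inverse_transform(np.asarray(y_pred).reshape(-1, 1)))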
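To see the scaling round trip end to end, here is a minimal sketch on made-up data. The notes do not say which model fills the model section, so `LinearRegression` is used purely as a stand-in assumption, and the `_s` suffixes for scaled arrays are naming choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up data where the target is exactly y = 3x + 5.
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = 3.0 * x_train.ravel() + 5.0
x_test = np.array([[5.0], [6.0]])

# Scale the features: fit on the training set only, reuse for the test set.
sc_x = StandardScaler()
x_train_s = sc_x.fit_transform(x_train)
x_test_s = sc_x.transform(x_test)

# Scale the target; the scaler needs a 2-D column, the model wants 1-D.
sc_y = StandardScaler()
y_train_s = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))

# Assumption: a plain linear model stands in for the unspecified model.
regressor = LinearRegression()
regressor.fit(x_train_s, y_train_s)

# Predict in scaled units, then map back to the original units.
y_pred_s = regressor.predict(x_test_s)
y_pred = np.ravel(sc_y.inverse_transform(y_pred_s.reshape(-1, 1)))
print(y_pred)
```

Because the toy target is exactly linear, the recovered predictions equal 3x + 5 for the test inputs up to floating-point error, which is a quick check that the inverse transform was applied to the right scaler.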