The Support Vector Machine (SVM) is a supervised machine-learning algorithm. It is fundamentally a binary classification model, usable both for classifying a discrete response variable and for predicting a continuous one. At its core it works with distances between observations; its learning strategy is margin maximization, and the boundary it searches for is the one that maximizes the margin between the classes, which is why it is also called a large-margin classifier.
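For reference, the optimization problem behind this learning strategy, in its standard soft-margin form (with slack variables \(\xi_i\) and the same penalty \(C\) that the sklearn classes below expose as their C parameter), is:

\[ \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i} \quad \text{s.t.}\quad y_{i}(w^{\top}x_{i}+b)\ \ge\ 1-\xi_{i},\quad \xi_{i}\ \ge\ 0,\ \ i=1,\dots,n. \]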
Because it can apply the so-called kernel trick to turn a nonlinear problem into a linear one, mapping a low-dimensional space in which the classes are not linearly separable into a higher-dimensional space in which they are, it often achieves higher predictive accuracy than other single classifiers (logistic regression, decision trees, naive Bayes, KNN, and so on).
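As a quick illustration, here is a minimal sketch (using scikit-learn's toy make_circles data, not part of the project below) in which a linear kernel fails on two concentric rings while the RBF kernel separates them by implicitly mapping the 2-D points into a higher-dimensional space:

# toy data: two concentric rings, not linearly separable in 2-D
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, 'accuracy: %.3f' % clf.score(X_test, y_test))
# the linear kernel stays near chance level, while the RBF kernel is close to 100%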
Advantages: with a linear kernel an SVM behaves much like logistic regression but is more robust, so in practice its greatest value lies in using nonlinear kernels to model nonlinear decision boundaries, and many kernels are available. It is highly robust (adding or removing samples that are not support vectors does not change the classifier), particularly in high-dimensional spaces; it can avoid the "curse of dimensionality" (the model's computational complexity does not grow with the dimensionality of the data); it generalizes well, which to some extent guards against overfitting; and its training avoids getting trapped in local optima.
Disadvantages: the SVM is a memory-intensive algorithm, and choosing the right kernel takes considerable skill; it is a poor fit for classification or prediction on large samples because of the heavy cost in time and resources; the model is sensitive to missing data and to the choice of kernel; an SVM is a black-box model (compared with, say, regression or decision trees), so its results are hard to interpret; and in current industry practice random forests often outperform SVMs.
I. Python parameter reference
1. Linearly separable case
Python offers implementations of both the linearly separable and the approximately linearly separable SVM: import the sklearn package and call the LinearSVC class from its svm submodule. Its signature and parameters are as follows:
LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)
penalty: omitted
loss: the loss function, either the hinge loss ('hinge') or the squared hinge loss ('squared_hinge'); the default is the latter.
dual: omitted
tol: omitted
C: the penalty coefficient on the slack variables in the objective function; the default is 1. The larger C is, the more heavily misclassified samples are penalized, so accuracy on the training set rises but generalization suffers, i.e. accuracy on test data falls. Conversely, a smaller C tolerates some misclassified training samples and generalizes better; when the training data are noisy this is usually the preferred setting, treating the misclassified training samples as noise. (The effect of C is shown in the sketch after this parameter list.)
multi_class: the classification strategy, str type, used when the response has more than two classes. The options are 'ovr' and 'crammer_singer', with 'ovr' the default. 'ovr' is the one-vs-rest (OvR) scheme mentioned earlier, while 'crammer_singer' optimizes a joint objective over all classes; it can be more accurate but takes considerably longer to run.
fit_intercept: omitted
intercept_scaling: omitted
class_weight: omitted
verbose: omitted
random_state: omitted
max_iter: omitted
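A minimal usage sketch (run on sklearn's built-in iris data, purely for illustration) showing the trade-off controlled by C described above:

# a larger C penalizes slack more heavily: training accuracy tends to rise
# while test accuracy may fall
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C, max_iter=10000).fit(X_train, y_train)
    print('C=%g  train=%.3f  test=%.3f'
          % (C, clf.score(X_train, y_train), clf.score(X_test, y_test)))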
2. Linearly inseparable case
In Python, import the sklearn package and call the SVC class from its svm submodule; its signature and parameters are as follows:
sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
C: omitted
kernel: the kernel function, 'rbf' by default; the options are 'linear', 'poly', 'rbf', 'sigmoid', and 'precomputed'.
linear: linear kernel; poly: polynomial kernel; rbf: radial basis function (Gaussian) kernel; sigmoid: sigmoid kernel; precomputed: a kernel matrix you supply. With 'precomputed' you compute the kernel (Gram) matrix yourself ahead of time; the algorithm then uses the matrix you pass in instead of evaluating a kernel function internally, and the matrix must be n*n (see the sketch after this parameter list).
degree: only meaningful for the polynomial kernel; the degree of the 'poly' kernel, 3 by default. Ignored for every other kernel.
gamma: the kernel coefficient for 'rbf', 'poly', and 'sigmoid', 'auto' by default; it has no effect for other kernels. With gamma='auto' the value is the reciprocal of the number of features, i.e. 1/n_features.
coef0: the constant term of the kernel function; only used by 'poly' and 'sigmoid' (the parameter c in those kernels).
probability: whether to enable probability estimates; bool, optional, False by default. It must be enabled before calling fit(), and it slows fit() down.
shrinking: whether to use the shrinking heuristic; True by default.
tol: omitted
cache_size: the kernel cache size, float, optional; specifies the memory used during training, in MB, 200 MB by default.
class_weight: omitted
verbose: omitted
max_iter: omitted
decision_function_shape: the shape of the decision function. 'ovo' (one vs one) yields shape [n_samples, n_classes*(n_classes-1)/2]; 'ovr' (one vs rest) yields shape [n_samples, n_classes]. The default is 'ovr' (for sklearn 0.18 and later).
random_state: omitted
The parameters most worth tuning are C, kernel, degree, gamma, and coef0.
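For the less common kernel='precomputed' option described above, a minimal sketch (again on the iris data, with a hand-computed linear kernel standing in for any custom kernel): fit() receives the n_train*n_train Gram matrix, and predict()/score() receive the n_test*n_train matrix of kernel values between test and training samples.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gram_train = X_train @ X_train.T  # kernel values between training samples
gram_test = X_test @ X_train.T    # kernel values between test and training samples

clf = SVC(kernel='precomputed').fit(gram_train, y_train)
print('accuracy: %.3f' % clf.score(gram_test, y_test))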
II. Project implementation
The complete code is as follows:
# -*- coding: utf-8 -*-
# silence warnings
# import warnings
# warnings.filterwarnings('ignore')
# Regression analysis of macro and tobacco data; computes the regression coefficients
import pandas as pd
import numpy as np
import random as rd
# from sklearn.learning_curve import learning_curve
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn import metrics as mt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import svm
from sklearn import preprocessing as prep
# from matplotlib import pyplot as plt
class SupportVectorMachine():
    # Data loading
    def data_read(self, data_path, file_name, typeid):
        '''
        Read the data.
        :param data_path: directory containing the file
        :param file_name: file name
        :param typeid: price segment
        :return: raw data for the price segment, and the same data with irrelevant variables removed
        '''
        data = pd.read_excel(data_path + file_name, index_col='pack_bar').dropna()  # drop records with missing values
        data1_type = data[data['typeid'].isin(typeid)]  # keep only records within the price segment
        # data1_type = data1_type[data1_type['ccom_id'].isin([11110001])]  # keep only records of a single city
        data_type = data1_type.drop(['typeid', 'month_double', 'ccom_id', 'net_month_double'], axis=1)  # drop irrelevant independent variables (optionally also 'net_date')
        return data1_type, data_type
    def outlier_filtrate(self, data_type, method='threshold', fill='nan', threshold=1):
        '''
        Outlier handling.
        :param data_type: raw data
        :param method: detection rule, {'std': '2-sigma rule', 'quantile': 'boxplot rule', 'threshold': 'fixed cutoff'}
        :param fill: replacement strategy, {'nan': 'set to NaN', 'extremum': 'clip to the nearest non-outlying extremum'}
        :param threshold: cutoff value; only used when method is 'threshold'
        :return: new data
        '''
        ncol = data_type.shape[1]
        colnames = data_type.columns
        colnames2 = list(filter(lambda x: x.find('_incr') > 0, colnames))  # only growth-rate columns are checked
        data2_type = data_type.copy()
        for i in range(ncol):
            datai = data2_type.iloc[:, i]
            # normal-distribution (2-sigma) outliers
            if method == 'std':
                xmean = datai.mean()
                xstd = datai.std()
                up = xmean + 2 * xstd
                dw = xmean - 2 * xstd
                if any(datai > up):
                    # upper-bound outliers exist
                    if fill == 'nan':
                        data2_type.iloc[(datai > up).values, i] = np.nan
                    else:
                        # clip to the largest non-outlying value
                        data2_type.iloc[(datai > up).values, i] = datai[datai < up].max()
                else:
                    print('no upper-bound outliers')
                if any(datai < dw):
                    # lower-bound outliers exist
                    if fill == 'nan':
                        data2_type.iloc[(datai < dw).values, i] = np.nan
                    else:
                        # clip to the smallest non-outlying value
                        data2_type.iloc[(datai < dw).values, i] = datai[datai > dw].min()
                else:
                    print('no lower-bound outliers')
            # boxplot (IQR) outliers
            if method == 'quantile':
                q1 = datai.quantile(0.25)
                q3 = datai.quantile(0.75)
                up = q3 + 1.5 * (q3 - q1)
                dw = q1 - 1.5 * (q3 - q1)
                if any(datai > up):
                    # upper-bound outliers exist
                    if fill == 'nan':
                        data2_type.iloc[(datai > up).values, i] = np.nan
                    else:
                        # clip to the largest non-outlying value
                        data2_type.iloc[(datai > up).values, i] = datai[datai < up].max()
                else:
                    print('no upper-bound outliers')
                if any(datai < dw):
                    # lower-bound outliers exist
                    if fill == 'nan':
                        data2_type.iloc[(datai < dw).values, i] = np.nan
                    else:
                        # clip to the smallest non-outlying value
                        data2_type.iloc[(datai < dw).values, i] = datai[datai > dw].min()
                else:
                    print('no lower-bound outliers')
            # fixed-threshold outliers
            if method == 'threshold':
                # only the growth-rate columns are checked against the cutoff
                if colnames[i] in colnames2:
                    up = threshold
                    dw = (-1.0) * threshold
                    if any(datai > up):
                        # upper-bound outliers exist
                        if fill == 'nan':
                            data2_type.iloc[(datai > up).values, i] = np.nan
                        else:
                            data2_type.iloc[(datai > up).values, i] = up
                    else:
                        print('no upper-bound outliers')
                    if any(datai < dw):
                        # lower-bound outliers exist
                        if fill == 'nan':
                            data2_type.iloc[(datai < dw).values, i] = np.nan
                        else:
                            data2_type.iloc[(datai < dw).values, i] = dw
                    else:
                        print('no lower-bound outliers')
        # temp = abs(data2_type[colnames2]) <= threshold  # flag normal values
        # lab = temp.apply(lambda x: x.min(), axis=1)  # a row is False as soon as it contains one outlier
        data2_type = data2_type.dropna()  # drop the records flagged as NaN (e.g. growth rates beyond the cutoff)
        return data2_type
    def corr_filtrate(self, data_type, thred_corr=0.5):
        '''
        Filter variables by a correlation threshold.
        :param data_type: original data
        :param thred_corr: correlation threshold
        :return: new data
        '''
        corrX = data_type.corr()
        colnames = data_type.columns
        colnames3 = list()
        for j in range(corrX.shape[1] - 1):  # for every pair correlated above thred_corr, mark one variable for removal
            for i in range(j + 1, corrX.shape[0] - 1):
                if abs(corrX.iloc[i, j]) >= thred_corr:
                    if np.mean(corrX.iloc[i, :]) < np.mean(corrX.iloc[:, j]):  # drop whichever has the larger mean correlation with the rest
                        colnames3.append(colnames[j])
                    else:
                        colnames3.append(colnames[i])
                    break
        colnames4 = colnames.drop(list(set(colnames3)))
        data2_type = data_type[colnames4]
        return data2_type
    def vif_filtrate(self, data2_type, thred_vif=4):
        '''
        Filter variables by a variance-inflation-factor (VIF) threshold.
        :param data2_type: original data
        :param thred_vif: VIF threshold
        :return: new data
        '''
        vif = [round(variance_inflation_factor(data2_type.values, i), 2) for i in range(data2_type.shape[1])]  # collinearity check
        data3_type = data2_type.copy()
        while sum(list(map(lambda x: x >= thred_vif, vif))) > 0:
            colnames = data3_type.columns[:-1]
            for i in range(len(vif) - 1):  # drop the first variable whose VIF exceeds the threshold, then re-check
                if vif[i] >= thred_vif:
                    data3_type = data3_type.drop(colnames[i], axis=1)
                    vif = [round(variance_inflation_factor(data3_type.values, i), 2) for i in
                           range(data3_type.shape[1])]  # collinearity check
                    break
        return data3_type
    def data_scale(self, data3_type, method='minmax'):
        '''
        Standardize (or normalize) the data.
        :param data3_type: original data
        :param method: scaling method, {'minmax': '0-1 scaling',
                                        'z-score': 'standard-normal scaling',
                                        'normalize': 'unit-norm scaling',
                                        'maxabs': 'scale by the maximum absolute value, keeping the sign',
                                        'robust': 'scale by the interquartile range'}
        :return: new data
        '''
        if method == 'minmax':
            # 0-1 scaling
            data_minmax = prep.minmax_scale(data3_type, feature_range=(0, 1), axis=0, copy=True)
            data_scale = pd.DataFrame(data=data_minmax, columns=data3_type.columns, index=data3_type.index)
        elif method == 'z-score':
            # z-score scaling
            data_zs = prep.scale(data3_type, axis=0, with_mean=True, with_std=True, copy=True)
            data_scale = pd.DataFrame(data=data_zs, columns=data3_type.columns, index=data3_type.index)
        elif method == 'normalize':
            # scale each sample to unit norm
            data_norm = prep.normalize(data3_type, norm='l2', axis=1)
            data_scale = pd.DataFrame(data=data_norm, columns=data3_type.columns, index=data3_type.index)
        elif method == 'maxabs':
            # scale by the maximum absolute value, keeping the sign, i.e. into [-1, 1];
            # the only scaler usable on scipy.sparse data
            data_ma = prep.maxabs_scale(data3_type, axis=0, copy=True)
            data_scale = pd.DataFrame(data=data_ma, columns=data3_type.columns, index=data3_type.index)
        elif method == 'robust':
            # scale via the interquartile range (IQR), i.e. between the first and third quartiles
            data_rb = prep.robust_scale(data3_type, axis=0, with_centering=True, with_scaling=True, copy=True)
            data_scale = pd.DataFrame(data=data_rb, columns=data3_type.columns, index=data3_type.index)
        else:
            raise ValueError('unknown scaling method: %s' % method)
        data4_type = data_scale
        return data4_type
    def data_factor(self, data4_type, replace='dependent', threshold=0.1, colnames=None):
        '''
        Binarize the data: values greater than threshold become 1, the rest become 0.
        :param data4_type: original data
        :param replace: which columns to binarize, {'all': 'every column', 'dependent': 'the dependent variable', 'colnames': 'columns named explicitly'}
        :param threshold: binarization cutoff
        :param colnames: list of column names; only used when replace == 'colnames'
        :return: new data
        '''
        data5_type = data4_type.copy()
        nrow = data5_type.shape[0]
        if replace == 'all':
            # binarize every variable
            data_binary = prep.binarize(data4_type, threshold=threshold,
                                        copy=True)  # values <= threshold become 0, the rest 1
            data_new = pd.DataFrame(data=data_binary, columns=data5_type.columns, index=data5_type.index)
            data5_type = data_new
        elif replace == 'dependent':
            # binarize the dependent variable (assumed to be the last column)
            for i in range(nrow):
                value = 1 if data5_type.iloc[i, -1] > threshold else 0
                data5_type.iloc[i, -1] = value
        elif replace == 'colnames':
            # binarize the named variables
            temp = data5_type[colnames]
            if len(colnames) > 1:
                data_binary = prep.binarize(temp, threshold=threshold,
                                            copy=True)  # values <= threshold become 0, the rest 1
                data5_type[colnames] = pd.DataFrame(data=data_binary, columns=temp.columns, index=temp.index)
            else:
                for i in range(nrow):
                    value = 1 if temp.values[i, 0] > threshold else 0
                    data5_type.loc[data5_type.index[i], colnames[0]] = value
        # show the class distribution after binarization
        print("Class distribution")
        print(data5_type.iloc[:, -1].value_counts())
        # # one-hot encoding (unused)
        # encoder = prep.OneHotEncoder()
        # X_OH = encoder.fit_transform(data3_type)
        # df = pd.DataFrame(X_OH.toarray())
        return data5_type
    # def plot_learning_curve(self, estimator, X, y, train_sizes=np.linspace(.05, 0.2, 5)):
    #     # check whether the model shows signs of overfitting
# plt.figure()
# train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=5, train_sizes=train_sizes)
# train_scores_mean = np.mean(train_scores, axis=1)
# train_scores_std = np.std(train_scores, axis=1)
# test_scores_mean = np.mean(test_scores, axis=1)
# test_scores_std = np.std(test_scores, axis=1)
#
# plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
# train_scores_mean + train_scores_std, alpha=0.1,
# color="r")
# plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
# test_scores_mean + test_scores_std, alpha=0.1, color="g")
# plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
# label="Training score")
# plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
# label="Cross-validation score")
#
# plt.xlabel("Training examples")
# plt.ylabel("Score")
# plt.legend(loc="best")
# ylim = (0.5, 1.01)
# if ylim:
# plt.ylim(ylim)
# plt.title('SVM')
# plt.show()
# return train_sizes
    # Linear SVM
    def linear_SVM(self, data5_type, rstate=1234):
        '''
        Linear SVM with a grid search over C.
        :param data5_type: data for the price segment
        :param rstate: random seed for the train/test split
        :return: fitted model, coefficient series and test-set accuracy
        '''
data_train, data_test = train_test_split(data5_type,test_size=0.3,random_state=rstate) #
col_names = data_train.columns
X = data_train[col_names[:-1]]
y = data_train[col_names[-1]]
# plot_learning_curve(svm.SVC(), X, y, train_sizes=np.linspace(.05, 0.2, 5))
        # # the model overfits, so enlarge the training sizes
# plot_learning_curve(svm.SVC(), X, y, train_sizes=np.linspace(.25, 1, 5))
#
# plot_learning_curve(svm.SVC(), X[col_names[[11,12]]], y, train_sizes=np.linspace(.25, 1, 5))
#
# plot_learning_curve(Pipeline([("fs", SelectKBest(f_classif, k=8)),("svc", svm.SVC())]), X, y, train_sizes=np.linspace(.25, 1, 5))
        # grid search for the best parameters
C = np.arange(0.001,0.01,0.002)
param = {'C':C} # 'kernel':kernel,
grid_svc = GridSearchCV(estimator=svm.LinearSVC(),
param_grid=param,
scoring='accuracy',cv=5,verbose=0)
grid_svc.fit(X, y)
# print(grid_svc.best_params_); print('; '); print(grid_svc.best_score_)
best_c = grid_svc.best_params_.get('C')
        # refit on the training set with the best C
estm = svm.LinearSVC(C=best_c)
estm.fit(X, y)
coef = estm.coef_
# print("Non-zero coefficients: %s" % np.nonzero(coef)[1])
coef_regression = pd.Series(index=['Intercept'] + X.columns.tolist(), data=[estm.intercept_[0]] + coef.tolist()[0])
# print("回归系数分别为:")
# print(coef_regression);print("\t")
X_test = data_test[col_names[:-1]]
y_test = data_test[col_names[-1]]
predictY = estm.predict(X_test)
acu = mt.accuracy_score(y_test, predictY)
        # print('Model prediction accuracy: %.2f%%' % (acu * 100))
# print('\t')
return estm, coef_regression,acu
    def forward_stepwise(self, function, data6_type):
        '''
        Forward stepwise variable selection for the SVM.
        :param function: classification function
        :param data6_type: data to analyze
        :return: data restricted to the selected variables
        '''
        colnames = data6_type.columns
        cor = data6_type.corr()[colnames[-1]][:-1]
        best_name = abs(cor).idxmax()  # start from the variable most correlated with the dependent variable
        # print(best_name)
        colnames2 = [best_name, colnames[-1]]
        data7_type = data6_type[colnames2]
        colnames = colnames.drop(best_name)
        estm, coef_regression, acu = function(data7_type)
        # print(colnames2)
        for name in colnames[:-1]:
            colnames2.insert(0, name)  # tentatively add the candidate variable
            # print(colnames2)
            temp = data6_type[colnames2]
            estm2, coef_regression2, acu2 = function(temp)
            # print(acu2); print("vs"); print(acu)
            if acu2 > acu:  # keep the variable only if accuracy improves
                data7_type = temp.copy()
                coef_regression = coef_regression2.copy()
                acu = acu2
            else:
                colnames2.remove(name)
        print("Regression coefficients:")
        print(coef_regression); print("\t")
        print('Model prediction accuracy: %.2f%%' % (acu * 100))
        print('\t')
        return data7_type
    def backward_stepwise(self, function, data6_type):
        '''
        Backward stepwise variable elimination for the SVM.
        :param function: classification function
        :param data6_type: data to analyze
        :return: data restricted to the surviving variables
        '''
        colnames = data6_type.columns[:-1]
        estm, coef_regression, acu = function(data6_type)
        data7_type = data6_type.copy()
        for name in colnames:
            temp = data7_type.drop(name, axis=1)  # tentatively remove the variable
            estm2, coef_regression2, acu2 = function(temp)
            # print(acu2); print("vs"); print(acu)
            if acu2 > acu:  # keep it removed only if accuracy improves
                data7_type = temp.copy()
                coef_regression = coef_regression2.copy()
                acu = acu2
        print("Regression coefficients:")
        print(coef_regression)
        print("\t")
        print('Model prediction accuracy: %.2f%%' % (acu * 100))
        print('\t')
        return data7_type
    def nonlinear_SVM(self, data6_type, rstate=1234):
        '''
        Nonlinear SVM with a grid search over kernel, C and gamma.
        :param data6_type: data to analyze
        :param rstate: random seed for the train/test split
        :return: fitted model and test-set accuracy
        '''
        data_train, data_test = train_test_split(data6_type, test_size=0.25, random_state=rstate)
        col_names = data_train.columns
        X = data_train[col_names[:-1]]
        y = data_train[col_names[-1]]
        # plot_learning_curve(svm.SVC(), X, y, train_sizes=np.linspace(.05, 0.2, 5))
        # # the model overfits, so enlarge the training sizes
        # plot_learning_curve(svm.SVC(), X, y, train_sizes=np.linspace(.25, 1, 5))
        #
        # plot_learning_curve(svm.SVC(), X[col_names[[11,12]]], y, train_sizes=np.linspace(.25, 1, 5))
        #
        # plot_learning_curve(Pipeline([("fs", SelectKBest(f_classif, k=8)), ("svc", svm.SVC())]), X, y, train_sizes=np.linspace(.25, 1, 5))
        # grid search for the best kernel, C and gamma
        kernel = ['rbf', 'linear', 'poly', 'sigmoid']
        C = np.arange(0.001, 0.01, 0.003)
        gamma = np.arange(0.001, 0.01, 0.003)
        param = {'kernel': kernel, 'C': C, 'gamma': gamma}
grid_svc = GridSearchCV(estimator=svm.SVC(),
param_grid=param,
scoring='accuracy',cv=5,verbose=1)
grid_svc.fit(X, y)
# print(grid_svc.best_params_); print('; '); print(grid_svc.best_score_)
best_c = grid_svc.best_params_.get('C')
best_kernel = grid_svc.best_params_.get('kernel')
best_gamma = grid_svc.best_params_.get('gamma')
estm = svm.SVC(C=best_c,kernel=best_kernel,gamma=best_gamma)
estm.fit(X, y)
X_test = data_test[col_names[:-1]]
y_test = data_test[col_names[-1]]
predictY = estm.predict(X_test)
acu = mt.accuracy_score(y_test, predictY)
        print('Model prediction accuracy: %.2f%%' % (acu * 100))
        print('\t')
return estm, acu
    def data_predict(self, data, colnames, estm):
        '''
        Predict with a fitted model.
        :param data: data for the price segment
        :param colnames: names of the model's input variables
        :param estm: the fitted model used for prediction
        :return: index of the cities predicted to grow
        '''
        # use the model to pick the cities to prioritize
        data_new = data[data['year1'] == 2016]
        data_new2 = data_new.drop(['year1', 'year', 'type_id'], axis=1)  # drop irrelevant variables
        X = data_new2[colnames]
        # data_g = self.data_group_byType(data_new2)
        predictY = estm.predict(X)
        result = pd.Series(index=X.index.tolist(), data=predictY.tolist())  # predicted growth per city
        incr_ccom = result[result == 1].index  # the labels are numeric 0/1
        return incr_ccom
if __name__ == '__main__':
    # file location
    data_path = 'C:\\Users\\90539\\PycharmProjects\\data\\'
    file_name = 'data.xlsx'
    typeid = ['B']
    obj2 = SupportVectorMachine()  # instantiate the class
    data, data1 = obj2.data_read(data_path, file_name, typeid)
    data2 = obj2.outlier_filtrate(data1, method='std', fill='nan')
    data3 = obj2.corr_filtrate(data2, thred_corr=0.4)
    data4 = obj2.vif_filtrate(data3, thred_vif=4)
    data5 = obj2.data_scale(data4, method='normalize')
    data6 = obj2.data_factor(data5, replace='dependent', threshold=0.05)
    rstate = rd.randint(9, 999)
    best_rstate = rstate
    estm, coef, acu = obj2.linear_SVM(data6, rstate)  # linear SVM
    i = 0
    j = 0
    while i < 100 and acu < 0.8:  # retry new random splits until the accuracy target is reached
        i += 1
        rstate = rd.randint(9, 999)
        estm2, coef2, acu2 = obj2.linear_SVM(data6, rstate)
        if acu2 > acu:
            j = i
            best_rstate = rstate
            estm = estm2
            coef = coef2
            acu = acu2
    print("Seed: " + str(best_rstate))
    print("Best result found on run " + str(j + 1) + ":" + "\t")
    print('Model prediction accuracy: %.2f%%' % (acu * 100))
    print("Linear SVM regression coefficients:")
    print(coef); print("\t")
    # print("Nonlinear SVM:")
    # estm2, acu2 = obj2.nonlinear_SVM(data6)  # nonlinear SVM
    # print("Forward stepwise linear SVM:")
    # data7 = obj2.forward_stepwise(obj2.linear_SVM, data6)
    # print("Backward stepwise linear SVM:")
    # data7 = obj2.backward_stepwise(obj2.linear_SVM, data6)