R语言将某项定量变量变为定性变量 r语言变量重要性

转载

mob64ca13faa4e6 2024-06-29 16:23:34

文章标签 R语言将某项定量变量变为定性变量线性回归 r语言加载拟合 文章分类 R语言后端开发

线性回归分及变量的相对重要性

作者：社会学的冯同学

简介：本文首先用一些数据集对R语言中关于线性回归的知识进行了回顾，然后介绍了如何运用R语言进行变量的相对重要性分析

先加载所需要的包

library(haven)
library(tidyverse)
library(texreg)#更好的模型输出
library(zoo)#加载lmtest包之前要加载
library(lmtest)#进行似然比检验
library(patchwork)#合成图形
library(ShapleyValue)#计算shapley值
library(DALEX)#可视化shapley值

读入数据

GPA <- read_dta("GPA.dta")
isei <- read_dta("isei.dta")
nlsw88 <- read_dta("nlsw88.dta")

例1：学生成绩与其室友成绩的关系

绘制散点图

#生成变量室友平均绩点
gpa <- GPA %>% 
  group_by(dorm) %>% 
  mutate(sumgpa = sum(GPA),
         summate = n(),
         mmgpa = (sumgpa - GPA)/(summate - 1) ) %>% 
  na.omit()
#用点图表示室友平均绩点和个人绩点的关系
ggplot(gpa, aes(GPA, mmgpa))+
  geom_point()

R语言将某项定量变量变为定性变量 r语言变量重要性_线性回归

拟合模型

fit.gpa <- lm(GPA ~ mmgpa, data = gpa)
screenreg(fit.gpa)

## 
## =======================
##              Model 1   
## -----------------------
## (Intercept)    1.96 ***
##               (0.36)   
## mmgpa          0.39 ***
##               (0.11)   
## -----------------------
## R^2            0.11    
## Adj. R^2       0.10    
## Num. obs.    109       
## =======================
## *** p < 0.001; ** p < 0.01; * p < 0.05

ggplot(gpa, aes(GPA, mmgpa))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula = 'y ~ x'

R语言将某项定量变量变为定性变量 r语言变量重要性_加载_02

例2：个体社会经济地位的影响因素

拟合不同的模型说明变量之间的关系

fit.isei.1 <- lm(isei_p ~ isei_f + isei_m, data = isei)
isei <-  isei %>% 
  mutate(lnincome = log(income + 1))
fit.isei.2 <- lm(isei_p ~ age + gender + edu + lnincome, data = isei)
fit.isei.3 <- lm(isei_p ~ 
                   isei_f + isei_m + age + gender + edu + lnincome,
                 data = isei)
screenreg(list(fit.isei.1, fit.isei.2, fit.isei.3))

## 
## ==================================================
##              Model 1      Model 2      Model 3    
## --------------------------------------------------
## (Intercept)    29.97 ***    -6.94 *      -6.44 *  
##                (0.74)       (3.15)       (3.13)   
## isei_f          0.19 ***                  0.09 ***
##                (0.02)                    (0.02)   
## isei_m          0.18 ***                 -0.03    
##                (0.03)                    (0.02)   
## age                          0.11 ***     0.09 ***
##                             (0.02)       (0.02)   
## gender                      -2.77 ***    -2.59 ***
##                             (0.48)       (0.48)   
## edu                          2.09 ***     2.01 ***
##                             (0.07)       (0.07)   
## lnincome                     2.23 ***     2.09 ***
##                             (0.31)       (0.31)   
## --------------------------------------------------
## R^2             0.08         0.32         0.32    
## Adj. R^2        0.08         0.31         0.32    
## Num. obs.    3051         3051         3051       
## ==================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

对于嵌套模型进行似然比检验

lrtest(fit.isei.3, fit.isei.2)

## Likelihood ratio test
## 
## Model 1: isei_p ~ isei_f + isei_m + age + gender + edu + lnincome
## Model 2: isei_p ~ age + gender + edu + lnincome
##   #Df LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -12019                         
## 2   6 -12033 -2 28.332  7.044e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

从结果中可以看出哪些模型p<0.05，所以选非限制模型（大模型）

赤池信息准则AIC，贝叶斯信息准则BIC，可用于非嵌套模型检验，越小越好

AIC(fit.isei.3)

## [1] 24054.38

BIC(fit.isei.3)

## [1] 24102.56

对数变换

isei <- isei %>% 
  mutate(ln_income = log(income),
         ln_income2 = log(income + 1))
p1 <- ggplot(isei, aes(income))+
  geom_histogram()
p2 <- ggplot(isei, aes(ln_income))+
  geom_histogram()
p3 <- ggplot(isei, aes(ln_income2))+
  geom_histogram()
p1 + p2 + p3

R语言将某项定量变量变为定性变量 r语言变量重要性_拟合_03

处理离群值

对nlsw88数据集中的wage变量进行可视化发现离群值

ggplot(nlsw88, aes(wage))+
  geom_histogram()

R语言将某项定量变量变为定性变量 r语言变量重要性_线性回归_04

比较在拟合模型时去掉这些离群值的结果和不去掉这些离群值的结果

fit1 <- lm(wage ~ age + ttl_exp + collgrad, data = nlsw88)
nlsw <- nlsw88 %>% 
  filter(wage < 20)
fit2 <- lm(wage ~ age + ttl_exp + collgrad, data = nlsw)
screenreg(list(fit1, fit2))

## 
## =====================================
##              Model 1      Model 2    
## -------------------------------------
## (Intercept)     7.93 ***     4.46 ***
##                (1.46)       (0.83)   
## age            -0.12 **     -0.04 *  
##                (0.04)       (0.02)   
## ttl_exp         0.31 ***     0.28 ***
##                (0.02)       (0.01)   
## collgrad        3.24 ***     2.76 ***
##                (0.27)       (0.15)   
## -------------------------------------
## R^2             0.13         0.27    
## Adj. R^2        0.13         0.27    
## Num. obs.    2246         2174       
## =====================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

缩尾和截尾

缩尾是指：把将变量中小于其 2.5 百分位的数值替换为其 2.5 百分位数值；
将变量中大于其 97.5 百分位的数值替换为其 97.5 百分位数值

首先用箱线图对wage变量的分布进行检测

R语言将某项定量变量变为定性变量 r语言变量重要性_加载_05

boxplot(nlsw88$wage)

有很多大于1.5倍75%分位数的数据，所以要对wage变量进行缩尾和截尾处理

缩尾

```r
x <- nlsw88$wage
winsor <- quantile(x, probs = c(0.025, 0.975), na.rm = T)
winsor

##      2.5%     97.5% 
##  2.509832 24.579301

x[x < winsor[1]] <- winsor[1]
x[x > winsor[2]] <- winsor[2]
nlsw88$wage <- x
range(nlsw88$wage)

## [1]  2.509832 24.579301

#可视化
ggplot(nlsw88, aes(wage))+
  geom_histogram()

R语言将某项定量变量变为定性变量 r语言变量重要性_拟合_06

从图中可以看出来，缩尾后，对应于之前变量的2.5和97.5百分位上的数值变多了，这是由原来超过该范围的"离群值"转换而来的。但wage变量的原始数据似乎只在右侧存在离群值,在左侧并不存在离群值，所以可以只进行单侧缩尾。

nlsw88 <- read_dta("nlsw88.dta")
x <- nlsw88$wage
winsor <- quantile(x, probs = 0.975, na.rm = T)
winsor[1]

##   97.5% 
## 24.5793

x[x > winsor[1]] <- winsor[1]
nlsw88$wage <- x
range(nlsw88$wage)

## [1]  1.004952 24.579301

ggplot(nlsw88, aes(wage))+
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

R语言将某项定量变量变为定性变量 r语言变量重要性_R语言将某项定量变量变为定性变量_07

截尾

nlsw88 <- read_dta("nlsw88.dta")
nlsw88 <- nlsw88 %>% 
  filter(wage > 2.509832 & wage < 24.507090)
ggplot(data = nlsw88, aes(wage))+
  geom_histogram()

R语言将某项定量变量变为定性变量 r语言变量重要性_线性回归_08

右侧截尾

nlsw88 <- read_dta("nlsw88.dta")
nlsw88 <- nlsw88 %>% 
  filter(wage < 24.507090)
ggplot(nlsw88, aes(wage))+
  geom_histogram()

R语言将某项定量变量变为定性变量 r语言变量重要性_线性回归_09

相对重要性分解

shaply值分解法

nlsw88 <- read_dta("nlsw88.dta")
#计算shapley值
y <- nlsw$wage
x <- as.data.frame(nlsw[,2:4])
shapleyvalue(y,x)

##                               age  race married
## Shapley Value              0.0000 0.008  0.0007
## Standardized Shapley Value 0.0036 0.912  0.0844

可视化shapley值

nlsw88 <- read_dta("nlsw88.dta")
fit <- lm(wage ~ collgrad + ttl_exp + hours, data = nlsw88)
nlsw <- nlsw88 %>% 
  select(wage, ttl_exp, hours, collgrad)
ranger_exp <- explain(fit,
                      data = nlsw[,-1],
                      y = nlsw[,1])

## Preparation of a new explainer is initiated
##   -> model label       :  lm  (  default  )
##   -> data              :  2246  rows  3  cols 
##   -> data              :  tibble converted into a data.frame 
##   -> target variable   :  Argument 'y' was a data frame. Converted to a vector. (  WARNING  )
##   -> target variable   :  2246  values 
##   -> predict function  :  yhat.lm  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package stats , ver. 4.2.2 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  NA , mean =  NA , max =  NA  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  NA , mean =  NA , max =  NA  
##   A new explainer has been created!

fifa_vi <- model_parts(ranger_exp)
plot(fifa_vi, show_boxplots = F)

R语言将某项定量变量变为定性变量 r语言变量重要性_拟合_10

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：redis哨兵为什么会发生主从切换 redis 哨兵主从切换

下一篇：java 实体类的字段设置时区 java时间属性

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯