📋文章目录
- 原图
- 复现
- 准备数据集及数据处理
- 构建不同分类随机森林模型的并行计算
- 绘制随机森林变量重要性柱状图
- 计算数据集的相关性
- 热图可视化
- 合并随机森林重要性和热图
- 附上所有代码
在文献中,我们经常遇到随机森林和相关性热图的组合图片(下图),它由一幅叠加变量重要性圆圈的相关性热图和一幅说明因变量被解释程度的条形图组成。今天,我们将试着用自己的数据在R里面去复现这类图。
原图
先看看所需复现的随机森林变量重要性+相关性热图:
这类图片由两部分组成。其中容易理解的就是下面的相关性热图,它是核心微生物(y轴)与土壤养分变量(x轴)的相关性分析。剩下的圆圈和上面的条形图都可以归类到随机森林分析部分。它将核心微生物作为自变量,分别将各个土壤养分变量作为因变量,进行随机森林分析。每一次随机森林分析得到的变量重要性指数则映射为热图里的圆圈大小(只收集p<0.05的变量的重要性),而模型对目标土壤养分变量的解释程度则作为上面条形图柱子的高度。接下来,我们来模仿复现这张图。
复现
准备数据集及数据处理
rm(list = ls())
# 在R中绘制随机森林模型中变量重要性和相关性的热图
library(vegan)
library(randomForest)
library(rfUtilities)
library(rfPermute)
library(ggplot2)
library(tidyverse)
library(patchwork)
library(export)
library(reshape2)
Micro <- openxlsx::read.xlsx("OTU1.xlsx", 1)
ENV <- openxlsx::read.xlsx("ENV1.xlsx", 1);head(ENV)
# Sample Ecosystem SOC pH CEC TP TK AK TN NO3 NH4 DOC
# 1 A1 Grassland 27.05671 7.904906 15.323426 0.9808826 14.75433 180.5585 2.1198671 125.335557 0.3171696 118.5123
# 2 A2 Grassland 25.76547 8.573666 18.104304 0.8527650 14.69093 166.8692 2.4470941 107.296478 2.4062394 124.6120
# 3 A3 Forest 12.33270 8.277955 8.297831 0.4298910 12.94322 121.5713 0.7090047 6.961546 2.1797286 129.7396
# 4 A4 Forest 16.63684 7.150481 15.632651 0.3711634 13.53909 194.0584 1.0868895 23.967231 29.2321159 129.3390
# 5 A5 Cropland 131.72950 7.548969 30.291618 0.6691908 14.19519 144.4154 5.4975148 38.032274 17.6463368 106.5602
# 6 A6 Cropland 123.75436 7.840254 34.357811 0.5828902 14.51699 387.8284 7.9689639 59.413622 12.7478292 105.8368
# MAT SD Pl
# 1 2.0162632 1.855115 274.0981
# 2 2.1286678 2.370189 369.1070
# 3 3.5848455 2.037255 263.5001
# 4 2.6846039 2.026378 346.7367
# 5 0.5787488 2.380261 333.7921
# 6 0.6756787 1.590068 240.1557
# 计算香浓指数,并把它合并在环境变量数据中
ENV$shannon <- diversity(t(Micro), index = "shannon")
# 生成一个总的与先前的不同生态系统合并成一个新的数据集
ENV_total <- ENV
ENV_total$Ecosystem <- "Whole"
ENV_new <- rbind(ENV_total, ENV)
# 确定不同生态系统类型的数量
unique(ENV_new$Ecosystem)
# [1] "Whole" "Grassland" "Forest" "Cropland" "Wetland" "Desert"
自变量数据是一组虚拟的土壤理化指标数据,它来自5个生态系统:草地、森林、农田、湿地、荒漠,和一个总体的。随后将采样点微生物的香浓指数合并到环境变量中。我们计划将数据按生态系统包括总体共6类,计算每一类生态系统中土壤环境变量对微生物香浓指数的潜在贡献。这里的思路是通过并行运算每类的随机森林模型,提高运行速度。
构建不同分类随机森林模型的并行计算
# 构建自定义函数进行并行运算
library(foreach)
library(doParallel)
# 设置核心数
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
registerDoParallel(cl)
# 获取生态系统列表
ecosystems <- unique(ENV_new$Ecosystem)
# 设置输出结果名称
options <- list()
options$results <- ecosystems
# 定义一个函数,对每个生态系统运行随机森林分析
run_single_rf <- function(data, eco) {
if (!"Ecosystem" %in% names(data)) {
stop("Error: 'Ecosystem' variable not found in data.")
}
if (!(eco %in% unique(data$Ecosystem))) {
stop("Error: Invalid Ecosystem value.")
}
# 对当前生态系统的数据进行子集
eco_data <- subset(data, Ecosystem == eco)[,-1:-2]
# 设置种子
set.seed(1)
# 进行随机森林分析和置换检验
RF <- randomForest(shannon~., eco_data, importance = T)
RF_r2 <- rf.significance(RF, eco_data, nperm=99, ntree=501)$Rsquare
RF_permuted <- rfPermute(shannon~., eco_data, nperm=99, ntree=501)
# 返回结果
list(ecosystem = eco, RF = RF, RF_r2 = RF_r2, RF_permuted = RF_permuted)
}
results <- foreach(i = ecosystems, .combine = rbind) %dopar% {
library(randomForest)
library(rfUtilities)
library(rfPermute)
# 运行单个随机森林分析
run_single_rf(data = ENV_new, eco = i)
}
# 停止并行处理
stopCluster(cl)
registerDoSEQ()
# 确认结果
head(results)
# ecosystem RF RF_r2 RF_permuted
# result.1 "Whole" randomForest.formula,18 0.5722844 rfPermute,6
# result.2 "Grassland" randomForest.formula,18 0.6213995 rfPermute,6
# result.3 "Forest" randomForest.formula,18 0.4347413 rfPermute,6
# result.4 "Cropland" randomForest.formula,18 0.4695682 rfPermute,6
# result.5 "Wetland" randomForest.formula,18 0.1822965 rfPermute,6
# result.6 "Desert" randomForest.formula,18 0.5186782 rfPermute,6
results[, "RF_permuted"][[1]]
# An rfPermute model
#
# Type of random forest: regression
# Number of trees: 501
# No. of variables tried at each split: 4
# No. of permutation replicates: 100
# Start time: 2023-04-18 22:56:38
# End time: 2023-04-18 22:57:05
# Run time: 26.8 secs
这个results结果看起来想data.frame,实际上是一个Large list。
使用随机森林计算整体数据中环境生物因子对微生物香浓指数的重要性。
- 这里的%Var explained就是我们画柱状图时柱子的高度,大家可以抄录下来。
- importance函数调取了每个变量对shannon指数的重要性,用它们定义热图中空心圆的大小,但我们只需要调取p值小于0.05的变量。(将在后文统一调用)
绘制随机森林变量重要性柱状图
# 绘制柱状图
names <- c("ecosystem", "RF_r2")
bar <- as.data.frame(results[,names])
bar$ecosystem <- unlist(bar$ecosystem)
bar$RF_r2 <- unlist(bar$RF_r2)
str(bar)
# 'data.frame': 6 obs. of 2 variables:
# $ ecosystem: chr "Whole" "Grassland" "Forest" "Cropland" ...
# $ RF_r2 : num 0.572 0.621 0.435 0.47 0.182 ...
colnames(bar) <- c("variable", "Exp")
bar$variable<- factor(bar$variable)
barplot<- ggplot(bar, aes(variable, Exp))+
geom_bar(stat = "identity", fill = "steelblue")+
scale_y_continuous(expand = c(0,0))+
theme_classic(base_line_size = 0.75)+
theme(panel.background = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_text(color = "black", size = 12))+
labs(title = "Explained variation (%)", size = 12)
barplot
六组随机森林分析的解释量条形图已经完成,我们可以计算并绘制相关性热图部分。
计算数据集的相关性
使用cor()
函数直接计算矩阵的相关性,相关矩阵的第14行,1到13列正是我们需要的香浓指数与13个环境因子的相关性分析结果。把它整理在一个表格内,并整理成绘图用的tidy数据,并对环境因子进行排序
注: cor()默认使用Pearson算法,如果对数据正态性不自信的话,建议修改为Spearman算法。
# 打包6个分类的随机森林模型结果
rf_list <- results[, "RF_permuted"]
# 修改6个分类模型的名称,改为对应的分类
names(rf_list) <- ecosystems
# 生成变量重要性的值,如果p值小于0.05则表示对应变量解释量
important_vars <- do.call(rbind, lapply(seq_along(results[, "RF_permuted"]), function(i) data.frame(ENV = row.names(importance(rf_list[[i]])),
Important = ifelse(importance(rf_list[[i]])[, 2] < 0.05,
importance(rf_list[[i]])[, 1], NA),
Eco = rep(names(rf_list[i]), each = nrow(importance(rf_list[[i]]))))))
# 合并数据
circle <- data.frame(ENV = colnames(ENV)[3:15]) %>%
left_join(important_vars) %>%
melt(id = c("ENV", "Eco"), value.name = "Importance"); circle
# Joining, by = "ENV"
# ENV Eco variable Importance
# 1 SOC Whole Important NA
# 2 SOC Grassland Important NA
# 3 SOC Forest Important 13.227866
# 4 SOC Cropland Important NA
# 5 SOC Wetland Important NA
# 6 SOC Desert Important NA
# 7 pH Whole Important NA
# 8 pH Grassland Important NA
# 9 pH Forest Important NA
# 10 pH Cropland Important NA
# 11 pH Wetland Important NA
# 12 pH Desert Important NA
# 13 CEC Whole Important NA
# 14 CEC Grassland Important NA
# 15 CEC Forest Important NA
# 16 CEC Cropland Important 6.204274
# 17 CEC Wetland Important NA
# 18 CEC Desert Important NA
# 19 TP Whole Important NA
# 20 TP Grassland Important NA
# 21 TP Forest Important NA
# 22 TP Cropland Important NA
# 23 TP Wetland Important NA
# 24 TP Desert Important NA
# 25 TK Whole Important NA
# 26 TK Grassland Important NA
# 27 TK Forest Important NA
# 28 TK Cropland Important 6.969350
# 29 TK Wetland Important NA
# 30 TK Desert Important 6.783699
# 31 AK Whole Important NA
# 32 AK Grassland Important NA
# 33 AK Forest Important NA
# 34 AK Cropland Important 7.491691
# 35 AK Wetland Important 7.208659
# 36 AK Desert Important 6.601446
# 37 TN Whole Important NA
# 38 TN Grassland Important NA
# 39 TN Forest Important 6.953556
# 40 TN Cropland Important NA
# 41 TN Wetland Important NA
# 42 TN Desert Important NA
# 43 NO3 Whole Important 5.075570
# 44 NO3 Grassland Important NA
# 45 NO3 Forest Important NA
# 46 NO3 Cropland Important NA
# 47 NO3 Wetland Important NA
# 48 NO3 Desert Important NA
# 49 NH4 Whole Important 4.594003
# 50 NH4 Grassland Important NA
# 51 NH4 Forest Important 6.911031
# 52 NH4 Cropland Important NA
# 53 NH4 Wetland Important NA
# 54 NH4 Desert Important NA
# 55 DOC Whole Important 14.319739
# 56 DOC Grassland Important 18.475196
# 57 DOC Forest Important 13.514009
# 58 DOC Cropland Important NA
# 59 DOC Wetland Important NA
# 60 DOC Desert Important 5.634245
# 61 MAT Whole Important NA
# 62 MAT Grassland Important NA
# 63 MAT Forest Important NA
# 64 MAT Cropland Important NA
# 65 MAT Wetland Important NA
# 66 MAT Desert Important NA
# 67 SD Whole Important NA
# 68 SD Grassland Important NA
# 69 SD Forest Important 6.683399
# 70 SD Cropland Important 5.815041
# 71 SD Wetland Important NA
# 72 SD Desert Important NA
# 73 Pl Whole Important 26.440467
# 74 Pl Grassland Important 12.635259
# 75 Pl Forest Important NA
# 76 Pl Cropland Important 12.873442
# 77 Pl Wetland Important 7.492399
# 78 Pl Desert Important 15.139708
circle$ENV<- factor(circle$ENV, rev(colnames(ENV)[3:15]))
# 划分不同生态系统的数据集
df_eco <- split(ENV[, -1:-2], ENV$Ecosystem)
names(df_eco)
# 计算不同生态系统的相关性
r <- data.frame(ENV = colnames(ENV)[3:15],
Whole = (cor(ENV[, -1:-2]))[14, 1:13],
Grassland = (cor(df_eco$Grassland))[14, 1:13],
Forest = (cor(df_eco$Forest))[14, 1:13],
Cropland = (cor(df_eco$Cropland))[14, 1:13],
Wetland = (cor(df_eco$Wetland))[14, 1:13],
Desert = (cor(df_eco$Desert))[14, 1:13])%>%
melt(id = "ENV", value.name = "Correlation");r
# ENV variable Correlation
# 1 SOC Whole 0.190880826
# 2 pH Whole -0.047988531
# 3 CEC Whole 0.375623831
# 4 TP Whole 0.204011032
# 5 TK Whole 0.282682786
# 6 AK Whole -0.238701400
# 7 TN Whole 0.205713695
# 8 NO3 Whole -0.213326858
# 9 NH4 Whole -0.076868083
# 10 DOC Whole 0.417261347
# 11 MAT Whole -0.336919922
# 12 SD Whole -0.272103426
# 13 Pl Whole 0.590955861
# 14 SOC Grassland 0.389609962
# 15 pH Grassland -0.292803759
# 16 CEC Grassland 0.466440818
# 17 TP Grassland 0.240020091
# 18 TK Grassland 0.250093957
# 19 AK Grassland 0.295479703
# 20 TN Grassland 0.286305242
# 21 NO3 Grassland 0.060433644
# 22 NH4 Grassland -0.179338677
# 23 DOC Grassland 0.449023819
# 24 MAT Grassland -0.445949001
# 25 SD Grassland -0.511629899
# 26 Pl Grassland 0.657737041
# 27 SOC Forest 0.419911131
# 28 pH Forest -0.259555477
# 29 CEC Forest 0.441782003
# 30 TP Forest -0.043325378
# 31 TK Forest 0.184701577
# 32 AK Forest 0.314240260
# 33 TN Forest 0.394795298
# 34 NO3 Forest 0.191051827
# 35 NH4 Forest -0.062053889
# 36 DOC Forest 0.472024615
# 37 MAT Forest -0.246051709
# 38 SD Forest -0.427179574
# 39 Pl Forest 0.445715297
# 40 SOC Cropland 0.190504600
# 41 pH Cropland 0.079739140
# 42 CEC Cropland 0.372432772
# 43 TP Cropland 0.173902720
# 44 TK Cropland 0.339013659
# 45 AK Cropland -0.567700216
# 46 TN Cropland 0.179234867
# 47 NO3 Cropland -0.426516867
# 48 NH4 Cropland 0.047660104
# 49 DOC Cropland 0.431862312
# 50 MAT Cropland -0.388846551
# 51 SD Cropland -0.129670246
# 52 Pl Cropland 0.693855669
# 53 SOC Wetland 0.062759256
# 54 pH Wetland -0.230794196
# 55 CEC Wetland 0.292344133
# 56 TP Wetland 0.170356678
# 57 TK Wetland 0.265791237
# 58 AK Wetland 0.323268430
# 59 TN Wetland -0.004741411
# 60 NO3 Wetland 0.172574329
# 61 NH4 Wetland -0.160782185
# 62 DOC Wetland 0.436793024
# 63 MAT Wetland -0.386445255
# 64 SD Wetland -0.189241202
# 65 Pl Wetland 0.343211982
# 66 SOC Desert 0.008834840
# 67 pH Desert 0.298885289
# 68 CEC Desert 0.155105965
# 69 TP Desert 0.253336024
# 70 TK Desert 0.485088124
# 71 AK Desert -0.554681227
# 72 TN Desert 0.173607814
# 73 NO3 Desert -0.256845001
# 74 NH4 Desert -0.144874726
# 75 DOC Desert 0.443678302
# 76 MAT Desert -0.263961749
# 77 SD Desert -0.160875488
# 78 Pl Desert 0.648853903
r$ENV <- factor(r$ENV, levels = rev(colnames(ENV)[3:15]))
# 合并相关性与随机森林重要性的结果
r <- r %>% left_join(circle[c("ENV", "Importance")], by = "ENV");head(r)
# ENV variable Correlation Importance
# 1 SOC Whole 0.1908808 NA
# 2 SOC Whole 0.1908808 NA
# 3 SOC Whole 0.1908808 13.22787
# 4 SOC Whole 0.1908808 NA
# 5 SOC Whole 0.1908808 NA
# 6 SOC Whole 0.1908808 NA
热图可视化
heatmap <- ggplot()+
geom_tile(data = r, aes(x = variable, y = ENV, fill = Correlation))+
scale_fill_gradientn(colors = c('#2D6DB1', 'white', '#DC1623'),
limit = c(-1, 1))+
geom_point(data = circle, aes(x = Eco, y = ENV,
size = Importance), shape = 21)+
theme_bw()+
theme(panel.background = element_blank(),
panel.grid = element_blank(),
axis.text.x = element_text(angle = 45, color = "black",
size = 12, vjust = 0.6),
axis.text.y = element_text(color = 'black', size = 12),
legend.title = element_text(size = 10))+
labs(y = '', x = '')
heatmap
合并随机森林重要性和热图
# barplot + heatmap +
# plot_layout(ncol = 1, heights = c(1, 6))
barplot %>%
aplot::insert_bottom(heatmap, height = 6)
# graph2ppt(file = "ppt.ppt", height = 8, width = 5, append = T)
从结果来看,可以说是完美复刻!
附上所有代码
rm(list = ls())
# 在R中绘制随机森林模型中变量重要性和相关性的热图
library(vegan)
library(randomForest)
library(rfUtilities)
library(rfPermute)
library(ggplot2)
library(tidyverse)
library(patchwork)
library(export)
library(reshape2)
Micro <- openxlsx::read.xlsx("OTU1.xlsx", 1)
ENV <- openxlsx::read.xlsx("ENV1.xlsx", 1);head(ENV)
# Sample Ecosystem SOC pH CEC TP TK AK TN NO3 NH4 DOC
# 1 A1 Grassland 27.05671 7.904906 15.323426 0.9808826 14.75433 180.5585 2.1198671 125.335557 0.3171696 118.5123
# 2 A2 Grassland 25.76547 8.573666 18.104304 0.8527650 14.69093 166.8692 2.4470941 107.296478 2.4062394 124.6120
# 3 A3 Forest 12.33270 8.277955 8.297831 0.4298910 12.94322 121.5713 0.7090047 6.961546 2.1797286 129.7396
# 4 A4 Forest 16.63684 7.150481 15.632651 0.3711634 13.53909 194.0584 1.0868895 23.967231 29.2321159 129.3390
# 5 A5 Cropland 131.72950 7.548969 30.291618 0.6691908 14.19519 144.4154 5.4975148 38.032274 17.6463368 106.5602
# 6 A6 Cropland 123.75436 7.840254 34.357811 0.5828902 14.51699 387.8284 7.9689639 59.413622 12.7478292 105.8368
# MAT SD Pl
# 1 2.0162632 1.855115 274.0981
# 2 2.1286678 2.370189 369.1070
# 3 3.5848455 2.037255 263.5001
# 4 2.6846039 2.026378 346.7367
# 5 0.5787488 2.380261 333.7921
# 6 0.6756787 1.590068 240.1557
# 计算香浓指数,并把它合并在环境变量数据中
ENV$shannon <- diversity(t(Micro), index = "shannon")
# 生成一个总的与先前的不同生态系统合并成一个新的数据集
ENV_total <- ENV
ENV_total$Ecosystem <- "Whole"
ENV_new <- rbind(ENV_total, ENV)
# 确定不同生态系统类型的数量
unique(ENV_new$Ecosystem)
# [1] "Whole" "Grassland" "Forest" "Cropland" "Wetland" "Desert"
# # 使用随机森林计算各生态系统中环境生物因子对微生物香浓指数的重要性
# set.seed(1)#设置种子,确保结果一致
# RF_total <- randomForest(shannon ~ . , ENV[, -1:-2], importance = T)
# RFs_total <- rfPermute(shannon ~ . , ENV[, -1:-2])
# RF_total;importance(RFs_total)[,1:2]
# # 这里的%Var explained就是我们画柱状图时柱子的高度,大家可以整理下来
# # importance函数调取了每个变量对shannon指数的重要性,用它们定义热图中空心圆的大小
# 构建自定义函数进行并行运算
library(foreach)
library(doParallel)
# 设置核心数
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
registerDoParallel(cl)
# 获取生态系统列表
ecosystems <- unique(ENV_new$Ecosystem)
# 设置输出结果名称
options <- list()
options$results <- ecosystems
# 定义一个函数,对每个生态系统运行随机森林分析
run_single_rf <- function(data, eco) {
if (!"Ecosystem" %in% names(data)) {
stop("Error: 'Ecosystem' variable not found in data.")
}
if (!(eco %in% unique(data$Ecosystem))) {
stop("Error: Invalid Ecosystem value.")
}
# 对当前生态系统的数据进行子集
eco_data <- subset(data, Ecosystem == eco)[,-1:-2]
# 设置种子
set.seed(1)
# 进行随机森林分析和置换检验
RF <- randomForest(shannon~., eco_data, importance = T)
RF_r2 <- rf.significance(RF, eco_data, nperm=99, ntree=501)$Rsquare
RF_permuted <- rfPermute(shannon~., eco_data, nperm=99, ntree=501)
# 返回结果
list(ecosystem = eco, RF = RF, RF_r2 = RF_r2, RF_permuted = RF_permuted)
}
results <- foreach(i = ecosystems, .combine = rbind) %dopar% {
library(randomForest)
library(rfUtilities)
library(rfPermute)
# 运行单个随机森林分析
run_single_rf(data = ENV_new, eco = i)
}
# 停止并行处理
stopCluster(cl)
registerDoSEQ()
# 确认结果
head(results)
# ecosystem RF RF_r2 RF_permuted
# result.1 "Whole" randomForest.formula,18 0.5722844 rfPermute,6
# result.2 "Grassland" randomForest.formula,18 0.6213995 rfPermute,6
# result.3 "Forest" randomForest.formula,18 0.4347413 rfPermute,6
# result.4 "Cropland" randomForest.formula,18 0.4695682 rfPermute,6
# result.5 "Wetland" randomForest.formula,18 0.1822965 rfPermute,6
# result.6 "Desert" randomForest.formula,18 0.5186782 rfPermute,6
results[, "RF_permuted"][[1]]
# An rfPermute model
#
# Type of random forest: regression
# Number of trees: 501
# No. of variables tried at each split: 4
# No. of permutation replicates: 100
# Start time: 2023-04-18 22:56:38
# End time: 2023-04-18 22:57:05
# Run time: 26.8 secs
# 绘制柱状图
names <- c("ecosystem", "RF_r2")
bar <- as.data.frame(results[,names])
bar$ecosystem <- unlist(bar$ecosystem)
bar$RF_r2 <- unlist(bar$RF_r2)
str(bar)
# 'data.frame': 6 obs. of 2 variables:
# $ ecosystem: chr "Whole" "Grassland" "Forest" "Cropland" ...
# $ RF_r2 : num 0.572 0.621 0.435 0.47 0.182 ...
colnames(bar) <- c("variable", "Exp")
bar$variable<- factor(bar$variable)
barplot<- ggplot(bar, aes(variable, Exp))+
geom_bar(stat = "identity", fill = "steelblue")+
scale_y_continuous(expand = c(0,0))+
theme_classic(base_line_size = 0.75)+
theme(panel.background = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_text(color = "black", size = 12))+
labs(title = "Explained variation (%)", size = 12)
barplot
# 打包6个分类的随机森林模型结果
rf_list <- results[, "RF_permuted"]
# 修改6个分类模型的名称,改为对应的分类
names(rf_list) <- ecosystems
# 生成变量重要性的值,如果p值小于0.05则表示对应变量解释量
important_vars <- do.call(rbind, lapply(seq_along(results[, "RF_permuted"]), function(i) data.frame(ENV = row.names(importance(rf_list[[i]])),
Important = ifelse(importance(rf_list[[i]])[, 2] < 0.05,
importance(rf_list[[i]])[, 1], NA),
Eco = rep(names(rf_list[i]), each = nrow(importance(rf_list[[i]]))))))
# 合并数据
circle <- data.frame(ENV = colnames(ENV)[3:15]) %>%
left_join(important_vars) %>%
melt(id = c("ENV", "Eco"), value.name = "Importance"); circle
# Joining, by = "ENV"
# ENV Eco variable Importance
# 1 SOC Whole Important NA
# 2 SOC Grassland Important NA
# 3 SOC Forest Important 13.227866
# 4 SOC Cropland Important NA
# 5 SOC Wetland Important NA
# 6 SOC Desert Important NA
# 7 pH Whole Important NA
# 8 pH Grassland Important NA
# 9 pH Forest Important NA
# 10 pH Cropland Important NA
# 11 pH Wetland Important NA
# 12 pH Desert Important NA
# 13 CEC Whole Important NA
# 14 CEC Grassland Important NA
# 15 CEC Forest Important NA
# 16 CEC Cropland Important 6.204274
# 17 CEC Wetland Important NA
# 18 CEC Desert Important NA
# 19 TP Whole Important NA
# 20 TP Grassland Important NA
# 21 TP Forest Important NA
# 22 TP Cropland Important NA
# 23 TP Wetland Important NA
# 24 TP Desert Important NA
# 25 TK Whole Important NA
# 26 TK Grassland Important NA
# 27 TK Forest Important NA
# 28 TK Cropland Important 6.969350
# 29 TK Wetland Important NA
# 30 TK Desert Important 6.783699
# 31 AK Whole Important NA
# 32 AK Grassland Important NA
# 33 AK Forest Important NA
# 34 AK Cropland Important 7.491691
# 35 AK Wetland Important 7.208659
# 36 AK Desert Important 6.601446
# 37 TN Whole Important NA
# 38 TN Grassland Important NA
# 39 TN Forest Important 6.953556
# 40 TN Cropland Important NA
# 41 TN Wetland Important NA
# 42 TN Desert Important NA
# 43 NO3 Whole Important 5.075570
# 44 NO3 Grassland Important NA
# 45 NO3 Forest Important NA
# 46 NO3 Cropland Important NA
# 47 NO3 Wetland Important NA
# 48 NO3 Desert Important NA
# 49 NH4 Whole Important 4.594003
# 50 NH4 Grassland Important NA
# 51 NH4 Forest Important 6.911031
# 52 NH4 Cropland Important NA
# 53 NH4 Wetland Important NA
# 54 NH4 Desert Important NA
# 55 DOC Whole Important 14.319739
# 56 DOC Grassland Important 18.475196
# 57 DOC Forest Important 13.514009
# 58 DOC Cropland Important NA
# 59 DOC Wetland Important NA
# 60 DOC Desert Important 5.634245
# 61 MAT Whole Important NA
# 62 MAT Grassland Important NA
# 63 MAT Forest Important NA
# 64 MAT Cropland Important NA
# 65 MAT Wetland Important NA
# 66 MAT Desert Important NA
# 67 SD Whole Important NA
# 68 SD Grassland Important NA
# 69 SD Forest Important 6.683399
# 70 SD Cropland Important 5.815041
# 71 SD Wetland Important NA
# 72 SD Desert Important NA
# 73 Pl Whole Important 26.440467
# 74 Pl Grassland Important 12.635259
# 75 Pl Forest Important NA
# 76 Pl Cropland Important 12.873442
# 77 Pl Wetland Important 7.492399
# 78 Pl Desert Important 15.139708
circle$ENV<- factor(circle$ENV, rev(colnames(ENV)[3:15]))
# 划分不同生态系统的数据集
df_eco <- split(ENV[, -1:-2], ENV$Ecosystem)
names(df_eco)
# 计算不同生态系统的相关性
r <- NULL
r <- data.frame(ENV = colnames(ENV)[3:15],
Whole = (cor(ENV[, -1:-2]))[14, 1:13],
Grassland = (cor(df_eco$Grassland))[14, 1:13],
Forest = (cor(df_eco$Forest))[14, 1:13],
Cropland = (cor(df_eco$Cropland))[14, 1:13],
Wetland = (cor(df_eco$Wetland))[14, 1:13],
Desert = (cor(df_eco$Desert))[14, 1:13])%>%
melt(id = "ENV", value.name = "Correlation");r
# ENV variable Correlation
# 1 SOC Whole 0.190880826
# 2 pH Whole -0.047988531
# 3 CEC Whole 0.375623831
# 4 TP Whole 0.204011032
# 5 TK Whole 0.282682786
# 6 AK Whole -0.238701400
# 7 TN Whole 0.205713695
# 8 NO3 Whole -0.213326858
# 9 NH4 Whole -0.076868083
# 10 DOC Whole 0.417261347
# 11 MAT Whole -0.336919922
# 12 SD Whole -0.272103426
# 13 Pl Whole 0.590955861
# 14 SOC Grassland 0.389609962
# 15 pH Grassland -0.292803759
# 16 CEC Grassland 0.466440818
# 17 TP Grassland 0.240020091
# 18 TK Grassland 0.250093957
# 19 AK Grassland 0.295479703
# 20 TN Grassland 0.286305242
# 21 NO3 Grassland 0.060433644
# 22 NH4 Grassland -0.179338677
# 23 DOC Grassland 0.449023819
# 24 MAT Grassland -0.445949001
# 25 SD Grassland -0.511629899
# 26 Pl Grassland 0.657737041
# 27 SOC Forest 0.419911131
# 28 pH Forest -0.259555477
# 29 CEC Forest 0.441782003
# 30 TP Forest -0.043325378
# 31 TK Forest 0.184701577
# 32 AK Forest 0.314240260
# 33 TN Forest 0.394795298
# 34 NO3 Forest 0.191051827
# 35 NH4 Forest -0.062053889
# 36 DOC Forest 0.472024615
# 37 MAT Forest -0.246051709
# 38 SD Forest -0.427179574
# 39 Pl Forest 0.445715297
# 40 SOC Cropland 0.190504600
# 41 pH Cropland 0.079739140
# 42 CEC Cropland 0.372432772
# 43 TP Cropland 0.173902720
# 44 TK Cropland 0.339013659
# 45 AK Cropland -0.567700216
# 46 TN Cropland 0.179234867
# 47 NO3 Cropland -0.426516867
# 48 NH4 Cropland 0.047660104
# 49 DOC Cropland 0.431862312
# 50 MAT Cropland -0.388846551
# 51 SD Cropland -0.129670246
# 52 Pl Cropland 0.693855669
# 53 SOC Wetland 0.062759256
# 54 pH Wetland -0.230794196
# 55 CEC Wetland 0.292344133
# 56 TP Wetland 0.170356678
# 57 TK Wetland 0.265791237
# 58 AK Wetland 0.323268430
# 59 TN Wetland -0.004741411
# 60 NO3 Wetland 0.172574329
# 61 NH4 Wetland -0.160782185
# 62 DOC Wetland 0.436793024
# 63 MAT Wetland -0.386445255
# 64 SD Wetland -0.189241202
# 65 Pl Wetland 0.343211982
# 66 SOC Desert 0.008834840
# 67 pH Desert 0.298885289
# 68 CEC Desert 0.155105965
# 69 TP Desert 0.253336024
# 70 TK Desert 0.485088124
# 71 AK Desert -0.554681227
# 72 TN Desert 0.173607814
# 73 NO3 Desert -0.256845001
# 74 NH4 Desert -0.144874726
# 75 DOC Desert 0.443678302
# 76 MAT Desert -0.263961749
# 77 SD Desert -0.160875488
# 78 Pl Desert 0.648853903
r$ENV <- factor(r$ENV, levels = rev(colnames(ENV)[3:15]))
# 合并相关性与随机森林重要性的结果
r <- r %>% left_join(circle[c("ENV", "Importance")], by = "ENV");head(r)
# ENV variable Correlation Importance
# 1 SOC Whole 0.1908808 NA
# 2 SOC Whole 0.1908808 NA
# 3 SOC Whole 0.1908808 13.22787
# 4 SOC Whole 0.1908808 NA
# 5 SOC Whole 0.1908808 NA
# 6 SOC Whole 0.1908808 NA
heatmap <- ggplot()+
geom_tile(data = r, aes(x = variable, y = ENV, fill = Correlation))+
scale_fill_gradientn(colors = c('#2D6DB1', 'white', '#DC1623'),
limit = c(-1, 1))+
geom_point(data = circle, aes(x = Eco, y = ENV,
size = Importance), shape = 21)+
theme_bw()+
theme(panel.background = element_blank(),
panel.grid = element_blank(),
axis.text.x = element_text(angle = 45, color = "black",
size = 12, vjust = 0.6),
axis.text.y = element_text(color = 'black', size = 12),
legend.title = element_text(size = 10))+
labs(y = '', x = '')
heatmap
# barplot + heatmap +
# plot_layout(ncol = 1, heights = c(1, 6))
# graph2ppt(file = "ppt.ppt", height = 8, width = 5, append = T)
# barplot + heatmap +
# plot_layout(ncol = 1, heights = c(1, 6))
barplot %>%
aplot::insert_bottom(heatmap, height = 6)
graph2ppt(file = "ppt.ppt", height = 8, width = 5, append = T)