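This is the fourth tutorial in the series on scientific visualization with R. It continues the tour of the plot types available in ggplot2.
This tutorial draws on the books 《R语言数据可视化之美》, R Graphics Cookbook, 《R语言可视化教程》, and ggplot2: Elegant Graphics for Data Analysis, among others.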
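Data Processing for Visualization (Part 2)
Data wrangling
(1) Concatenation: cbind/rbind
Data processing in R often involves combining data frames, either by column or by row. cbind and rbind are the two common functions for combining vectors or data frames. Two small examples show how they work: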
# Create two vectors
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
# Use cbind to combine the two vectors by column
result_cbind <- cbind(vector1, vector2)
# Create two data frames (rbind requires matching column names)
df1 <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))
df2 <- data.frame(name = c("Charlie", "David"), age = c(28, 35))
# Use rbind to combine the two data frames by row
result_rbind <- rbind(df1, df2)
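A quick way to see what these two calls return is to check the class and dimensions of the results. The following is a minimal sketch that repeats the objects from the example so it runs on its own; it only inspects the results and adds nothing new to the method.

```r
# Rebuild the objects from the example so this sketch is self-contained
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
result_cbind <- cbind(vector1, vector2)

df1 <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))
df2 <- data.frame(name = c("Charlie", "David"), age = c(28, 35))
result_rbind <- rbind(df1, df2)

# cbind on two vectors returns a matrix; its columns are named after the vectors
print(is.matrix(result_cbind))
print(dim(result_cbind))        # number of rows, number of columns
print(colnames(result_cbind))

# rbind on two data frames with identical column names returns a data frame
print(is.data.frame(result_rbind))
print(dim(result_rbind))
```

Note that cbind on plain vectors gives a matrix rather than a data frame; wrapping the result in as.data.frame() is one option if a data frame is needed for plotting.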
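(2) Merging data:
When the data frames do not line up neatly, we can combine them with merge or the join functions, usually by matching on a shared column. To try this out, first create four data frames: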
df1 <- data.frame(x= c('a','b','c'),y = 1:3)
df2 <- data.frame(x= c('a','b','d'),z = c(7,5,3))
df3 <- data.frame(g= c('a','b','d'),z = c(7,5,3))
df4 <- data.frame(x= c('a','b','d'),y = c(1,4,2),z = c(7,5,3))
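Merging with merge. Inspect the resulting data frames to see what each call does:
dm1 <- merge(df1, df2, by = 'x', all = TRUE)  # merge on column x and keep all rows from both data frames
dm2 <- merge(df1, df3, by.x = 'x', by.y = 'g')  # match df1$x against df3$g and keep only the rows present in both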
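Merging with the dplyr join functions. Again, inspect the resulting data frames; the joins are more flexible:
dj1 <- dplyr::left_join(x = df1, y = df2, by = 'x')   # join on x, keeping all rows of the first table
dj2 <- dplyr::right_join(x = df1, y = df2, by = 'x')  # join on x, keeping all rows of the second table
dj3 <- dplyr::inner_join(x = df1, y = df2, by = 'x')  # join on x, keeping only the rows present in both tables
dj4 <- dplyr::full_join(x = df1, y = df2, by = 'x')   # join on x, keeping all rows from both tables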
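The four join types can also be reproduced with base R's merge by switching the all, all.x, and all.y arguments. The sketch below uses only df1 and df2 from the example and compares row counts. It is meant as a way to check your understanding of the joins, not as a replacement for dplyr; row order and column types can differ between the two approaches.

```r
df1 <- data.frame(x = c("a", "b", "c"), y = 1:3)
df2 <- data.frame(x = c("a", "b", "d"), z = c(7, 5, 3))

inner <- merge(df1, df2, by = "x")                 # comparable to inner_join
left  <- merge(df1, df2, by = "x", all.x = TRUE)   # comparable to left_join
right <- merge(df1, df2, by = "x", all.y = TRUE)   # comparable to right_join
full  <- merge(df1, df2, by = "x", all = TRUE)     # comparable to full_join

# Keys shared by both tables are "a" and "b"; "c" is only in df1, "d" only in df2
row_counts <- c(inner = nrow(inner), left = nrow(left),
                right = nrow(right), full = nrow(full))
print(row_counts)   # inner 2, left 3, right 3, full 4

# Rows without a match are filled with NA
print(left)
```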
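(3) Grouping data
When the data contain non-continuous (categorical) variables, we can group by them before plotting. First build a wide data frame and reshape it to long format, using the wide/long conversion covered in the previous tutorial: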
df <- data.frame(x = c('a','b','c','a','c'),
'2023' = c(1,3,4,4,3),
'2024' = c(3,5,3,8,9),check.names = FALSE)
dm <- reshape2::melt(df,id.vars = 'x',variable.name = 'year',value.name = 'value')
group1 <- aggregate(value ~ year, dm, mean)     # group by year and compute the mean
group2 <- aggregate(value ~ year + x, dm, sum)  # group by year and x and compute the sum
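To check the grouped results by hand, the sketch below rebuilds the same long-format data with base R only. The hand-built dm is an assumption standing in for the reshape2::melt call, so year is a character column here rather than a factor. The same two aggregate calls are then run on it.

```r
df <- data.frame(x = c("a", "b", "c", "a", "c"),
                 "2023" = c(1, 3, 4, 4, 3),
                 "2024" = c(3, 5, 3, 8, 9), check.names = FALSE)

# Hand-built long format (stand-in for reshape2::melt, used here to avoid extra packages)
dm <- data.frame(x     = rep(df$x, times = 2),
                 year  = rep(c("2023", "2024"), each = nrow(df)),
                 value = c(df[["2023"]], df[["2024"]]))

group1 <- aggregate(value ~ year, dm, mean)     # one row per year
group2 <- aggregate(value ~ year + x, dm, sum)  # one row per (year, x) combination

print(group1)   # mean for 2023 is (1+3+4+4+3)/5 = 3, for 2024 it is (3+5+3+8+9)/5 = 5.6
print(group2)
```

Each row of group2 is the sum over one year and one category; for example, category a in 2023 gives 1 + 4 = 5.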
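A figure from 《R语言数据可视化之美》 is used here to help illustrate these operations:
That is all on data processing for now. A later series will probably cover more complex data processing in detail, but the skills covered so far are basically enough to get on with visualization.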
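Scatter Plots (Part 1)
1. Scatter plots with fitted curves:
Scatter plots are among the most common plots in research. They show the trend of, or the correlation between, two variables.
They provide three kinds of information: (1) the quantitative association between the variables, (2) whether the relationship is linear or nonlinear, and (3) outliers.
Fitting for visualization usually means curve fitting, carried out by least squares. You can choose among polynomial, linear, exponential, and other regression models.
We use the iris dataset for a fit that is not especially appropriate for the data, purely as a demonstration: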
library(ggplot2)
library(gridExtra)
# Linear fit
p1 <- ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +  # scatter layer
  stat_smooth(method = "lm", formula = y ~ x, colour = "blue") +  # linear fit
  labs(title = "Linear Fit", x = "Petal Length", y = "Petal Width") +
  theme_minimal()
# Exponential fit: Gaussian GLM with a log link (model arguments go through method.args)
p2 <- ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = gaussian(link = "log")),
              formula = y ~ x, colour = "red") +
  labs(title = "Exponential Fit", x = "Petal Length", y = "Petal Width") +
  theme_minimal()
# Logarithmic fit: identity link with log(x) as the predictor
p3 <- ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = gaussian(link = "identity")),
              formula = y ~ log(x), colour = "green") +
  labs(title = "Logarithmic Fit", x = "Petal Length", y = "Petal Width") +
  theme_minimal()
# Polynomial fit (second degree)
p4 <- ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), colour = "purple") +
  labs(title = "Polynomial Fit", x = "Petal Length", y = "Petal Width") +
  theme_minimal()
# Collect the four plots in a list
plots <- list(p1, p2, p3, p4)
# Arrange them in a 2 x 2 grid with grid.arrange()
do.call(grid.arrange, c(plots, ncol = 2))
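In this way we can fit a scatter plot with different functions. In real research it is worth trying several and choosing the fit with the smallest error (the grey band, which is the confidence interval around the fitted curve).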
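The grey band is a visual cue, so it can help to compare candidate fits numerically as well. The sketch below fits three of the models above to the same iris columns with lm and reports the residual sum of squares and the adjusted R². The fits list and the fit_summary table are names chosen for this illustration, and the choice of these two metrics is an assumption rather than something prescribed by the reference books.

```r
# Candidate models on the same variables used in the plots
fits <- list(
  linear      = lm(Petal.Width ~ Petal.Length, data = iris),
  logarithmic = lm(Petal.Width ~ log(Petal.Length), data = iris),
  quadratic   = lm(Petal.Width ~ poly(Petal.Length, 2), data = iris)
)

# For an lm object, deviance() is the residual sum of squares
fit_summary <- data.frame(
  model  = names(fits),
  rss    = sapply(fits, deviance),
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
  row.names = NULL
)

# Smaller RSS means the curve lies closer to the observed points
print(fit_summary[order(fit_summary$rss), ])
```

A lower residual sum of squares alone does not make a model the better choice: the quadratic model can never have a larger RSS than the linear one because it contains it as a special case, which is why the adjusted R² is shown alongside.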
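2. Residual plots (the dataset and method here follow 《R语言数据可视化之美》)
A residual is the difference between an observed value and its predicted (fitted) value. Analyzing the residuals lets us assess how reliable the data are and spot periodicity or other disturbing factors. In regression analysis the residuals are assumed to follow a normal distribution N(0, σ²), and the standardized residuals (δ*) then approximately follow the standard normal distribution N(0, 1). If the standardized residual of an experimental point falls outside the interval (-2, 2), the point can be regarded as an outlier at roughly the 95% confidence level, and it should then be left out of the regression fit.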
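The (-2, 2) rule can be sketched in a few lines. Since the tutorial's CSV file is not available here, the example uses the built-in iris data as a stand-in, and it uses rstandard() from the stats package to compute the standardized residuals. Both choices are assumptions made for illustration.

```r
# Stand-in model on built-in data (the tutorial's own dataset is a local CSV)
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Standardized (internally studentized) residuals
std_res <- rstandard(fit)

# Indices of points whose standardized residual lies outside (-2, 2)
flagged <- which(abs(std_res) > 2)
print(iris[flagged, c("Petal.Length", "Petal.Width")])

# Share of points inside (-2, 2); about 0.95 is expected if the normality assumption holds
print(mean(abs(std_res) <= 2))
```

Flagged points are candidates for a closer look rather than automatic deletions; whether to drop them depends on what is known about how the data were collected.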
After fitting a model to the dataset, we can visualize the result with a residual analysis plot:
library(ggplot2)
# Read the residual-analysis dataset (change the path to wherever your copy is stored)
mydata <- read.csv("C:\\Users\\Huzhuocheng\\Desktop\\Residual_Analysis_Data.csv", stringsAsFactors = FALSE)
# Linear fit of y2 on x
fit <- lm(y2 ~ x, data = mydata)
mydata$predicted <- predict(fit)               # save the predicted values
mydata$residuals <- residuals(fit)             # save the residuals
mydata$Abs_Residuals <- abs(mydata$residuals)  # absolute residuals, mapped to fill and size
ggplot(mydata, aes(x = x, y = y2)) +
  # observed points: fill colour and size both encode the absolute residual
  geom_point(aes(fill = Abs_Residuals, size = Abs_Residuals), shape = 21, colour = "black") +
  scale_fill_continuous(low = "black", high = "red") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "lightgrey") +  # fitted line
  geom_point(aes(y = predicted), shape = 1) +                  # predicted values as open circles
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +  # segments from observed to predicted
  # same title for both guides so fill and size share one legend
  guides(fill = guide_legend(title = "Residual"),
         size = guide_legend(title = "Residual")) +
  ylim(c(0, 150)) +
  xlab("X-Axis") +
  ylab("Y-Axis") +
  theme(text = element_text(size = 15, face = "plain", color = "black"),
        axis.title = element_text(size = 10, face = "plain", color = "black"),
        axis.text = element_text(size = 10, face = "plain", color = "black"),
        legend.position = "right",
        legend.title = element_text(size = 13, face = "plain", color = "black"),
        legend.text = element_text(size = 10, face = "plain", color = "black"),
        legend.background = element_rect(fill = alpha("white", 0)))
d <- mydata
# Quadratic fit of y5 on x
fit <- lm(y5 ~ x + I(x^2), data = d)
# Obtain the predicted and residual values
d$predicted <- predict(fit)      # save the predicted values
d$residuals0 <- residuals(fit)   # save the residuals
d$Residuals <- abs(d$residuals0) # absolute residuals, mapped to fill and size
ggplot(d, aes(x = x, y = y5)) +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE, color = "lightgrey") +  # fitted curve
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +  # segments from observed to predicted
  # observed points: fill colour and size both encode the absolute residual
  geom_point(aes(fill = Residuals, size = Residuals), shape = 21, colour = "black") +
  scale_fill_continuous(low = "black", high = "red") +
  geom_point(aes(y = predicted), shape = 1) +  # predicted values as open circles
  xlab("X-Axis") +
  ylab("Y-Axis") +
  # same title for both guides so fill and size share one legend
  guides(fill = guide_legend(title = "Residual"),
         size = guide_legend(title = "Residual")) +
  theme(text = element_text(size = 15, face = "plain", color = "black"),
        axis.title = element_text(size = 10, face = "plain", color = "black"),
        axis.text = element_text(size = 10, face = "plain", color = "black"),
        legend.position = "right",
        legend.title = element_text(size = 13, face = "plain", color = "black"),
        legend.text = element_text(size = 10, face = "plain", color = "black"),
        legend.background = element_rect(fill = alpha("white", 0)))
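We will explore more plot types in the next tutorial. That is all for today; if you have questions, feel free to raise them so we can discuss and learn together.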