python差异表达基因筛选差异表达基因验证

转载

flybirdfly 2023-12-06 19:28:12

文章标签 python差异表达基因筛选方差数据数据转换 文章分类 Python 后端开发

DESeq2的适用性

分析来自RNA-seq的计数数据，基因任务是检测差异表达基因。
也适用于其他分析：ChIP-Seq、HiC、shRNA筛选。

快速开始

dds = DESeqDataFromMatrix(countData = cts,
						  colData = colData,
						  design = ~batch + condition)

dds = DESeq(dds)
resultsNames(dds)  #lists the coefficients
res = results(dds, contrast=c("condition","treated","untreated"))
res = lfcShrink(dds,coef = "condition_trt_vs_untrt",type = "apeglm")

输入数据

输入数据需要归一化计数吗
不需要 输入矩阵的值为非标准计数或测序读数即可，DESeq2模型会对文库大小进行校正。
cts和colData的格式
cts列名为SampleName，行名为Gene。
colData列记录Sample的分组信息（不限一列），行名为SampleName。
预过滤
若Gene在所用Samples中的计数小于10则过滤掉。
keep = rowSums(counts(dds)) >= 10dds = dds[keep,]
因子水平
通过设置factor的参考水平来告诉DESeq2函数与哪个水平进行比较。

1. dds$condtion = factor(dds$condition,levels = c("untreated","treated"))
2. dds$condtion = relevel(dds$condtion,ref = "untreated")

差异表达分析

results函数
results(dds)生成结果表。
lfcShrink函数
logFoldChange缩小有助于基因的可视化和比较。可指定apeglm方法将dds对象传递给lfcShrink来缩小。
lfcShrink(dds,coef = "condition_treated_vs_untreated",type = "apeglm")
P值和调整后的P值
按P值对结果表进行排序。
使用summary(res)总结差异分析结果。
有多少个调整p值小于0.1？sum(res$padj < 0.1,na.rm = TRUE) 使用results函数设置padj cutoff results(dds,alpha = 0.05)]
什么时候能将P值设为NA

gene在所有样本中的计数均为0。
某行包含一个极端计数异常的样本。

多因素设置
colData矩阵描述分组分组。
可对dds对象使用colData函数查看分组因子水平。
多因素设置在design参数增加比较因素。

DESeq2可视化

MA plot
图中x轴为baseMean，y轴为logFoldChange，若padj<0.1点标为红色。

plotMA(res,ylim = c(-2,2))

缩小LFC可消除low readcount基因中与LFC变化相关的噪声。
resLFC = lfcShrink(res,coef = "condition_treated_vs_untreated",type = "apeglm")
plotMA(resLFC,ylim = c(-2,2))

可使用功能识别通过单击图来交互来检测单个基因行数：
identify(res$baseMean,res$log2FoldChange)

PlotCounts
对单个基因在各组中的读数进行可视化。计数经归一化。
plotCounts(dds,gene = which.min(res$padj),intgroup = "condition",returnData = False)

python差异表达基因筛选差异表达基因验证_数据转换

将returnData设置为TRUE，使用ggplot2绘制plotCounts。

d = plotCounts(dds,gene = which.min(res$padj),intgroup = "condition",returnData = TRUE)
ggplot(d,aes(x = condition,y = count)) +
	geom_point(position = position_jitter(w = 0.1,h = 0)) +
	scale_y_log10()

数据转换和可视化

计数数据转换
变换的目的是消除方差对均值的依赖性（低均值，高方差）。在转换之后，具有相同均值的基因没有完全相同的标准差，但是整个实验范围内的趋势趋于平缓。
VST：方差稳定变换。大样本推荐选择。
rlog：正规对数。
盲散估计
VST和rlog函数有一个blind参数。当下游分析时将blind设置为FALSE。
提取转换后的值

vsd = vst(dds,blind = FALSE)
rld = rlog(dds,blind = FALSE)
assay(vsd)
assay(rld)

数据质量评估

计数矩阵热图

library(pheatmap)
select = order(rowMeans(counts(dds,normalized = TRUE)),decreasing = TRUE)[1:20]
df = as.data.frame(colData(dds)[,c("condtion","type")])
ntd <- normTransform(dds)  #this gives log2(n+1)
pheatmap(assay(ntd)[select,],annotation_col = df,cluster_rows = FALSE,
	     cluster_cols = FALSE,show_rownames = FALSE)

样本及样本间的相关性
获得样本到样本的距离
sampleDist = dist(t(assay(vsd))) 定义距离矩阵，行名为分组信息，列名为空

sampleDistMatrix = as.matrix(sampleDist)
rownames(sampleDistMatrix) = paste(vsd$condition,vsd$type,sep='_')
colnames(sampleDistMatrix) = NULL
pheatmap(sampleDistMatrix,
	clustering_distance_rows = sampleDist,
	clustering_distance_cols = sampleDist)