转录组测序结果R语言分析转录组测序unigene

转载

mob6454cc62b754 2024-04-11 08:44:05

文章标签 转录组测序结果R语言分析 python ide 数据库数据 文章分类 R语言后端开发

转录组unigene表达结果，依据亚细胞定位再分析（无参考RNA-seq）

我们有一个几年前无参考转录组分析unigene的表达量结果，该转录组实验有两个处理（0 hour heat 和 6 hour heat treatment），想要从中查看铜相关基因在两个处理下的累积表达量情况，表达量分类是依据预测的亚细胞定位建立的。方便记录，具体操作步骤如下：

1. balstx寻找与已知铜蛋白比对率高的unigene序列

（1）已知两个文件

unigene.fasta （unigene DNA fasta）
RNA_unigene_res.txt （unigene gene count table）

（2）依据文献提取拟南芥相关蛋白序列

依据文章中检测到的拟南芥铜相关基因（附件中），使用拟南芥蛋白序列Arabidopsis_thaliana.TAIR10.pep.all.fa，提取对应蛋白序列：
The PCY-SAG14 phytocyanin module regulated by PIFs and miR408 promotes dark-induced leaf senescence in Arabidopsis

据此得到蛋白序列：
f07.tair10.copper.genelist.fasta

（3）运行blastx

使用BLAST 2.2.29+软件（BLAST+），首先用已知蛋白序列建库，然后用unigene.fasta比对：

makeblastdb -in ./f07.tair10.copper.genelist.fasta  -dbtype prot -out unigenedb -parse_seqids
blastx -query ./unigene.fasta -db unigenedb -out unigene.blastx.Tair10.res.blast\
 -evalue 1e-10  -num_threads 50 -outfmt 6

输出结果共有4713行：

4713 unigene.blastx.Tair10.res.blast

以identity 65%为阈值，只保留identity大于等于65%的比对结果，得到文件（111个基因）：

111 f08.iden65.unigene.txt

根据unigene.fasta提取得到111个基因的DNA序列：

f09.iden65.unigene.fasta

2. 翻译基因序列为蛋白序列

使用EMBOSS工具包transeq软件，将DNA序列转换为Protein序列。根据6个开放阅读框，111个基因序列可以翻译得到666个蛋白序列：

grep ">" -c f17.111gene.pep.fa 
666

3. 蛋白序列注释Pfam数据库

使用Interproscan软件注释Pfam数据库。

nohup /home/debian/interproscan_data/interproscan-5.48-83.0/interproscan.sh -i \
   f17.111gene.pep.fa -appl Pfam -cpu 20 \
   -o f17.111.pfam.res.txt -f TSV -dp -iprlookup -pa &

4. 筛选铜蛋白（每个基因保留一个蛋白序列）并预测亚细胞定位

根据Interproscan结果，删除一些不相关的基因，最终确定下面的Pfam结果，共有98个蛋白，每个蛋白对应着一个单独的基因：

f20.correct.pfam.txt ## 筛选后的Pfam结果
f21.98.ensure_domain.98pep.txt ## 蛋白名列表

c102995_g1_2
c10324_g1_5
c103989_g1_5
c105172_g1_6

提取蛋白序列f22.98pep.fa，导入WoLF PSORT（https://wolfpsort.hgc.jp/）网站预测亚细胞定位。结果如下：

c102995_g1_2 details cyto: 7, nucl: 2, extr: 2, vacu: 2, plas: 1
c10324_g1_5 details cyto: 12, nucl: 2
c103989_g1_5 details chlo: 8, nucl: 3, extr: 2, cyto: 1
c105172_g1_6 details cyto: 12, nucl: 1, golg: 1

提取最可靠的亚细胞定位信息作为该蛋白预测结果：

c102995_g1	Cytoplasmic
c10324_g1	Cytoplasmic
c103989_g1	Chloroplast

5. 计算亚细胞定位中基因的累积表达量

筛选出上面98个基因的表达量数据，根据亚细胞定位结果进行排序，表达量结果如下：

## 文件名
f26.simple.98gene.exp.txt 

c102995_g1	0.275821379	0
c10324_g1	2.120374761	0.353179332
c103989_g1	0.277707906	1.216129624
c105172_g1	0.275821379	0

使用EXCEL排序等求和等操作可以很快得到对应亚细胞定位的累积表达量，结果如下：

f27.98gene.exp.sum.txt

loc	0h	6h
Chloroplast	3494.101128	7961.049997
Cytoplasmic	41.84050748	62.01538887
Extracellular	0	4.741760757
Mitochondrial	22.07374703	21.84347317
Nuclear	5673.891125	5272.025286
Plasmembrane	44384.60508	30100.41777

6. 可视化展示

基于上面的数据，使用R语言可视化展示：

rm(list=ls())
require(RColorBrewer)
setwd("~/Desktop/pfam_111/")

## import data
total=read.table("f27.98gene.exp.sum.txt",stringsAsFactors = F,header=F,skip=1)

## calculate ratios
d0=total
d0$V4=d0$V2/sum(d0$V2) ## 0 hour
d0$V5=d0$V3/sum(d0$V3) ## 6 hour

## define column names
colnames(d0)=c("Subloc","WT_zero","WT_six","WT_zero_ratio","WT_six_ratio")

##　get ratio labels
label0 <- paste(d0$Subloc, "(", round(d0$WT_zero_ratio*100,2), "% )")
label0
label6 <- paste(d0$Subloc, "(", round(d0$WT_six_ratio*100,2), "% )")
label6

## plot Pie charts
pdf(file="F01.0h_exp_pie.pdf")
pie(d0$WT_zero,border="white", col=brewer.pal(6, "Set3"), labels = label0)
dev.off()

pdf(file="F01.6h_exp_pie.pdf")
pie(d0$WT_six,border="white", col=brewer.pal(6, "Set3"), labels = label6)
dev.off()

绘制完成使用Adobe Illustrator修改合并图片：

转录组测序结果R语言分析转录组测序unigene_转录组测序结果R语言分析

从图中可以看出，处理后，叶绿体定位的蛋白累积表达量变多，而细胞质膜中定位的基因表达量降低。以上，至此结束。

参考：
The PCY-SAG14 phytocyanin module regulated by PIFs and miR408 promotes dark-induced leaf senescence in Arabidopsis

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：docker gpus和runtime的区别 docker run —gpus

下一篇：stun java开源客户端 java开源库

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯