Approaches to metagenomic data analysis

Reposted

mob64ca1411e411 2024-09-10 19:56:40


Preface

GitHub repository for the mobileOG-db database:
https://github.com/clb21565/mobileOG-db

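The paper was published in August 2022 under the title "mobileOG-db: a Manually Curated Database of Protein Families Mediating the Life Cycle of Bacterial Mobile Genetic Elements". The database is better suited to annotating long sequences, so contigs are the more appropriate input. I also compared it with the mobile genetic element database from 2018, and this one is indeed much richer.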

DOI: https://doi.org/10.1128/aem.00991-22
Online version of the paper:
https://journals.asm.org/doi/10.1128/aem.00991-22
Some useful database backups:
https://tehub.org/resources/repeat_databases

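Using the database and tools

Database address:
https://mobileogdb.flsi.cloud.vt.edu/ (use the download item in the menu bar to get the full database, a little over 200 MB)
It contains more than 800,000 genes, far more than the 2018 MobileGeneticElementDatabase.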

Installing the dedicated annotation tool

First create a conda virtual environment, then install the dependencies:

conda create -n mobileOG-db python=3.6.15
conda activate mobileOG-db
conda install -c conda-forge biopython
conda install -c bioconda prodigal
conda install -c bioconda diamond
conda install -c anaconda pandas

Building the index with diamond

diamond makedb --in ./meta.mini.db/mobileOG-db/mobileOG-db_beatrix-1.6.All.faa -d ./meta.mini.db/mobileOG-db/mobileOG-db-beatrix-1.X.dmnd

Add execute permission (this tool was not used in the end)

chmod +x meta.mini.db/mobileOG-db/mobileOG-db-main/mobileOG-pl/mobileOGs-pl-kyanite.sh

PATH=$PATH:./meta.mini.db/mobileOG-db/mobileOG-db-main/mobileOG-pl/

Run the sequence alignment directly with diamond:

mkdir -p ./data/mobileOG/
diamond blastp -q ./data/allgenecalled.faa --db meta.mini.db/mobileOG-db/mobileOG-db-beatrix-1.X.dmnd --outfmt 6 stitle qtitle pident bitscore slen evalue qlen sstart send qstart qend -k 15 -o ./data/mobileOG/mobileOG.tsv -e 1e-20 --query-cover 90 --id 90

#-- Tidy the results with the Python script written by the database authors
python ./meta.mini.db/mobileOG-db/mobileOG-db-main/mobileOG-pl/mobileOGs-pl-kyanite.py --o ./data/mobileOG/mobileOG --i ./data/mobileOG/mobileOG.tsv -m ./meta.mini.db/mobileOG-db/mobileOG-db-beatrix-1.6-All.csv
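
To make the custom tabular output easier to inspect, here is a minimal Python sketch that names the eleven columns in the order given to `--outfmt 6` above and keeps the highest-scoring hit per query (the command allows up to 15 hits per query with `-k 15`). It is an illustration only and not the logic of mobileOGs-pl-kyanite.py; the helper names `read_hits` and `best_hit_per_query` and the example rows are made up.

```python
import io

# Column order copied from the `--outfmt 6 ...` list in the diamond command.
COLUMNS = ["stitle", "qtitle", "pident", "bitscore", "slen", "evalue",
           "qlen", "sstart", "send", "qstart", "qend"]
NUMERIC = ("pident", "bitscore", "evalue")


def read_hits(lines):
    """Yield one dict per alignment row, keyed by the column names above."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, fields))
        for name in NUMERIC:
            hit[name] = float(hit[name])
        yield hit


def best_hit_per_query(hits):
    """Keep, for each query title, the hit with the highest bit score."""
    best = {}
    for hit in hits:
        current = best.get(hit["qtitle"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qtitle"]] = hit
    return best


# Made-up rows; real input would be ./data/mobileOG/mobileOG.tsv.
EXAMPLE = (
    "subjectA|x\tgene_1\t95.0\t210.5\t300\t1e-60\t298\t1\t295\t2\t297\n"
    "subjectB|y\tgene_1\t92.0\t180.0\t310\t1e-50\t298\t5\t300\t1\t296\n"
    "subjectC|z\tgene_2\t99.0\t400.0\t500\t1e-120\t499\t1\t499\t1\t499\n"
)
best = best_hit_per_query(read_hits(io.StringIO(EXAMPLE)))
for query, hit in sorted(best.items()):
    print(query, hit["stitle"], hit["bitscore"])
```

For a real run, pass an open file handle on the diamond output instead of the `io.StringIO` example.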

Modifying the annotation commands to better fit downstream processing

Some changes are needed here, because the tool provided by the authors does not match my analysis habits.

time diamond blastp --db meta.mini.db/mobileOG-db/mobileOG-db-beatrix-1.X.dmnd --query ./data/allgenecalled.faa --outfmt 6 --threads 10 --max-target-seqs 1 --quiet -e 1e-5 --sensitive --out data/mobileOG/gene_diamond.f6

# Extract the gene family matched by each gene
cut -f 1,2 data/mobileOG/gene_diamond.f6 | uniq | \
      sed '1 i Name\tResGeneID' > data/mobileOG/gene_mobileOG.list
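
The `cut | uniq | sed` step above can also be sketched in Python. This is an equivalent under one stated difference: `uniq` only collapses adjacent duplicate lines, while the sketch drops repeated gene/hit pairs wherever they occur. The function name `gene_to_hit_table` and the example rows are hypothetical; the input is assumed to be diamond's default 12-column `--outfmt 6`, where column 1 is the query ID and column 2 the subject ID.

```python
import io


def gene_to_hit_table(f6_lines, out_handle):
    """Write a Name/ResGeneID table from outfmt 6 rows; return the pair count."""
    out_handle.write("Name\tResGeneID\n")  # same header the sed call inserts
    seen = set()
    for line in f6_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pair = (fields[0], fields[1])  # query ID, subject ID
        if pair in seen:
            continue
        seen.add(pair)
        out_handle.write(pair[0] + "\t" + pair[1] + "\n")
    return len(seen)


# Made-up rows; real input would be data/mobileOG/gene_diamond.f6.
EXAMPLE = (
    "gene_1\tsubjA|a\t98.1\t200\t3\t0\t1\t200\t1\t200\t1e-80\t390\n"
    "gene_2\tsubjB|b\t91.0\t150\t13\t0\t1\t150\t4\t153\t1e-40\t250\n"
    "gene_1\tsubjA|a\t98.1\t200\t3\t0\t1\t200\t1\t200\t1e-80\t390\n"
)
out = io.StringIO()
n_pairs = gene_to_hit_table(io.StringIO(EXAMPLE), out)
print(out.getvalue(), end="")
```

The column name `ResGeneID` is kept only because the R code that follows reads it under that name.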

Tidying the results in R and building a phyloseq object

#---mobileOGdb#------
library(tidyverse)
dat = read.delim("./data/mobileOG/gene_mobileOG.list")
dim(dat)
dat$ID = sapply(strsplit(dat$ResGeneID, "[|]"), `[`, 1)
head(dat)

#- Import the database annotation table
dat$ResGeneID = sapply(strsplit(dat$ResGeneID , "_1"), `[`, 1)
db1 = read.csv("./meta.mini.db/mobileOG-db/mobileOG-db-beatrix-1.6-All.csv",
               header = TRUE)
head(db1)
colnames(db1)[4] = "name.gene"
db1$Amino.Acid.Sequence = NULL

dat2 = dat %>% left_join(db1,by = c("ID" ="mobileOG.Entry.Name"))
head(dat2)
dim(dat2)
arpath = "./resulPlot.meta/mobileOG/"
dir.create(arpath, recursive = TRUE, showWarnings = FALSE)
write.csv(dat2,paste(arpath,"/mobileOG_all_info.csv",sep = ""),quote = F)
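
The ID extraction and left join in the R code above follow a simple pattern, sketched here in Python for readers who want to check it on a few rows. The subject IDs, the `category` column, and the function names are invented for illustration; only the rule "take the text before the first `|` and look it up in the annotation table" comes from the R code. Unlike dplyr's `left_join`, which fills unmatched rows with `NA`, this sketch simply leaves the annotation keys out.

```python
def entry_id(subject_id):
    """Return the text before the first '|', as the strsplit call does."""
    return subject_id.split("|", 1)[0]


def left_join(hits, annotation):
    """Attach annotation columns to every hit; unmatched hits are still kept."""
    joined = []
    for hit in hits:
        row = dict(hit)
        row["ID"] = entry_id(hit["ResGeneID"])
        row.update(annotation.get(row["ID"], {}))
        joined.append(row)
    return joined


# Hypothetical hits and annotation rows (not real mobileOG-db entries).
hits = [
    {"Name": "gene_1", "ResGeneID": "entry_0001|fieldB|fieldC"},
    {"Name": "gene_2", "ResGeneID": "entry_9999|fieldB"},
]
annotation = {"entry_0001": {"category": "example"}}

joined = left_join(hits, annotation)
for row in joined:
    print(row)
```

Counting the rows that end up without annotation columns is a quick way to spot IDs that failed to match the metadata CSV.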

library(data.table)
count = fread("./data/salmon/gene.count",header = T);count
colnames(count)[-1] =gsub("A","",colnames(count)[-1])
colnames(count)[-1] = paste("A",colnames(count)[-1],sep = "")

library(tidyfst)
id = dat2$Name %>% unique()
otu = count %>% filter_dt(Name %in% id) %>% as.data.frame() %>% column_to_rownames("Name")
head(otu)

write.csv(otu,paste(arpath,"/mobileOG_count_tab.csv",sep = ""),quote = F)
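
The sample renaming and gene filtering above can be checked on a toy table with the following Python sketch. It assumes, as in the R code, that the first column of the count table is `Name`; the helper names and the miniature table are made up.

```python
def normalize_sample(name):
    """Mirror the gsub/paste pair: drop every 'A', then prefix a single 'A'."""
    return "A" + name.replace("A", "")


def filter_counts(header, rows, wanted_names):
    """Return (sample_names, {gene: counts}) restricted to the wanted genes."""
    samples = [normalize_sample(s) for s in header[1:]]
    wanted = set(wanted_names)
    table = {row[0]: row[1:] for row in rows if row[0] in wanted}
    return samples, table


# Made-up miniature of ./data/salmon/gene.count.
HEADER = ["Name", "A1", "2", "A3"]
ROWS = [
    ["gene_1", 10, 0, 4],
    ["gene_2", 3, 7, 1],
    ["gene_3", 0, 0, 9],
]
samples, otu = filter_counts(HEADER, ROWS, ["gene_1", "gene_3", "gene_1"])
print(samples)
for gene in sorted(otu):
    print(gene, otu[gene])
```

One caveat the sketch makes visible: because every `A` is removed before the prefix is added, a sample name such as `BA1` would become `AB1`, so this renaming is only safe when `A` appears solely as the leading prefix.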

head(dat2)
dat2$Name %>% 
  # unique() %>% 
  length()

tax = dat2%>% distinct(Name,.keep_all = TRUE)  %>%
  column_to_rownames("Name")
library(phyloseq)
ps = phyloseq(
  otu_table(as.matrix(otu),taxa_are_rows = TRUE),
  tax_table(as.matrix(tax))
)

map = read.delim("./data/metaGroup.txt")
head(map)
row.names(map) = map$ID
sample_data(ps) = map
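ps
saveRDS(ps,"./data/ps_mobileOG.rds")

About the Rhizosphere Interaction Biology Lab
The Rhizosphere Interaction Biology Lab is a research group focused on rhizosphere interactions, part of Academician Shen Qirong's soil microbiology and organic fertilizer team. Led by Associate Professor Yuan Jun, the group focuses on: 1. the role of plant-microbe interactions in disease resistance; 2. integrative research on environmental microbial big data; 3. development and application of environmental metabolomics and of research frameworks linking it to microbial processes. Over the past three years the team has published multiple papers in journals including Microbiome, ISME J, Fundamental Research, iMeta, PCE, SBB, Horticulture Research, SEL, and BMC Plant Biology. Follow the 微生信生物 WeChat public account to learn more about the group.
Written by: Wen Tao
Revised by: Wen Tao

Reviewed by: Yuan Jun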