Start Docker

sudo service docker restart

Pull a base image and create a new container
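The exact commands weren't recorded here; as a minimal sketch, assuming an Ubuntu base image and a throwaway container name (both are assumptions, not what was necessarily used), this step might look like:

docker pull ubuntu:22.04  # base image is an assumption; use whichever image you actually started from
docker run -it --name biobakery ubuntu:22.04 bash  # the name 'biobakery' is hypothetical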

Install biobakery_workflows

Installed with pip3 via the Aliyun mirror. This installs only the core biobakery_workflows package and its dependencies; it does not include tools such as KneadData, MetaPhlAn, and HUMAnN, which have to be downloaded and installed separately.

pip3 install biobakery_workflows -i https://mirrors.aliyun.com/pypi/simple/

Install KneadData

Do not install it with pip; that installation is incomplete.
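The install command actually used isn't recorded here; a common route, assuming the conda/miniforge setup this container uses, is the bioBakery conda channel (verify against the current KneadData docs):

conda install -c biobakery kneaddata  # also pulls in dependencies such as Trimmomatic, Bowtie2, and TRF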

Commit the container to an image

docker commit e94ffbdc247b 1010:1.0  # on the local machine

Export the image as a tar archive

docker save -o 1010.tar 1010:1.0

Load the image archive on the server cluster

docker load --input 1010.tar

On the server cluster without internet access, import the image and start a container from it

docker run -it --storage-opt size=50G -v /public/home/others/renxiaotong/database:/database e4ea5a0c8f8e bash

Here I mounted a host directory into the container; it holds the databases we need, and the files are quite large. Docker's bind mounts work very well for this.

Colorize the shell prompt:

export PS1="\[\033[32;1m\]\h:\u \[\033[31;1m\]\t\[\033[0m\] \[\033[33;1m\]\w\[\033[0m\]\n\[\e[35;1m\]$ \[\e[0m\]"  # non-printing escapes must be wrapped in \[ \] or bash miscalculates the prompt width

Override the KneadData database environment variables (the path should go all the way down to the index prefix)

export KNEADDATA_DB_HUMAN_GENOME=/database/KNEADDATA_DB_HUMAN_GENOME/
export KNEADDATA_DB_HUMAN_TRANSCRIPTOME=/database/KNEADDATA_DB_HUMAN_TRANSCRIPTOME/
export KNEADDATA_DB_RIBOSOMAL_RNA=/database/KNEADDATA_DB_Gmax/Gmax_ZH13_RNA_index

When the time comes, I will first use KneadData to remove the soybean sequences, then run the workflow:

export KNEADDATA_DB_HUMAN_GENOME=/database/Gmax_ZH13/Gmax_ZH13_genome_index
export KNEADDATA_DB_HUMAN_TRANSCRIPTOME=/database/Gmax_ZH13/Gmax_ZH13_RNA_index/
export KNEADDATA_DB_RIBOSOMAL_RNA=/database/KNEADDATA_DB_Gmax/Gmax_ZH13_RNA_index
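The KneadData reference databases are Bowtie2 indexes, which is why the variable should end in the index prefix rather than just a directory. As a sketch of how such an index is built (the FASTA file name here is an assumption):

bowtie2-build --threads 16 Gmax_ZH13_genome.fa /database/KNEADDATA_DB_Gmax/Gmax_ZH13_genome_index/Gmax_ZH13  # the last argument is the prefix the variables above point to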

StrainPhlAn uses the MetaPhlAn database.
The MetaPhlAn database lives in /root/miniforge3/lib/python3.10/site-packages/metaphlan/metaphlan_databases/; the version used is mpa_vOct22_CHOCOPhlAnSGB_202212.
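If the database ever needs to be installed into a fresh location, MetaPhlAn 4 can download it itself; a sketch, to be run on a machine with network access before the image is committed (the target directory is just an example):

metaphlan --install --index mpa_vOct22_CHOCOPhlAnSGB_202212 --bowtie2db /database/metaphlan_db  # --bowtie2db controls where the database is stored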

HUMAnN uses database version v201901_v31.
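The download step isn't shown in these notes; HUMAnN ships a downloader, so on a machine with network access it would look roughly like this (the uniref build is an assumption, check humann_databases --available):

humann_databases --download chocophlan full /database/HUMANN_DB_CHOCOPHLAN
humann_databases --download uniref uniref90_diamond /database/HUMANN_DB_UNIREF
humann_databases --download utility_mapping full /database/HUMANN_DB_MAPPING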

humann_config --update database_folders nucleotide /database/HUMANN_DB_CHOCOPHLAN
humann_config --update database_folders protein /database/HUMANN_DB_UNIREF/
humann_config --update database_folders utility_mapping /database/HUMANN_DB_MAPPING/
humann_config --update run_modes threads 90
#humann --input-format fastq.gz --taxonomic-profile --bypass-prescreen --memory-use maximum -r
#humann --input /database/input/rawdata_gene/output/kneaddata/main/G_12h_3.fastq --output /database/input/rawdata_gene/output/humann/main --o-log /database/input/rawdata_gene/output/humann/main/G_12h_3.log --threads 1 --taxonomic-profile /database/input/rawdata_gene/output/metaphlan/main/G_12h_3_taxonomic_profile.tsv

Previously I used this one:

#humann --input /database/input/rawdata_trans/output/kneaddata/main/T_0h_1.fastq.gz --input-format fastq.gz --output /database/input/rawdata_trans/output/humann/main/ --o-log /database/input/rawdata_trans/output/humann/main/T_0h_1.log --threads 90 --taxonomic-profile /database/input/rawdata_gene/output/metaphlan/main/G_0h_1_taxonomic_profile.tsv --bypass-prescreen --memory-use maximum -r

The docker system df command, analogous to df on Linux, shows Docker's disk usage.
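For a per-object breakdown:

docker system df -v  # verbose mode lists usage for each image, container, and volume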

Commands for the joint metagenome & metatranscriptome analysis. Note: KneadData needs to run with more memory, otherwise it keeps crashing. Try each of the variants below and see which one runs best.

biobakery_workflows wmgx_wmtx --input-metagenome ./rawdata_gene/ --input-metatranscriptome ./rawdata_trans/ --input-mapping ./mapping.tsv --output ./output/ --bypass-strain-profiling --threads 40 --local-jobs 6 --remove-intermediate-output --qc-options="-t 40" --qc-options="-p 6" --qc-options="--max-memory 10000m" --quit-early
kneaddata --input1 /database/dome/dome_wts/wts_2.R1.fastq.gz --output /database/dome/demooutput/whole_metatranscriptome_shotgun/kneaddata/main --threads 90 --output-prefix wts_2  --input2 /database/dome/dome_wts/wts_2.R2.fastq.gz --cat-final-output  --reference-db /database/KNEADDATA_DB_HUMAN_GENOME/ --reference-db /database/KNEADDATA_DB_HUMAN_TRANSCRIPTOME/ --reference-db /database/KNEADDATA_DB_Gmax/Gmax_ZH13_RNA_index  --serial --run-trf  && gzip -f /database/dome/demooutput/whole_metatranscriptome_shotgun/kneaddata/main/wts_2.fastq
kneaddata --input1 /database/dome/dome_wms/wms_1.R1.fastq.gz --output /database/dome/demooutput/whole_metagenome_shotgun/kneaddata/main --threads 90 --output-prefix wms_1  --input2 /database/dome/dome_wms/wms_1.R2.fastq.gz --cat-final-output  --reference-db /database/KNEADDATA_DB_HUMAN_GENOME/  --serial --run-trf  && gzip -f /database/dome/demooutput/whole_metagenome_shotgun/kneaddata/main/wms_1.fastq
kneaddata -i1 /database/input/rawdata_gene/G_0h_1.R1.fastq.gz -i2 /database/input/rawdata_gene/G_0h_1.R2.fastq.gz -o /database/input/output/Gmcleaned/ --output-prefix G_0h_1 -t 60 -p 20 --cat-final-output  --reference-db /database/KNEADDATA_DB_Gmax/Gmax_ZH13_genome_index/Gmax_ZH13 --max-memory 1000m
kneaddata -i1 /database/input/rawdata_gene/G_0h_2.R1.fastq.gz -i2 /database/input/rawdata_gene/G_0h_2.R2.fastq.gz -o /database/input/output/Gmcleaned/ --output-prefix G_0h_2 -t 60 -p 20 --cat-final-output  --reference-db /database/KNEADDATA_DB_Gmax/Gmax_ZH13_genome_index/Gmax_ZH13 --reference-db /database/KNEADDATA_DB_HUMAN_GENOME/hg37dec_v0.1 --serial --max-memory 1000m

KneadData for the metagenome

kneaddata --input1 /database/input/rawdata_gene/G_0h_1.R1.fastq.gz --output /database/input/rawdata_gene/output/kneaddata/main --threads 90 -p 30 --output-prefix G_0h_1 --input2 /database/input/rawdata_gene/G_0h_1.R2.fastq.gz --cat-final-output  --reference-db /database/KNEADDATA_DB_Gmax/Gmax_ZH13_genome_index/ --serial --run-trf  && gzip -f /database/input/rawdata_gene/output/kneaddata/main/G_0h_1.fastq

KneadData for the metatranscriptome; of the command lines below, use whichever one runs best

kneaddata --input1 /database/input/rawdata_trans/T_0h_1.R1.fastq.gz --output /database/input/rawdata_trans/output/kneaddata/main --threads 90 -p 30 --output-prefix T_0h_1 --input2 /database/input/rawdata_trans/T_0h_1.R2.fastq.gz --cat-final-output  --reference-db /database/KNEADDATA_DB_Gmax/Gmax_ZH13_RNA_index/ --reference-db /database/KNEADDATA_DB_Gmax/RIBOSOMAL_RNA --serial --run-trf  && gzip -f /database/input/rawdata_trans/output/kneaddata/main/T_0h_1.fastq
kneaddata --input1 /database/input/rawdata_trans/T_0h_1.R1.fastq.gz --output /database/input/rawdata_trans/output/kneaddata/main --threads 90 --output-prefix T_0h_1 --input2 /database/input/rawdata_trans/T_0h_1.R2.fastq.gz --cat-final-output --reference-db /database/KNEADDATA_DB_Gmax/RIBOSOMAL_RNA --run-trf  && gzip -f /database/input/rawdata_trans/output/kneaddata/main/T_0h_1.fastq  # this is what I used in the end; it only removes rRNA
kneaddata --input1 /database/input/rawdata_gene/G_0h_1.R1.fastq.gz --input2 /database/input/rawdata_gene/G_0h_1.R2.fastq.gz --output /database/input/output/Gmcleaned --reference-db /database/KNEADDATA_DB_Gmax/Gmax_ZH13_genome_index/Gmax_ZH13 --threads 90 -p 40 --output-prefix G_0h_1 --cat-final-output --max-memory 5000m  && gzip -f /database/input/output/Gmcleaned/G_0h_1.fastq 

Sample KneadData log output (paired-end input, so most steps are logged once per read file):

Decompressing gzipped file ...
Decompressing gzipped file ...
Reformatting file sequence identifiers ...
Reformatting file sequence identifiers ...
Reordering read identifiers ...
Reordering read identifiers ...
Initial number of reads ( /database/input/output/Gmcleaned/reordered_a2yncb53_reformatted_identifiersn615v0vs_decompressed_37iyp71a_G_0h_2 ): 24742271.0
Initial number of reads ( /database/input/output/Gmcleaned/reordered_jp8n6lfk_reformatted_identifiersnsohejzg_decompressed_99bgd4vh_G_0h_2 ): 24742271.0
Running fastqc ...
Running Trimmomatic ...
Total reads after trimming ( /database/input/output/Gmcleaned/G_0h_2.trimmed.1.fastq ): 23198926.0
Total reads after trimming ( /database/input/output/Gmcleaned/G_0h_2.trimmed.2.fastq ): 23198926.0
Total reads after trimming ( /database/input/output/Gmcleaned/G_0h_2.trimmed.single.1.fastq ): 727347.0
Total reads after trimming ( /database/input/output/Gmcleaned/G_0h_2.trimmed.single.2.fastq ): 537663.0
Running trf ...
Running trf ...

MetaPhlAn taxonomic profiling on the cleaned reads:

metaphlan /database/dome/demooutput/whole_metagenome_shotgun/kneaddata/main/wms_2.fastq.gz --input_type fastq --output_file /database/dome/demooutput/whole_metagenome_shotgun/metaphlan/main/wms_2_taxonomic_profile.tsv --samout /database/dome/demooutput/whole_metagenome_shotgun/metaphlan/main/wms_2_bowtie2.sam --nproc 90 --no_map --tmp_dir /database/dome/demooutput/whole_metagenome_shotgun/metaphlan/main

HUMAnN2 takes meta-omic sequences as input, i.e., reads from either the DNA or the RNA level. Step 1 uses marker genes to identify known species; step 2 performs a tiered search against the pangenomes of those known species; step 3 aligns reads from unknown species against a protein database via translated search; finally it computes gene-family and pathway abundances, at both the community and the per-species level.
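Those tiers map onto HUMAnN's command-line switches, which helps when debugging one stage at a time; a sketch with a hypothetical sample name:

humann --input sample.fastq --output out/ --taxonomic-profile sample_profile.tsv --bypass-translated-search  # stop after the nucleotide (pangenome) tier
humann --input sample.fastq --output out/ --bypass-nucleotide-search  # run only the translated (protein) tier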

For metatranscriptomics, HUMAnN2 is recommended for overall functional assessment, and PanPhlAn for strain-specific transcriptomic analysis.

2.2 HUMAnN computes taxonomic and functional composition

  • Taxonomic composition is obtained by calling MetaPhlAn4
  • Input files: temp/concat/*.fq, the quality-controlled, merged paired-end FASTQ for each sample
  • Output files, under the temp/humann3/ directory:
    • C1_pathabundance.tsv
    • C1_pathcoverage.tsv
    • C1_genefamilies.tsv
  • Merged outputs (see the merge commands sketched after this list):
    • result/metaphlan4/taxonomy.tsv: species abundance table
    • result/metaphlan4/taxonomy.spf: species abundance table (for STAMP analysis)
    • result/humann3/pathabundance_relab_unstratified.tsv: pathway abundance table
    • result/humann3/pathabundance_relab_stratified.tsv: pathway abundance broken down by contributing species
  • stratified (each taxon's contribution to a pathway) vs. unstratified (overall functional composition)
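A sketch of how merged tables like these are typically produced with HUMAnN's bundled utilities (paths follow the directory layout above):

humann_join_tables --input temp/humann3/ --output result/humann3/pathabundance.tsv --file_name pathabundance
humann_renorm_table --input result/humann3/pathabundance.tsv --output result/humann3/pathabundance_relab.tsv --units relab
humann_split_stratified_table --input result/humann3/pathabundance_relab.tsv --output result/humann3/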

View the help for the joint metagenome and metatranscriptome workflow
wmgx_wmtx.py -h

merge_metaphlan_tables.py lives here:

whereis merge_metaphlan_tables.py
merge_metaphlan_tables: /root/miniforge3/bin/merge_metaphlan_tables.py
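
Typical usage, merging the per-sample profiles produced earlier into a single table (the output file name is arbitrary):

merge_metaphlan_tables.py /database/input/rawdata_gene/output/metaphlan/main/*_taxonomic_profile.tsv > merged_abundance_table.tsv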