SRA toolkit

Original post by emanlee, 2023-11-02 10:42:43

© Copyright belongs to the author. This is an original work by emanlee on the 51CTO blog; contact the author for permission before reposting.

Fetching SRA data with SRAdbV2

Install the SRAdbV2 package
install.packages('BiocManager')
BiocManager::install('seandavi/SRAdbV2')

To use SRAdbV2, first create an instance of the R6 class Omicidx

library(SRAdbV2)
oidx = Omicidx$new()

Once the Omicidx instance exists, use oidx$search() to query the data

query=paste(
  paste0('sample_taxon_id:', 10116),
  'AND experiment_library_strategy:"rna seq"',
  'AND experiment_library_source:transcriptomic',
  'AND experiment_platform:illumina')
z = oidx$search(q=query,entity='full',size=100L)

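Here, the entity parameter is the SRA entity type to retrieve through the API, and the size parameter is the number of records returned in each chunk.

Because a result set can be very large, we use a Scroller to work through the results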
s = z$scroll()
s
s$count

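s$count gives a quick look at how many records the query matched

If the API host cannot be reached, the call fails with an error like this:

Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: api-omicidx.cancerdatasci.org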

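1.1 Scroller provides two ways to access the data
The first loads the query results into R memory in one go (here capped at 1000 records), but this can be slow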
res = s$collate(limit = 1000)
head(res)
Then use reset() to reset the Scroller

s$reset()
s

The second uses the yield method to fetch the data iteratively

j = 0
## fetch only 500 records; `yield` returns NULL
## after ALL records have been fetched, so also
## stop once has_next() reports nothing is left
while(s$has_next() && s$fetched < 500) {
    res = s$yield()
    # do something interesting with `res` here if you like
    j = j + 1
    message(sprintf('total of %d fetched records, loop iteration # %d', s$fetched, j))
}

If the full result set has not been fetched yet, the Scroller's has_next() method returns TRUE
Use reset() to move the cursor back to the start of the result set

2. Query syntax
See:
https://bioconductor.github.io/BiocWorkshops/public-data-resources-and-bioconductor.html#query-syntax
3. Using the raw API without R/Bioconductor
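The data can also be retrieved through the raw API, without R/Bioconductor
SRAdbV2 is a wrapper around a web API, so the same data can be reached through that web API directly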

sra_browse_API()

The web-based API provides a useful interface for experimenting with queries. For the URL of its JSON (Swagger) description, use

sra_get_swagger_json_url()

===========================================

Installing the SRA Toolkit

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

wget --output-document sratoolkit.tar.gz http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz

tar -vxzf sratoolkit.tar.gz

# adjust the directory name to the version that was actually extracted
export PATH=$PATH:$PWD/sratoolkit.2.10.9-centos_linux64/bin
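The export line above hard-codes a version number, while the "current" tarball may unpack into a directory with a different version. A minimal sketch of one way around that, assuming the tarball unpacks into a directory named sratoolkit.<version>-<platform> that contains bin/. The helper name find_sratoolkit_bin is hypothetical and not part of the toolkit:

```shell
# Sketch: locate the bin/ directory of an extracted SRA Toolkit tarball
# without hard-coding the version in the directory name.
# find_sratoolkit_bin is a hypothetical helper, not part of the toolkit.
find_sratoolkit_bin() {
    local base="${1:-$PWD}"
    local dir
    # Assumption: the tarball unpacks into sratoolkit.<version>-<platform>/
    # The trailing slash in the glob restricts matches to directories,
    # so the downloaded sratoolkit.tar.gz itself is ignored.
    for dir in "$base"/sratoolkit.*/; do
        if [ -d "${dir}bin" ]; then
            printf '%s\n' "${dir}bin"
            return 0
        fi
    done
    echo "no sratoolkit.* directory with a bin/ subdirectory under $base" >&2
    return 1
}

# Usage, after extracting the tarball in the current directory:
#   export PATH="$PATH:$(find_sratoolkit_bin "$PWD")"
```

If several versions are extracted side by side, this returns the first match in glob order, so remove old directories or pass a more specific base directory.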

===========================================

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

Tool: prefetch

Usage:

prefetch [options] <path/SRA file | path/kart file> [<path/file> ...]

prefetch [options] <SRA accession>

prefetch [options] --list <kart_file>

Frequently Used Options:

General:
  -h | --help                  Displays ALL options, general usage, and version information.
  -V | --version               Display the version of the program.
Data transfer:
  -f | --force <value>         Force object download. One of: no, yes, all. no [default]: Skip download if the object is found and complete; yes: Download it even if it is found and is complete; all: Ignore lock files (stale locks or if it is currently being downloaded: use at your own risk!).
       --transport <value>     Value one of: ascp (only), http (only), both (first try ascp, fallback to http). Default: both.
  -l | --list                  List the contents of a kart file.
  -s | --list-sizes            List the content of kart file with target file sizes.
  -N | --min-size <size>       Minimum file size to download in KB (inclusive).
  -X | --max-size <size>       Maximum file size to download in KB (exclusive). Default: 20G.
  -o | --order <value>         Kart prefetch order. One of: kart (in kart order), size (by file size: smallest first). Default: size.
  -a | --ascp-path <ascp-binary|private-key-file>    Path to ascp program and private key file (asperaweb_id_dsa.openssh).
  -p | --progress <value>      Time period in minutes to display download progress (0: no progress). Default: 1.
       --option-file <file>    Read more options and parameters from the file.

Use examples:

prefetch cart_0.krt
 
prefetch -l cart_0.krt
 
prefetch -X 200G cart_0.krt
 
prefetch -o kart cart_0.krt
 
prefetch -a "/opt/aspera/bin/ascp|/opt/aspera/etc/asperaweb_id_dsa.openssh" SRR390728
 
prefetch -t ascp -a "/opt/aspera/bin/ascp|/opt/aspera/bin/asperaweb_id_dsa.openssh" --option-file file.txt
 
prefetch ~/Downloads/SRR390728.sra
 
prefetch -c SRR390728
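
Before starting a large download it can help to check which commands would actually run. A minimal sketch of a dry-run wrapper that prints the prefetch command for each argument instead of executing it. The helper name prefetch_dry_run is hypothetical, and the accession pattern (SRR, ERR or DRR followed by digits) is an assumption that covers run accessions only, not kart files or local paths:

```shell
# Sketch: print the prefetch command for each run accession instead of running it.
# prefetch_dry_run is a hypothetical helper, not part of the SRA Toolkit.
prefetch_dry_run() {
    local acc
    local status=0
    for acc in "$@"; do
        # Assumption: a run accession is SRR/ERR/DRR followed by digits only.
        if printf '%s\n' "$acc" | grep -Eq '^[SED]RR[0-9]+$'; then
            echo "prefetch $acc"
        else
            echo "skip: '$acc' does not look like a run accession" >&2
            status=1
        fi
    done
    # Non-zero exit status if any argument was skipped.
    return "$status"
}

# Usage:
#   prefetch_dry_run SRR390728
```

Once the printed commands look right, the same accessions can be passed to prefetch itself.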

===========================================

A non-R solution is to use the SRA toolkit prefetch command on a list of SRA identifiers.

First you need the list of accessions, which you can download in one batch. In your case, go to https://www.ncbi.nlm.nih.gov/sra?term=SRP026197 and, at the top right, click "Send to", then choose "File" and "Accession List".

Once you have it saved in a file (default is SraAccList.txt) you can use the command (tested in SRA toolkit 2.9.0):