Download sequencing data and metadata from GSA, SRA, ENA, and DDBJ databases.


BioOmics/iSeq


iSeq: An integrated tool to fetch public Sequencing data

Cite us: Haoyu Chao, Zhuojin Li, Dijun Chen, Ming Chen, iSeq: An integrated tool to fetch public sequencing data, Bioinformatics, 2024, btae641, https://doi.org/10.1093/bioinformatics/btae641 [PMID: 39447029]

Description

iSeq is a Bash script that allows you to download sequencing data and metadata from the GSA, SRA, ENA, and DDBJ databases. See Detail Pipeline for iSeq. Here is the basic pipeline of iSeq:

[Figure: iSeq pipeline]

Important

To use iSeq, your system must be connected to the network and support the FTP, HTTP, and HTTPS protocols.

Update Notes:

2025.03.14

  • Fixed the issue mentioned in #26. The cause was that the data was paired-end but had only one link, such as SRR23680070.

2025.03.11

  • The input file can contain accessions from different databases.
  • -p and -a can be used simultaneously, with -a taking priority.
  • Fixed some bugs when retrying data downloads from GSA.

2024.12.26

  • Fixed the bugs mentioned in #16, #17 (2024.12.16) and #19 (2024.12.26).

2024.11.21

  • Dependency update for aspera-cli: The version requirement for aspera-cli has been updated from aspera-cli to aspera-cli=4.14.0.

2024.10.23

  • New -s, --speed option to set the download speed limit (MB/s) (default: 1000 MB/s), for example: iseq -i SRR7706354 -s 10
  • Dependency update for sra-tools: The version requirement for sra-tools has been updated from sra-tools=2.11 to sra-tools>=2.11.0.

2024.09.14

  • New -e option for merging FASTQ files: Added a -e option to merge multiple FASTQ files into a single file for each Experiment (-e ex), Sample (-e sa), or Study (-e st).

  • New -i option for input: iSeq can now accept a file containing multiple accession numbers as input via -i fileName.

  • API change for GSA metadata download: The API endpoint has been updated from getRunInfo to getRunInfoByCra for downloading GSA metadata.

  • Save results to a personal directory: The output can now be saved to a directory of the user's choice with the -o option.

  • Updated regex for SAMC matching: The matching pattern for SAMC has been changed from SAMC[A-Z]?[0-9]+ to SAMC[0-9]+.

  • Fixed some bugs.

Features

  • Multiple Database Support: Supports multiple bioinformatics databases (GSA/SRA/ENA/DDBJ/GEO).
  • Multiple Input Formats: Supports multiple accession types (Project, Study, Sample, Experiment, or Run accession).
  • Metadata Download: Supports downloading sample metadata for each accession.
  • File Format Selection: Users can choose to directly download gzip-formatted FASTQ files or download SRA files and convert them to FASTQ format.
  • Multi-threading Support: Supports the use of multi-threading to accelerate the conversion of SRA to FASTQ files or the compression of FASTQ files.
  • File Merging: For experiment-level accession, the script can merge multiple FASTQ files into one.
  • Parallel Download: Supports parallel download connections; the number of connections can be specified to speed up downloads.
  • Support for Aspera High-speed Download: For GSA/ENA databases, the script supports high-speed data transfer using Aspera.
  • Automatic Retry Mechanism: If a download or verification fails, the script will automatically retry until a set number of attempts have been reached.
  • Automated File Verification: After the download is complete, the script will automatically verify the integrity of the files, including checking file sizes and MD5 checksums.
  • Error Handling: The script provides error messages and suggestions for solutions when encountering errors.

Installation

1. iSeq can be easily installed with conda

conda install bioconda::iseq

2. The latest version of iSeq can also be installed from source, see INSTALL

# Use the following command to check whether dependent software is installed
iseq --version
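
To see which external tools are on your PATH independently of iSeq, a minimal sketch such as the following can help. It is not part of iSeq: `missing_tools` is a hypothetical helper, and the list passed to it simply collects the tool names this README mentions, so adjust it to your installation.

```shell
# Hypothetical helper, not part of iSeq: print every command name given
# as an argument that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool names taken from this README; an empty result means all were found.
missing_tools wget axel ascp fasterq-dump pigz
```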

Example (See more)

  1. Download all Run sequencing data and metadata associated with an accession.
iseq -i PRJNA211801

  2. Batch download with Aspera (-a), directly fetching gzip-formatted FASTQ files (-g).
iseq -i SRR_Acc_List.txt -a -g

Usage (Chinese tutorial ✨)

$ iseq --help

Usage:
  iseq -i accession [options]

Required option:
  -i, --input     [text|file]   Single accession or a file containing multiple accessions.
                                Note: Only one accession per line in the file

Optional options:
  -m, --metadata                Skip the sequencing data downloads and only fetch the metadata for the accession.
  -g, --gzip                    Download FASTQ files in gzip format directly (*.fastq.gz).
                                Note: if *.fastq.gz files are not available, SRA files will be downloaded and converted to *.fastq.gz files.
  -q, --fastq                   Convert SRA files to FASTQ format.
  -t, --threads   int           The number of threads to use for converting SRA to FASTQ files or compressing FASTQ files (default: 8).
  -e, --merge     [ex|sa|st]    Merge multiple fastq files into one fastq file for each Experiment, Sample or Study.
  -d, --database  [ena|sra]     Specify the database to download SRA sequencing data (default: ena).
                                Note: new SRA files may not be available in the ENA database, even if you specify "ena".
  -p, --parallel  int           Download sequencing data in parallel, the number of connections needs to be specified, such as -p 10.
                                Note: breakpoint continuation cannot be shared between different numbers of connections.
  -a, --aspera                  Use Aspera to download sequencing data, only support GSA/ENA database.
  -s, --speed     int           Download speed limit (MB/s) (default: 1000 MB/s).
  -o, --output    text          The output directory. If not exists, it will be created (default: current directory).
  -h, --help                    Show the help information.
  -v, --version                 Show the script version.

1. -i, --input

Enter the accession you want to download. You can also supply a file containing multiple accessions (only one accession per line).

iseq -i PRJNA211801

First, iSeq retrieves the metadata of the accession, then proceeds to download each Run it contains.
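
Because the input file must hold exactly one accession per line, it can be worth normalising a hand-edited list first. The sketch below is not part of iSeq; `clean_accession_list` is a hypothetical helper that strips Windows line endings, surrounding whitespace, blank lines, and duplicates, and skips lines that carry more than one field.

```shell
# Hypothetical pre-processing step, not part of iSeq: normalise a list file
# so that it holds exactly one unique accession per line (order preserved).
clean_accession_list() {
  tr -d '\r' < "$1" |
    awk 'NF == 1 && !seen[$1]++ { print $1 }
         NF > 1 { print "line " NR " has more than one field, skipped" > "/dev/stderr" }'
}
```

A possible use is `clean_accession_list SRR_Acc_List.txt > cleaned.txt`, followed by `iseq -i cleaned.txt`.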

iSeq currently supports 6 accession formats from the following 5 databases, with supported accession prefixes as follows:

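| Databases | BioProject | Study | BioSample | Sample | Experiment | Run |
|-----------|------------|-------|-----------|--------|------------|-----|
| GSA       | PRJC       | CRA   | SAMC      | \      | CRX        | CRR |
| SRA       | PRJNA      | SRP   | SAMN      | SRS    | SRX        | SRR |
| ENA       | PRJEB      | ERP   | SAME      | ERS    | ERX        | ERR |
| DDBJ      | PRJDB      | DRP   | SAMD      | DRS    | DRX        | DRR |
| GEO       | GSE        | \     | GSM       | \      | \          | \   |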

Additionally, for the two accession formats (GSE/GSM) from the GEO database, iSeq directly retrieves the associated PRJNA/SAMN, then obtains the contained Runs and downloads the sequencing data. In effect, it still downloads the sequencing data from the SRA database.

Here are some examples:

| Accession Type | Prefixes                      | Example                                                   |
|----------------|-------------------------------|-----------------------------------------------------------|
| BioProject     | PRJEB, PRJNA, PRJDB, PRJC, GSE | PRJEB42779, PRJNA480016, PRJDB14838, PRJCA000613, GSE122139 |
| Study          | ERP, DRP, SRP, CRA            | ERP126685, DRP009283, SRP158268, CRA000553                |
| BioSample      | SAMD, SAME, SAMN, SAMC        | SAMD00258402, SAMEA7997453, SAMN06479985, SAMC017083      |
| Sample         | ERS, DRS, SRS, GSM            | ERS5684710, DRS259711, SRS2024210, GSM7417667             |
| Experiment     | ERX, DRX, SRX, CRX            | ERX5050800, DRX406443, SRX4563689, CRX020217              |
| Run            | ERR, DRR, SRR, CRR            | ERR5260405, DRR421224, SRR7706354, CRR311377              |
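
The prefix tables above can be expressed as a simple lookup. The following is a minimal sketch, not iSeq's own code: `classify_accession` is a hypothetical helper that only inspects the prefix, and it files GSM under Sample, following the examples table.

```shell
# Hypothetical helper, not part of iSeq: map an accession to
# "<database> <accession type>" using the prefixes listed above.
classify_accession() {
  case "$1" in
    PRJNA*) echo "SRA BioProject" ;;
    PRJEB*) echo "ENA BioProject" ;;
    PRJDB*) echo "DDBJ BioProject" ;;
    PRJC*)  echo "GSA BioProject" ;;
    GSE*)   echo "GEO BioProject" ;;
    SRP*)   echo "SRA Study" ;;
    ERP*)   echo "ENA Study" ;;
    DRP*)   echo "DDBJ Study" ;;
    CRA*)   echo "GSA Study" ;;
    SAMN*)  echo "SRA BioSample" ;;
    SAME*)  echo "ENA BioSample" ;;
    SAMD*)  echo "DDBJ BioSample" ;;
    SAMC*)  echo "GSA BioSample" ;;
    SRS*)   echo "SRA Sample" ;;
    ERS*)   echo "ENA Sample" ;;
    DRS*)   echo "DDBJ Sample" ;;
    GSM*)   echo "GEO Sample" ;;      # listed under Sample in the examples table
    SRX*)   echo "SRA Experiment" ;;
    ERX*)   echo "ENA Experiment" ;;
    DRX*)   echo "DDBJ Experiment" ;;
    CRX*)   echo "GSA Experiment" ;;
    SRR*)   echo "SRA Run" ;;
    ERR*)   echo "ENA Run" ;;
    DRR*)   echo "DDBJ Run" ;;
    CRR*)   echo "GSA Run" ;;
    *)      return 1 ;;               # unknown prefix
  esac
}
```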

In summary, whichever of the six accession formats you provide, iSeq eventually downloads every Run it contains and checks the MD5 value of each one. If the MD5 value does not match the one in the public database, iSeq re-downloads the file, for a maximum of three rounds. If download and verification succeed within those attempts, the file name is stored in success.log; otherwise, the file name is stored in fail.log.
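
The verify-and-retry behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not iSeq's implementation: `md5_matches` and `fetch_with_retry` are hypothetical names, the download step is passed in as a placeholder command, GNU `md5sum` is assumed to be available, and "three rounds" is modelled as three attempts in total.

```shell
# Return 0 when the MD5 of file $1 equals the expected checksum $2.
# Assumes GNU coreutils md5sum is available.
md5_matches() {
  [ -f "$1" ] || return 1
  [ "$(md5sum "$1" | awk '{ print $1 }')" = "$2" ]
}

# Run the download command $3 (a hypothetical placeholder) for file $1 up to
# $4 times (default 3) and log the outcome to success.log or fail.log.
fetch_with_retry() {
  local file="$1" expected="$2" fetch="$3" tries="${4:-3}" attempt=1
  while [ "$attempt" -le "$tries" ]; do
    "$fetch" "$file" || true          # a failed download simply uses up an attempt
    if md5_matches "$file" "$expected"; then
      printf '%s\n' "$file" >> success.log
      return 0
    fi
    attempt=$((attempt + 1))
  done
  printf '%s\n' "$file" >> fail.log
  return 1
}
```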

2. -m, --metadata

Download only the sample information of the accession and skip the download of sequencing data.

iseq -i PRJNA211801 -m
iseq -i CRR343031 -m

Regardless of whether the -m parameter is used, the sample information of the accession is always retrieved. If metadata cannot be retrieved, iSeq exits without proceeding to the subsequent download.

Note

Note 1: If the retrieved accession is in the SRA/ENA/DDBJ/GEO databases, iSeq first searches the ENA database. If sample information can be retrieved, it downloads metadata in TSV format via the ENA API, typically containing 191 columns. However, some recently released data in the SRA database may not yet be synchronized to the ENA database. Therefore, if metadata cannot be obtained from the ENA database, iSeq downloads metadata in CSV format directly via the SRA Database Backend, typically containing 30 columns. To stay consistent with the TSV format, the CSV is converted to TSV using sed -i 's/,/\t/g'. Note that this replaces every comma, so a field that itself contains a comma will shift the columns. Ultimately, you obtain sample information named ${accession}.metadata.tsv.
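
If the column shift caused by commas inside fields is a problem for your data, the CSV can be re-converted with a quote-aware pass. The sketch below is an alternative technique, not what iSeq runs: `csv_to_tsv` is a hypothetical helper that treats commas inside double-quoted fields as data, and it assumes that no field spans multiple lines.

```shell
# Hypothetical helper, not part of iSeq: convert CSV to TSV while keeping
# commas that sit inside double-quoted fields. Reads files or stdin.
csv_to_tsv() {
  awk '{
    out = ""; in_quotes = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\"") {
        if (in_quotes && substr($0, i + 1, 1) == "\"") {
          out = out "\""; i++             # "" inside quotes is a literal quote
        } else {
          in_quotes = !in_quotes          # opening or closing quote
        }
      } else if (c == "," && !in_quotes) {
        out = out "\t"                    # field separator
      } else {
        out = out c
      }
    }
    print out
  }' "$@"
}
```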

Note

Note 2: If the retrieved accession is in the GSA database, iSeq obtains sample information via GSA's getRunInfoByCra interface (getRunInfo before the 2024.09.14 update), downloading metadata in CSV format, typically containing 25 columns. This metadata is saved as ${accession}.metadata.csv. To supplement it with more detailed information, iSeq automatically obtains the metadata of the Project to which the accession belongs via GSA's exportExcelFile interface, downloading it in XLSX format, typically with 3 sheets: Sample, Experiment, Run. The final metadata information will be saved as ${accession}.metadata.xlsx. In summary, you will ultimately obtain sample information named ${accession}.metadata.csv and CRA*.metadata.xlsx.

3. -g, --gzip

Directly download FASTQ files in gzip format. If direct download is not possible, SRA files are downloaded, converted to FASTQ, and compressed to gzip format using multiple threads.

iseq -i SRR1178105 -g

Since most data in the GSA database is stored directly in gzip format, if the accession is from the GSA database, you can directly download FASTQ files in gzip format whether or not the -g parameter is used.

If the accession is from the SRA/ENA/DDBJ/GEO databases, iSeq will first attempt to access the ENA database. If it can directly download FASTQ files in gzip format, it will do so; otherwise, it will download SRA files and convert them to FASTQ format using the fasterq-dump tool, then compress the FASTQ files using the pigz tool, ultimately obtaining FASTQ files in gzip format.
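
The fallback path above roughly corresponds to two commands. The sketch below only prints them so they can be inspected first; `sra_to_fastq_gz_plan` is a hypothetical helper, and the exact flags iSeq passes to fasterq-dump and pigz are not stated in this README, so `--split-3`, `--threads`, and `-p` here are assumptions chosen from those tools' documented options.

```shell
# Hypothetical helper, not part of iSeq: print (do not run) the commands that
# would convert one downloaded .sra file into gzip-compressed FASTQ files.
# $1 = path to the .sra file, $2 = thread count (default 8, as for -t).
sra_to_fastq_gz_plan() {
  local sra="$1" threads="${2:-8}" run
  run=$(basename "$sra" .sra)
  printf 'fasterq-dump --split-3 --threads %s %s\n' "$threads" "$sra"
  printf 'pigz -p %s %s*.fastq\n' "$threads" "$run"
}

sra_to_fastq_gz_plan SRR1178105.sra 8
```

Piping the printed plan to `sh` would execute it, provided sra-tools and pigz are installed.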

Tip

parallel-fastq-dump can also convert SRA to gzip-compressed FASTQ files, typically 2-3 times faster than fasterq-dump + pigz. However, considering IO limitations, iSeq currently does not support parallel-fastq-dump.

4. -q, --fastq

After the SRA files are downloaded, they are converted into multiple uncompressed FASTQ files.

iseq -i SRR1178105 -q

This parameter is only effective when the accession is from the SRA/ENA/DDBJ/GEO databases and the downloaded files are SRA files. After downloading the SRA files, iSeq will use the fasterq-dump tool to convert them into FASTQ files. Additionally, you can specify the number of threads for conversion using the -t parameter.

Note

Note 1: -q is particularly useful for downloading single-cell data, especially scATAC-Seq data, as it can split the files into four parts: I1, R1, R2, R3. However, if FASTQ files are downloaded directly via the -g parameter, only the R1 and R3 files will be obtained (e.g., SRR13450125), which may cause issues during subsequent data analysis.

Note

Note 2: When -q and -g are used together, the SRA file is downloaded first, then converted to FASTQ files using the fasterq-dump tool, and finally compressed into gzip format using pigz. FASTQ files in gzip format are not downloaded directly in this mode, which is very useful for obtaining complete single-cell data.

5. -t, --threads

Specifies the number of threads to use for converting SRA files into FASTQ files or compressing FASTQ files. The default value is 8.

iseq -i SRR1178105 -q -t 10

Since sequencing data files are generally large, you can specify the number of threads for conversion using the -t parameter. However, more threads do not necessarily mean better performance, because excessive threads can lead to high CPU or IO loads, especially since fasterq-dump consumes a considerable amount of IO, potentially impacting the execution of other tasks. Based on the benchmark evaluation, we recommend a maximum thread count of 15.
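
One way to respect the recommended ceiling is to clamp the thread count before passing it to -t. A minimal sketch, where `pick_threads` is a hypothetical helper and the cap of 15 comes from the recommendation above:

```shell
# Hypothetical helper, not part of iSeq: return the requested thread count,
# limited to the recommended maximum of 15.
pick_threads() {
  local wanted="$1" cap=15
  if [ "$wanted" -gt "$cap" ]; then
    echo "$cap"
  else
    echo "$wanted"
  fi
}
```

On systems with GNU coreutils, it could be combined with `nproc`, for example `iseq -i SRR1178105 -q -t "$(pick_threads "$(nproc)")"`.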

6. -e, --merge

Merge multiple FASTQ files into one FASTQ file for each Experiment (ex), Sample (sa) or Study (st).

iseq -i SRX003906 -g -e ex

Although in most cases an Experiment contains only one Run, some sequencing data may have multiple Runs within an Experiment (e.g., SRX003906, CRX020217). In that case, you can use the -e parameter to merge multiple FASTQ files from an Experiment into one. For paired-end sequencing, the fastq_1 and fastq_2 files must be merged together and the sequence names in corresponding lines must remain consistent, so iSeq merges the FASTQ files in the same order. Ultimately, for single-end sequencing data, a single file SRX*.fastq.gz will be generated, and for paired-end sequencing data, two files SRX*_1.fastq.gz and SRX*_2.fastq.gz will be generated.
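
The same-order requirement for paired-end data can be illustrated with a short sketch. It is not iSeq's code: `merge_paired_runs` is a hypothetical helper, and the `<Run>_1.fastq.gz` / `<Run>_2.fastq.gz` file naming is assumed. It relies on the fact that concatenated gzip files form a valid gzip stream.

```shell
# Hypothetical sketch, not iSeq's code: merge paired-end Runs into one pair
# of files for an Experiment. $1 is the Experiment accession; the remaining
# arguments are Run accessions, processed in the order given.
merge_paired_runs() {
  local exp="$1" run
  shift
  : > "${exp}_1.fastq.gz"
  : > "${exp}_2.fastq.gz"
  for run in "$@"; do
    # One loop, one order: read N in _1 still pairs with read N in _2.
    cat "${run}_1.fastq.gz" >> "${exp}_1.fastq.gz"
    cat "${run}_2.fastq.gz" >> "${exp}_2.fastq.gz"
  done
}
```

Appending both mates inside a single loop keeps the two output files aligned, which would not be guaranteed if each mate were merged with a separately expanded wildcard.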

Note

Note 1: If the accession is a Run ID, the -e parameter cannot be used (see below). Currently, iSeq supports merging both gzip-compressed and uncompressed FASTQ files, but does not support merging files such as BAM files and tar.gz files.

  • -e ex: merge all fastq files of the same Experiment into one fastq file. Accepted accession format: ERX, DRX, SRX, CRX.
  • -e sa: merge all fastq files of the same Sample into one fastq file. Accepted accession format: ERS, DRS, SRS, SAMC, GSM.
  • -e st: merge all fastq files of the same Study into one fastq file. Accepted accession format: ERP, DRP, SRP, CRA.

Note

Note 2: Normally, when an Experiment contains only one Run, the files of that Run share the same prefix. For example, SRR52991314_1.fq.gz and SRR52991314_2.fq.gz share the prefix SRR52991314. In this case, iSeq directly renames them to SRX*_1.fastq.gz and SRX*_2.fastq.gz. However, there are exceptions, such as CRX006713, where the Run CRR007192 contains files with different prefixes. In such cases, iSeq renames them to SRX*_original_filename; for example, CRX006713_CRD015671.gz and CRX006713_CRD015672.gz.

7. -d, --database

Specifies the database for downloading SRA files, supporting the ENA and SRA databases.

iseq -i SRR1178105 -d sra

By default, iSeq will automatically detect available databases, so specifying the -d parameter is usually unnecessary. However, some SRA files may download slowly from the ENA database. In such cases, you can force downloading from the SRA database by specifying -d sra.

Note

Note: If the corresponding SRA file is not found in the ENA database, iSeq will still automatically switch to downloading from the SRA database, even if the -d ena parameter is specified.

8. -p, --parallel

Enables multi-threaded (multi-connection) downloading and requires specifying the number of connections.

iseq -i PRJNA211801 -p 10

Since wget may be slow in some cases, you can use the -p parameter to let iSeq use the axel tool for multi-threaded downloading.

Note

Note 1: Resuming a multi-threaded download only works when the number of connections is unchanged. That is, if the -p 10 parameter is used for the first download, it must also be used for the second download to resume it.

Note

Note 2: With -p 10, iSeq maintains 10 connections throughout the download process. Therefore, you will see the same Connection * finished message appear multiple times during the download. This is because some connections are released immediately after finishing their part, and new connections are then established to continue downloading.

9. -a, --aspera

Use Aspera for downloading.

iseq -i PRJNA211801 -a -g

As Aspera offers faster download speeds, you can use the -a parameter to instruct iSeq to use the ascp tool for downloading. Unfortunately, Aspera downloading is currently only supported by the GSA and ENA databases. Data in the NCBI SRA database cannot be downloaded with Aspera, because NCBI predominantly relies on Google Cloud and AWS; for other reasons, see Avoid-using-ascp.

Note

Note 1: When accessing the GSA database, if download links from Huawei Cloud are available, iSeq will prioritize downloading through Huawei Cloud, even if the -a parameter is used. This is because Huawei Cloud offers faster and more stable download speeds. Therefore, when downloading GSA data, it is recommended to use the -a parameter. This way, if Huawei Cloud is unavailable, downloading through the Aspera channel is still relatively fast. Otherwise, you will have to fall back to wget or axel, which are slower.

Note

Note 2: Since Aspera requires a key file, iSeq will automatically search for the key file in the conda environment or the ~/.aspera directory. If the key file is not found, downloading will not be possible.
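
To check ahead of time whether a key file is present in either location, a search along these lines can be used. This is a sketch, not iSeq's lookup logic: `find_aspera_key` is a hypothetical helper, and the key file name is passed in by the caller because this README does not state which name iSeq looks for.

```shell
# Hypothetical helper, not part of iSeq: print the first file named $1 found
# under any of the remaining directory arguments; return 1 if none is found.
find_aspera_key() {
  local name="$1" dir hit
  shift
  for dir in "$@"; do
    [ -d "$dir" ] || continue
    hit=$(find "$dir" -type f -name "$name" 2>/dev/null | head -n 1)
    if [ -n "$hit" ]; then
      printf '%s\n' "$hit"
      return 0
    fi
  done
  return 1
}
```

For example, `find_aspera_key 'asperaweb_id_dsa.openssh' "$CONDA_PREFIX" "$HOME/.aspera"` searches the active conda environment and ~/.aspera; the file name in this call is an assumption and depends on the installed Aspera client.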

10. -o, --output

The output directory. If it does not exist, it will be created (default: current directory).

11. -s, --speed

Download speed limit (MB/s) (default: 1000 MB/s) for Wget, AXEL, and Aspera.

Output

  • If the query accession is in the SRA/ENA/DDBJ/GEO databases, the following files will be generated:

| Output        | Description                                               |
|---------------|-----------------------------------------------------------|
| SRA files     | Can be converted to FASTQ files using the -q option       |
| .metadata.tsv | Metadata for the query accession                          |
| success.log   | Names of the SRA files that were downloaded successfully  |
| fail.log      | Names of the SRA files that failed to download            |

  • If the query accession is in the GSA database, the following files will be generated:

| Output         | Description                                                        |
|----------------|--------------------------------------------------------------------|
| GSA files      | Mostly in *.gz format, and a few in bam/tar/bz2 format             |
| .metadata.csv  | Metadata for the query accession                                   |
| .metadata.xlsx | Metadata for the Project containing the query accession, in xlsx format |
| success.log    | Names of the GSA files that were downloaded successfully           |
| fail.log       | Names of the GSA files that failed to download                     |
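
After a large batch, the two log files can be compared to see which files still need attention. A minimal sketch, assuming both logs hold one file name per line as described above; `failed_only` is a hypothetical helper, not an iSeq command.

```shell
# Hypothetical helper, not part of iSeq: list names that appear in fail.log
# but not in success.log inside directory $1 (default: current directory).
failed_only() {
  local dir="${1:-.}"
  [ -f "$dir/fail.log" ] || return 0
  if [ -s "$dir/success.log" ]; then
    # -x whole-line match, -F fixed strings, -f patterns from file, -v invert
    sort -u "$dir/fail.log" | grep -vxFf "$dir/success.log" || true
  else
    sort -u "$dir/fail.log"
  fi
}
```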

Inspired

iSeq was inspired by fastq-dl, fetchngs, pysradb, and Kingfisher. These excellent tools may also be very helpful. Below is a comparison of the different tools:

| Software name | Programming language | Supported databases | Supported accessions | Supported formats | Supported methods | Fetch metadata | MD5 check | Resumable download | Parallel download | Merge FASTQ | Skip downloaded | Conda installable | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iSeq | Shell | GSA, SRA, ENA, DDBJ, GEO | All | fq, fq.gz, sra, bam | wget, axel, aspera | | | | | | | | 🔗 |
| edgeturbo | C | GSA | All denied | fq, fq.gz, bam | edgeturbo download | | | | | | | | 🔗 |
| SRA Toolkit | C | SRA, ENA, DDBJ | All denied except Run ID | fq, fq.gz, sra | prefetch | | | | | | | | 🔗 |
| enaBrowserTools | Python | SRA, ENA, DDBJ | All except GSA/GEO ID | fq, fq.gz, sra | urllib, aspera | | | | | | | | 🔗 |
| fastq-dl | Python | SRA, ENA, DDBJ | All except GSA/GEO ID | fq, fq.gz, sra, sra.lite | wget | | | | | | | | 🔗 |
| fetchngs | Python | SRA, ENA, DDBJ, GEO | All except GSA ID | fq, fq.gz | wget, aspera, prefetch | | | | | | | | 🔗 |
| pysradb | Python | SRA, ENA, DDBJ, GEO | All except GSA ID | fq, fq.gz, sra, bam | requests, aspera | | | | | | | | 🔗 |
| Kingfisher | Python | SRA, ENA, DDBJ | All except GSA/GEO ID | fq, fq.gz, sra | curl, aria2c, aspera | | | | | | | | 🔗 |
| ffq | Python | SRA, ENA, DDBJ, GEO | All except GSA ID | fq, fq.gz, sra, bam | requests | | | | | | | | 🔗 |

Contributing

Contributions to iSeq are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue on the project's GitHub repository. If you would like to contribute code, please fork the repository, make your changes, and submit a pull request.

Cite us: https://doi.org/10.1101/2024.05.16.594538

License

This project is licensed under the MIT License.

