iSeq: An integrated tool to fetch public sequencing data

Cite us: Haoyu Chao, Zhuojin Li, Dijun Chen, Ming Chen, iSeq: An integrated tool to fetch public sequencing data, Bioinformatics, 2024, btae641, https://doi.org/10.1093/bioinformatics/btae641

Description

iSeq is a Bash script that allows you to download sequencing data and metadata from the GSA, SRA, ENA, and DDBJ databases. See Detail Pipeline for iSeq for the full workflow.

[Figure: the basic pipeline of iSeq]

Important

To use iSeq, your system must be connected to the network and support the FTP, HTTP, and HTTPS protocols.

Update Notes:

2025.06.16

When using -e, --merge, create symbolic links or retain the original Run files to avoid re-downloading them after merging.
Fixed the issue mentioned in #40: batch downloads no longer terminate upon encountering an error and instead continue until all items are processed.
Added an error message when download failures occur, such as: Download failures detected, please check fail.log for details.
Fixed a bug where incomplete downloads from GSA were incorrectly reported as successful.

2025.05.23

Fixed the issue mentioned in #39. The problem was that using both -d sra and -g together would skip the MD5 check in vdb-validate.
New -k, --skip-md5 option: Added this option to disable MD5 checks.

More Updates

2025.04.25

Fixed a bug that occurred when re-downloading with empty metadata.
Fixed a bug where the while loop exited abnormally with a non-zero exit code.

2025.04.22

Fixed the issue mentioned in #33. The -s, --speed option is enabled again.
Fixed the exception when the metadata file is empty, mentioned in #34.
Fixed MD5 checksum failures when downloading gzip data from ONT or HiFi third-generation sequencing.

2025.04.02

Fixed the issue mentioned in #27 and Rednote: In sra-tools > 3.0.0, running vdb-validate without specifying the SRA file path causes it to re-download the file, leading to a stuck process. Specifying the path (e.g., vdb-validate ./SRR931847) resolves the issue.

2025.03.14

Fixed the issue mentioned in #26. The cause was that the data was paired-end but had only one link, such as SRR23680070.

2025.03.11

The input file can contain accessions from different databases.
-p and -a can be used simultaneously, with -a taking priority.
Fixed some bugs when retrying downloads from GSA.

2024.12.26

Fixed the bugs mentioned in #16, #17 (2024.12.16) and #19 (2024.12.26).

2024.11.21

Dependency update for aspera-cli: The version requirement for aspera-cli has been updated from aspera-cli to aspera-cli=4.14.0.

2024.10.23

New -s, --speed option to set the download speed limit (MB/s) (default: 1000 MB/s), such as iseq -i SRR7706354 -s 10.
Dependency update for sra-tools: The version requirement for sra-tools has been updated from sra-tools=2.11 to sra-tools>=2.11.0.

2024.09.14

New -e option for merging FASTQ files: Added a -e option to merge multiple FASTQ files into a single file for each Experiment (-e ex), Sample (-e sa), or Study (-e st).
New -i option for input: iSeq can now accept a file containing multiple accession numbers as input with -i fileName.
API change for GSA metadata download: The API endpoint has been updated from getRunInfo to getRunInfoByCra for downloading GSA metadata.
Save results to a personal directory: The output results can now be saved in a directory of the user's choice with the -o option.
Updated regex for SAMC matching: The matching pattern for SAMC has been changed from SAMC[A-Z]?[0-9]+ to SAMC[0-9]+.
Fixed some bugs.

Features

Multiple Database Support: Supports multiple bioinformatics databases (GSA/SRA/ENA/DDBJ/GEO).
Multiple Input Formats: Supports multiple accessions (Project, Study, Sample, Experiment, or Run accession).
Metadata Download: Supports downloading sample metadata for each accession.

More features

File Format Selection: Users can choose to directly download gzip-formatted FASTQ files or download SRA files and convert them to FASTQ format.
Multi-threading Support: Supports the use of multi-threading to accelerate the conversion of SRA to FASTQ files or the compression of FASTQ files.
File Merging: For experiment-level accession, the script can merge multiple FASTQ files into one.
Parallel Download: Supports parallel download connections, allowing you to specify the number of connections to speed up downloads.
Support for Aspera High-speed Download: For GSA/ENA databases, the script supports high-speed data transfer using Aspera.
Automatic Retry Mechanism: If a download or verification fails, the script will automatically retry until a set number of attempts have been reached.
Automated File Verification: After the download is complete, the script will automatically verify the integrity of the files, including checking file sizes and MD5 checksums.
Error Handling: The script provides error messages and suggestions for solutions when encountering errors.

Installation

1. iSeq can be easily installed with conda

conda install bioconda::iseq

If conda reports Found conflicts!, you can try conda install -c conda-forge -c bioconda iseq

2. The latest version of iSeq can also be installed from source, see INSTALL

# Use the following command to check whether dependent software is installed
iseq --version

Example (See more)

Download all Run sequencing data and metadata associated with an accession.

iseq -i PRJNA211801

Batch download by Aspera with -a to directly download gzip-formatted FASTQ files with -g.

iseq -i SRR_Acc_List.txt -a -g

Usage (中文教程✨)

$ iseq --help

Usage:
  iseq -i accession [options]

Required option:
  -i, --input     [text|file]   Single accession or a file containing multiple accessions.
                                Note: Only one accession per line in the file

Optional options:
  -m, --metadata                Skip the sequencing data downloads and only fetch the metadata for the accession.
  -g, --gzip                    Download FASTQ files in gzip format directly (*.fastq.gz).
                                Note: if *.fastq.gz files are not available, SRA files will be downloaded and converted to *.fastq.gz files.
  -q, --fastq                   Convert SRA files to FASTQ format.
  -t, --threads   int           The number of threads to use for converting SRA to FASTQ files or compressing FASTQ files (default: 8).
  -e, --merge     [ex|sa|st]    Merge multiple fastq files into one fastq file for each Experiment, Sample or Study.
  -d, --database  [ena|sra]     Specify the database to download SRA sequencing data (default: ena).
                                Note: new SRA files may not be available in the ENA database, even if you specify "ena".
  -p, --parallel  int           Download sequencing data in parallel, the number of connections needs to be specified, such as -p 10.
                                Note: breakpoint continuation cannot be shared between different numbers of connections.
  -a, --aspera                  Use Aspera to download sequencing data, only support GSA/ENA database.
  -s, --speed     int           Download speed limit (MB/s) (default: 1000 MB/s).
  -k, --skip-md5                Skip the md5 check for the downloaded files.
  -o, --output    text          The output directory. If not exists, it will be created (default: current directory).
  -h, --help                    Show the help information.
  -v, --version                 Show the script version.

1. `-i`, `--input`

Enter the accession you want to download. You can also provide a file containing multiple accessions (only one accession per line).

iseq -i PRJNA211801

iSeq first retrieves the metadata of the accession, then downloads each Run it contains.
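
Since the input file must hold one accession per line, it can help to clean up a hand-edited list before passing it to -i. The following is a minimal sketch; normalize_accession_list is a hypothetical helper and not part of iSeq. It puts every whitespace-separated token on its own line (which also removes Windows carriage returns and blank lines) and drops repeated accessions.

```shell
# Hypothetical helper (not part of iSeq): rewrite a raw accession list so that
# it holds exactly one accession per line, without duplicates.
normalize_accession_list() {
  # $1: path to the raw list
  # tr turns every run of whitespace (spaces, tabs, CR, LF) into one newline;
  # awk drops empty lines and keeps only the first occurrence of each accession.
  tr -s '[:space:]' '\n' < "$1" | awk 'NF && !seen[$0]++'
}

# Possible usage (file names are examples):
#   normalize_accession_list raw_list.txt > SRR_Acc_List.txt
#   iseq -i SRR_Acc_List.txt
```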

iSeq currently supports 6 accession formats from the following 5 databases, with supported accession prefixes as follows:

Databases	BioProject	Study	BioSample	Sample	Experiment	Run
GSA	PRJC	CRA	SAMC	\	CRX	CRR
SRA	PRJNA	SRP	SAMN	SRS	SRX	SRR
ENA	PRJEB	ERP	SAME	ERS	ERX	ERR
DDBJ	PRJDB	DRP	SAMD	DRS	DRX	DRR
GEO	GSE	\	GSM	\	\	\

Additionally, for the two accession formats (GSE/GSM) from the GEO database, iSeq will directly retrieve the associated PRJNA/SAMN, then obtain the contained Runs and download the sequencing data. In effect, it still downloads the sequencing data from the SRA database.

Here are some examples:

Accession Type	Prefixes	Example
BioProject	PRJEB, PRJNA, PRJDB, PRJC, GSE	PRJEB42779, PRJNA480016, PRJDB14838, PRJCA000613, GSE122139
Study	ERP, DRP, SRP, CRA	ERP126685, DRP009283, SRP158268, CRA000553
BioSample	SAMD, SAME, SAMN, SAMC	SAMD00258402, SAMEA7997453, SAMN06479985, SAMC017083
Sample	ERS, DRS, SRS, GSM	ERS5684710, DRS259711, SRS2024210, GSM7417667
Experiment	ERX, DRX, SRX, CRX	ERX5050800, DRX406443, SRX4563689, CRX020217
Run	ERR, DRR, SRR, CRR	ERR5260405, DRR421224, SRR7706354, CRR311377
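
The prefix tables above can be expressed as a simple lookup. The sketch below is illustrative only; classify_accession is a hypothetical function, not iSeq's own code. It follows the first table, so GSM is reported as a GEO BioSample even though the examples table lists it under Sample.

```shell
# Hypothetical helper (not part of iSeq): print the database and accession
# type for an accession, based on the prefixes in the first table above.
classify_accession() {
  case "$1" in
    PRJC*)  echo "GSA BioProject" ;;
    PRJNA*) echo "SRA BioProject" ;;
    PRJEB*) echo "ENA BioProject" ;;
    PRJDB*) echo "DDBJ BioProject" ;;
    GSE*)   echo "GEO BioProject" ;;
    CRA*)   echo "GSA Study" ;;
    SRP*)   echo "SRA Study" ;;
    ERP*)   echo "ENA Study" ;;
    DRP*)   echo "DDBJ Study" ;;
    SAMC*)  echo "GSA BioSample" ;;
    SAMN*)  echo "SRA BioSample" ;;
    SAME*)  echo "ENA BioSample" ;;
    SAMD*)  echo "DDBJ BioSample" ;;
    GSM*)   echo "GEO BioSample" ;;
    SRS*)   echo "SRA Sample" ;;
    ERS*)   echo "ENA Sample" ;;
    DRS*)   echo "DDBJ Sample" ;;
    CRX*)   echo "GSA Experiment" ;;
    SRX*)   echo "SRA Experiment" ;;
    ERX*)   echo "ENA Experiment" ;;
    DRX*)   echo "DDBJ Experiment" ;;
    CRR*)   echo "GSA Run" ;;
    SRR*)   echo "SRA Run" ;;
    ERR*)   echo "ENA Run" ;;
    DRR*)   echo "DDBJ Run" ;;
    *)      echo "Unknown accession prefix: $1" >&2; return 1 ;;
  esac
}
```

For example, classify_accession CRR311377 prints "GSA Run", and an unrecognized prefix returns a non-zero exit code.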

In summary, whichever of the six accession formats you provide, iSeq will eventually download each contained Run and check its MD5 value. If the MD5 value does not match that in the public database, it will attempt a maximum of three rounds of re-downloading. If download and verification succeed, the file name will be stored in success.log; otherwise, the file name will be stored in fail.log.
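
The download, verify, and retry cycle described above can be sketched roughly as follows. This is not iSeq's actual implementation: md5_matches and download_with_retry are hypothetical names, the limit of three attempts in total is one reading of the text, and md5sum is assumed to be available.

```shell
# Hypothetical sketch (not iSeq's actual code).
md5_matches() {
  # $1: file, $2: expected MD5 hex digest
  [ -f "$1" ] && [ "$(md5sum "$1" | awk '{ print $1 }')" = "$2" ]
}

download_with_retry() {
  # $1: output file, $2: expected MD5, remaining arguments: the download command
  local file=$1 expected=$2 attempt
  shift 2
  for attempt in 1 2 3; do
    # Accept the file only if the command succeeded AND the checksum matches.
    if "$@" && md5_matches "$file" "$expected"; then
      echo "$file" >> success.log
      return 0
    fi
    echo "Attempt $attempt failed for $file" >&2
  done
  echo "$file" >> fail.log
  return 1
}

# Possible usage ($url and $md5 are placeholders taken from the metadata):
#   download_with_retry SRR7706354.sra "$md5" wget -c "$url"
```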

2. `-m`, `--metadata`

Download only the sample information of the accession and skip the download of sequencing data.

iseq -i PRJNA211801 -m
iseq -i CRR343031 -m

Regardless of whether the -m parameter is used, iSeq always retrieves the sample information of the accession. If metadata cannot be retrieved, iSeq exits without proceeding to the download.

Note

Note 1: If the retrieved accession is in the SRA/ENA/DDBJ/GEO databases, iSeq will first search the ENA database. If sample information can be retrieved, it will download metadata in TSV format via the ENA API, typically containing 191 columns. However, some recently released data in the SRA database may not yet be synchronized to the ENA database. Therefore, if metadata cannot be obtained from the ENA database, iSeq will download metadata in CSV format directly via the SRA Database Backend, typically containing 30 columns. To maintain consistency with the TSV format, the CSV is converted to TSV using sed -i 's/,/\t/g'. Note that if a single field contains a comma, this substitution splits the field and misaligns the columns. Ultimately, you will obtain sample information named ${accession}.metadata.tsv.
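
To illustrate the column disorder mentioned in Note 1, the sketch below compares a global comma-to-tab substitution with a quote-aware conversion. Both function names are hypothetical, and the quote-aware version is a possible workaround rather than what iSeq does; it assumes fields are quoted with double quotes and does not handle escaped quotes or line breaks inside a field.

```shell
# Global substitution, equivalent in effect to the sed command in Note 1.
naive_csv_to_tsv() {
  sed "s/,/$(printf '\t')/g"
}

# Hypothetical quote-aware alternative (not part of iSeq): commas inside
# double-quoted fields are kept, and the surrounding quotes are removed.
quoted_csv_to_tsv() {
  awk '{
    out = ""; in_quotes = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\"") { in_quotes = !in_quotes; continue }
      if (c == "," && !in_quotes) c = "\t"
      out = out c
    }
    print out
  }'
}

# Possible usage: count the resulting columns for a row with a quoted comma.
#   row='SRR1,"Homo sapiens, blood",PAIRED'
#   printf '%s\n' "$row" | naive_csv_to_tsv  | awk -F'\t' '{ print NF }'
#   printf '%s\n' "$row" | quoted_csv_to_tsv | awk -F'\t' '{ print NF }'
```

With the sample row in the usage comment, the global substitution yields four columns while the quote-aware version keeps the intended three.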

Note

Note 2: If the retrieved accession is in the GSA database, iSeq will obtain sample information via GSA's getRunInfo interface, downloading metadata in CSV format, typically containing 25 columns. This metadata is saved as ${accession}.metadata.csv. To supplement it with more detailed information, iSeq will automatically obtain metadata for the Project to which the accession belongs via GSA's exportExcelFile interface, downloading metadata in XLSX format, typically with 3 sheets: Sample, Experiment, Run. This metadata is saved as ${accession}.metadata.xlsx. In summary, you will ultimately obtain sample information named ${accession}.metadata.csv and CRA*.metadata.xlsx.

3. `-g`, `--gzip`

Directly download FASTQ files in gzip format. If direct download is not possible, SRA files will be downloaded and converted to gzip-compressed FASTQ files, using multiple threads for conversion and compression.

iseq -i SRR1178105 -g

Since most data stored in the GSA database is already in gzip format, FASTQ files from GSA accessions are downloaded in gzip format whether or not the -g parameter is used.

If the accession is from the SRA/ENA/DDBJ/GEO databases, iSeq will first attempt to access the ENA database. If it can directly download FASTQ files in gzip format, it will do so; otherwise, it will download SRA files and convert them to FASTQ format using the fasterq-dump tool, then compress the FASTQ files using the pigz tool, ultimately obtaining FASTQ files in gzip format.

Tip

parallel-fastq-dump can also convert SRA to gzip-compressed FASTQ files, typically 2-3 times faster than fasterq-dump + pigz. However, considering IO limitations, iSeq currently does not support parallel-fastq-dump.

4. `-q`, `--fastq`

After the SRA files are downloaded, they are converted into multiple uncompressed FASTQ files.

iseq -i SRR1178105 -q

This parameter is only effective when the accession is from the SRA/ENA/DDBJ/GEO databases and the downloaded files are SRA files. After downloading the SRA files, iSeq will use the fasterq-dump tool to convert them into FASTQ files. Additionally, you can specify the number of threads for conversion using the -t parameter.

Note

Note 1: -q is particularly useful for downloading single-cell data, especially scATAC-Seq data, as it can decompose the files into four parts: I1, R1, R2, R3. However, if FASTQ files are downloaded directly via the -g parameter, only the R1 and R3 files will be obtained (e.g., SRR13450125), which may cause issues during subsequent data analysis.

Note

Note 2: When -q and -g are used together, the SRA file will first be downloaded, then converted to FASTQ files using the fasterq-dump tool, and finally compressed into gzip format using pigz. FASTQ files in gzip format are not downloaded directly, which is very useful for obtaining comprehensive single-cell data.

5. `-t`, `--threads`

Specifies the number of threads to use for converting SRA files into FASTQ files or compressing FASTQ files. The default value is 8.

iseq -i SRR1178105 -q -t 10

Considering that sequencing data files are generally large, you can specify the number of threads for conversion using the -t parameter. However, more threads do not necessarily mean better performance, because excessive threads can lead to high CPU or IO loads, especially since fasterq-dump consumes a considerable amount of IO, potentially impacting the execution of other tasks. Based on the benchmark evaluation, we recommend a maximum thread count of 15.
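
One way to apply this recommendation in a wrapper script is to cap the requested thread count before passing it to -t. The sketch below is an assumption about how you might do that; pick_threads is a hypothetical helper, not an iSeq feature, and nproc is only consulted when no CPU count is given.

```shell
# Hypothetical helper (not part of iSeq): choose a value for -t that exceeds
# neither the available CPUs nor the recommended ceiling of 15.
pick_threads() {
  # $1: requested threads, $2: available CPUs (defaults to the output of nproc)
  local requested=$1 cpus=${2:-$(nproc)} limit=15
  if [ "$cpus" -lt "$limit" ]; then
    limit=$cpus
  fi
  if [ "$requested" -gt "$limit" ]; then
    requested=$limit
  fi
  echo "$requested"
}

# Possible usage:
#   iseq -i SRR1178105 -q -t "$(pick_threads 32)"
```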

6. `-e`, `--merge`

Merge multiple FASTQ files into one FASTQ file for each Experiment (ex), Sample (sa) or Study (st).

iseq -i SRX003906 -g -e ex

Although in most cases an Experiment contains only one Run, some sequencing data may have multiple Runs within an Experiment (e.g., SRX003906, CRX020217). Hence, you can use the -e parameter to merge multiple FASTQ files from an Experiment into one. For paired-end sequencing, the fastq_1 and fastq_2 files need to be merged simultaneously and the sequence names in corresponding lines need to remain consistent, so iSeq merges the FASTQ files in the same order. Ultimately, for single-end sequencing data, a single file SRX*.fastq.gz will be generated, and for paired-end sequencing data, two files SRX*_1.fastq.gz and SRX*_2.fastq.gz will be generated.
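
The "same order" requirement can be illustrated with a small sketch. It is not iSeq's actual merge code; merge_paired_runs and the file naming are assumptions. It relies on the fact that concatenated gzip files form a valid gzip file, and it appends the _1 and _2 files of each Run in one loop so that read i in the merged _1 file still pairs with read i in the merged _2 file.

```shell
# Hypothetical sketch (not iSeq's actual code): merge paired-end Runs into one
# pair of files, visiting the Runs in the same order for both mates.
merge_paired_runs() {
  # $1: output prefix (e.g. an Experiment accession); remaining args: Run prefixes
  local out=$1 run
  shift
  # Start from empty output files so that re-running does not append twice.
  : > "${out}_1.fastq.gz"
  : > "${out}_2.fastq.gz"
  for run in "$@"; do
    cat "${run}_1.fastq.gz" >> "${out}_1.fastq.gz"
    cat "${run}_2.fastq.gz" >> "${out}_2.fastq.gz"
  done
}

# Possible usage (Run names are placeholders):
#   merge_paired_runs SRX003906 RUN_A RUN_B
```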

Note

Note 1: If the accession is a Run ID, the -e parameter cannot be used (see below). Currently, iSeq supports merging both gzip-compressed and uncompressed FASTQ files, but does not support merging files such as BAM files and tar.gz files.

-e ex: merge all fastq files of the same Experiment into one fastq file. Accepted accession format: ERX, DRX, SRX, CRX.
-e sa: merge all fastq files of the same Sample into one fastq file. Accepted accession format: ERS, DRS, SRS, SAMC, GSM.
-e st: merge all fastq files of the same Study into one fastq file. Accepted accession format: ERP, DRP, SRP, CRA.

Note

Note 2: Normally, when an Experiment contains only one Run, the files of that Run share the same prefix. For example, SRR52991314_1.fq.gz and SRR52991314_2.fq.gz have the same prefix SRR52991314. In this case, iSeq will directly rename them to SRX*_1.fastq.gz and SRX*_2.fastq.gz. However, there are exceptions, such as CRX006713, where the Run CRR007192 contains files with different prefixes. In such cases, iSeq will rename them as SRX*_original_filename; in this example, they will be renamed as CRX006713_CRD015671.gz and CRX006713_CRD015672.gz.

7. `-d`, `--database`

Specifies the database for downloading SRA files, supporting the ENA and SRA databases.

iseq -i SRR1178105 -d sra

By default, iSeq will automatically detect available databases, so specifying the -d parameter is usually unnecessary. However, some SRA files may download slowly from the ENA database. In such cases, you can force downloading from the SRA database by specifying -d sra.

Note

Note: If the corresponding SRA file is not found in the ENA database, even if the -d ena parameter is specified, iSeq will still automatically switch to downloading from the SRA database.

8. `-p`, `--parallel`

Enables multi-threaded downloading and requires specifying the number of threads.

iseq -i PRJNA211801 -p 10

Considering that wget may be slow in some cases, you can use the -p parameter to let iSeq utilize the axel tool for multi-threaded downloading.

Note

Note 1: The resumable download feature of multi-threaded downloading only works with the same number of connections. That is, if the -p 10 parameter is used for the first download, it must also be used for the second download to resume it.

Note

Note 2: With -p 10, iSeq will maintain 10 connections throughout the download process. Therefore, you will see the same Connection * finished message appear multiple times during the download. This is because some connections are released immediately after completing their part, and new connections are then established to continue downloading.

9. `-a`, `--aspera`

Use Aspera for downloading.

iseq -i PRJNA211801 -a -g

As Aspera offers faster download speeds, you can use the -a parameter to instruct iSeq to use the ascp tool for downloading. Unfortunately, Aspera downloading is currently only supported by the GSA and ENA databases. The NCBI SRA database cannot use Aspera because it predominantly relies on Google Cloud and AWS Cloud, among other reasons; see Avoid-using-ascp.

Note

Note 1: When accessing the GSA database, if download links from Huawei Cloud are available, iSeq will prioritize downloading through Huawei Cloud, even if the -a parameter is used. This is because Huawei Cloud offers faster and more stable download speeds. Therefore, when downloading GSA data, it's recommended to use the -a parameter. This way, if access to Huawei Cloud is unavailable, downloading through the Aspera channel is still relatively fast. Otherwise, you'll have to resort to downloading via wget or axel, which are slower methods.

Note

Note 2: Since Aspera requires a key file, iSeq will automatically search for the key file in the conda environment or the ~/.aspera directory. If the key file is not found, downloading will not be possible.
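
If an Aspera download fails because of a missing key, you can check for the key file yourself before running iSeq. The following is a rough sketch, not iSeq's actual lookup logic; find_aspera_key is a hypothetical helper, and the key file name in the usage comment is an assumption that may differ between aspera-cli versions.

```shell
# Hypothetical helper (not iSeq's actual code): search a list of directories
# for a key file by name and print the first match.
find_aspera_key() {
  # $1: key file name; remaining arguments: directories to search
  local name=$1 dir match
  shift
  for dir in "$@"; do
    [ -d "$dir" ] || continue
    match=$(find "$dir" -type f -name "$name" 2>/dev/null | head -n 1)
    if [ -n "$match" ]; then
      echo "$match"
      return 0
    fi
  done
  return 1
}

# Possible usage (the key file name is an assumption):
#   find_aspera_key asperaweb_id_dsa.openssh "$CONDA_PREFIX" "$HOME/.aspera" \
#     || echo "No Aspera key file found" >&2
```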

10. `-o`, `--output`

The output directory. If it does not exist, it will be created (default: current directory).

11. `-s`, `--speed`

Download speed limit (MB/s) (default: 1000 MB/s) for Wget, AXEL and Aspera.

12. `-k`, `--skip-md5`

Starting from v1.9.2, you can choose to skip the MD5 file integrity check. If you want to perform the MD5 check again after skipping it, simply remove the -k parameter and run the same command.

Output

If the query accession is in the SRA/ENA/DDBJ/GEO database, the following files will be generated:

Output	Description
SRA files	Can be converted to FASTQ files using`-q` option
.metadata.tsv	Metadata for query accession
success.log	Names of SRA files that were downloaded successfully
fail.log	Names of SRA files that failed to download

If the query accession is in the GSA database, the following files will be generated:

Output	Description
GSA files	Mostly in *.gz format, and a few are bam/tar/bz2 format
.metadata.csv	Metadata for query accession
.metadata.xlsx	Metadata for Project including query accession in xlsx format
success.log	Names of GSA files that were downloaded successfully
fail.log	Names of GSA files that failed to download
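
After a batch run, the two log files give a quick way to see whether anything needs to be retried. The sketch below assumes that each log holds one file name per line; summarize_logs is a hypothetical helper, not part of iSeq.

```shell
# Hypothetical helper (not part of iSeq): count the entries in success.log and
# fail.log and return a non-zero exit code if any download failed.
summarize_logs() {
  # $1: output directory (defaults to the current directory)
  local dir=${1:-.} ok=0 failed=0
  if [ -f "$dir/success.log" ]; then
    ok=$(($(wc -l < "$dir/success.log")))
  fi
  if [ -f "$dir/fail.log" ]; then
    failed=$(($(wc -l < "$dir/fail.log")))
  fi
  echo "success=$ok fail=$failed"
  [ "$failed" -eq 0 ]
}

# Possible usage:
#   summarize_logs ./output || echo "Some downloads failed, see fail.log" >&2
```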

Inspired

iSeq was inspired by fastq-dl, fetchngs, pysradb, and Kingfisher. These excellent tools may also be very helpful. Below is a comparison of the different tools:

Software name	Program languages	Supported databases	Supported accessions	Supported formats	Supported methods	Fetch metadata	MD5 check	Resumable download	Parallel download	Merge FASTQ	Skip downloaded	Conda installable	URL
iSeq	Shell	GSA, SRA, ENA, DDBJ, GEO	All	fq, fq.gz, sra, bam	wget, axel, aspera	✔	✔	✔	✔	✔	✔	✔	🔗
edgeturbo	C	GSA	All denied	fq, fq.gz, bam	edgeturbo download	❌	❌	✔	❌	❌	❌	❌	🔗
SRA Toolkit	C	SRA, ENA, DDBJ	All denied except Run ID	fq, fq.gz, sra	prefetch	❌	✔	✔	❌	❌	✔	✔	🔗
enaBrowserTools	Python	SRA, ENA, DDBJ	All except GSA/GEO ID	fq, fq.gz, sra	urllib, aspera	✔	✔	✔	❌	❌	✔	✔	🔗
fastq-dl	Python	SRA, ENA, DDBJ	All except GSA/GEO ID	fq, fq.gz, sra, sra.lite	wget	✔	✔	❌	❌	✔	✔	✔	🔗
fetchngs	Python	SRA, ENA, DDBJ, GEO	All except GSA ID	fq, fq.gz	wget, aspera, prefetch	✔	✔	✔	❌	❌	✔	❌	🔗
pysradb	Python	SRA, ENA, DDBJ, GEO	All except GSA ID	fq, fq.gz, sra, bam	requests, aspera	✔	✔	✔	❌	❌	✔	✔	🔗
Kingfisher	Python	SRA, ENA, DDBJ	All except GSA/GEO ID	fq, fq.gz, sra	curl, aria2c, aspera	✔	✔	❌	✔	❌	✔	✔	🔗
ffq	Python	SRA, ENA, DDBJ, GEO	All except GSA ID	fq, fq.gz, sra, bam	requests	✔	✔	❌	❌	❌	❌	✔	🔗

Contributing

Contributions to iSeq are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue on the project's GitHub repository. If you would like to contribute code, please fork the repository, make your changes, and submit a pull request.

Cite us: https://doi.org/10.1101/2024.05.16.594538

License

This project is licensed under the MIT License.
