FASTQ-to-CodFreq pipeline for HIV-1 and SARS-CoV-2
hivdb/codfreq
CodonFrequency Table Format
The HIVDB Sequence Reads Interpretation Program accepts a codon frequency table stored in the CodFreq format. The CodFreq format consists of five columns:
- gene (`PR`, `RT`, or `IN`);
- position;
- total number of reads at this position;
- codon nucleotide triplet; and
- total number of reads of this codon.
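
As an illustration of the five columns listed above, here is a minimal Python sketch that reads a CodFreq table and computes the fraction of reads supporting each codon. It assumes the file is comma-separated with a header row named `gene,position,total,codon,count`; the delimiter, the header names, and the sample rows are assumptions made for this example and are not taken from this repository.

```python
import csv
import io

# Hypothetical sample data. The comma delimiter and the header names are
# assumptions for this sketch; check a real CodFreq file before relying on them.
SAMPLE = """gene,position,total,codon,count
RT,184,1000,ATG,900
RT,184,1000,GTG,100
"""


def read_codfreq(text):
    """Parse CodFreq text into a list of dicts, one per codon row."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        total = int(rec["total"])
        count = int(rec["count"])
        rows.append({
            "gene": rec["gene"],
            "position": int(rec["position"]),
            "total": total,
            "codon": rec["codon"],
            "count": count,
            # Share of the reads at this position that carry this codon
            "fraction": count / total if total else 0.0,
        })
    return rows


rows = read_codfreq(SAMPLE)
for row in rows:
    print(row["gene"], row["position"], row["codon"], f"{row['fraction']:.1%}")
```
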
This repository contains CodFreq files generated from publicly available SRA sequences. We have also included three selected files from studies that utilize Illumina sequencing. To analyze these files, first download one or more CodFreq example files. Then, submit them to the HIVDB Interpretation Program for analysis.
Install Docker CE (https://docs.docker.com/install/).
Download the script:

    sudo curl -sL https://raw.githubusercontent.com/hivdb/codfreq/main/bin-wrapper/align-all-docker -o /usr/local/bin/fastq2codfreq
    sudo chmod +x /usr/local/bin/fastq2codfreq

Download the alignment profiles:

    mkdir profiles
    curl -sL https://raw.githubusercontent.com/hivdb/codfreq/main/profiles/HIV1.json -o profiles/HIV1.json
    curl -sL https://raw.githubusercontent.com/hivdb/codfreq/main/profiles/SARS2.json -o profiles/SARS2.json

Use the following command to process FASTQ files and generate CodFreq files:

    fastq2codfreq -r profiles/HIV1.json -d path/to/fastq/folders

The script automatically finds every file with the `.fastq` extension, aligns it into a `.sam` file, and then extracts the codon frequency table into a `.codfreq` file.

The above command is adequate in most cases for both paired and unpaired FASTQ files generated by Illumina, with filenames that match the patterns `*_L001_R1_001.fastq.gz` and `*_L001_R1_002.fastq.gz`. However, if your FASTQ files follow a different naming convention, please read Advanced usages § Manually pairing FASTQ files.
Note: the `fastq2codfreq` script can only be executed on a Unix-like system. If you are using Microsoft Windows 10, you need to install the Windows Subsystem for Linux to use this script.
The `fastq2codfreq` command can be used offline, although the usage differs slightly from the description above. The differences are as follows:
- Docker's installation package, the `fastq2codfreq` script and the alignment profiles can be transferred to the offline server using a portable drive.
- The Docker image used by `fastq2codfreq` can be saved to a binary file and transferred to the offline server using a portable drive.

      # Run this command on a computer with Internet access
      docker save hivdb/codfreq-runner:latest | gzip > codfreq-runner.tar.gz
      # Run this command on the offline server
      docker load < codfreq-runner.tar.gz

- The auto-update option of `fastq2codfreq` should also be disabled with the argument `-s`:

      fastq2codfreq -s -r profiles/HIV1.json -d path/to/fastq/folders

A flag argument `-m` can be added to the `fastq2codfreq` command to disable auto-pairing of FASTQ files.

    fastq2codfreq -m -r profiles/HIV1.json -d path/to/fastq/folders

With paired FASTQ files, the process generates a single CodFreq file. The program tries to match FASTQ files with similar names as paired FASTQ files. To change this behavior, a `pairinfo.json` file can be supplied in the same folder as the FASTQ files. We have provided an example file at `examples/pairinfo.json`.
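
To make the idea of matching FASTQ files by similar names concrete, the following Python sketch pairs files whose names differ only in an `_R1`/`_R2` read token. This is an illustrative heuristic under that assumption, not the matching logic actually used by `fastq2codfreq`; the function name `pair_fastq` and the sample filenames are hypothetical. The format of `pairinfo.json` is not described here, so refer to `examples/pairinfo.json` for it.

```python
import re
from collections import defaultdict

# Assumption for this sketch: the read direction is encoded as "_R1" or "_R2",
# followed by "_" or "." (a common Illumina-style convention).
READ_TOKEN = re.compile(r"_R([12])(?=[_.])")


def pair_fastq(filenames):
    """Group filenames that differ only in their R1/R2 token.

    Returns a list of (read1, read2) tuples; read2 is None for files
    that have no mate.
    """
    groups = defaultdict(dict)
    for name in sorted(filenames):
        match = READ_TOKEN.search(name)
        if match is None:
            # No read token: treat the file as unpaired
            groups[name]["1"] = name
        else:
            # Remove the token so both mates share one grouping key
            key = name[:match.start()] + name[match.end():]
            groups[key][match.group(1)] = name

    result = []
    for members in groups.values():
        if "1" in members and "2" in members:
            result.append((members["1"], members["2"]))
        else:
            result.extend((single, None) for single in members.values())
    return result


files = ["s1_L001_R2_001.fastq.gz", "s1_L001_R1_001.fastq.gz", "s2.fastq"]
for read1, read2 in pair_fastq(files):
    print(read1, read2)
```

A heuristic like this fails when two unrelated samples share a name after the token is removed, which is one reason an explicit pairing file can be preferable.
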
The program fastp is used by default to trim adapters and to filter out low-quality regions and reads that are too short. `examples/fastp-config.json` lists all fastp options supported by this pipeline. Please refer to fastp's documentation for the usage and explanation of these options.
To apply your customized settings, create a `fastp-config.json` file and save it in the same folder as the FASTQ files. You can also disable adapter trimming, low Phred quality filtering or length filtering by setting the corresponding disabling flags to `true`.
The CodFreq pipeline supports trimming FASTA format primer sequences by using cutadapt. `examples/cutadapt-config.json` lists all cutadapt options supported by this pipeline. Please refer to cutadapt's reference guide for the usage and explanation of these options.
Three types of optional FASTA primer files can be supplied in the same folder as the FASTQ files: `primers3.fa`, `primers5.fa` and `primers53.fa`, which correspond to the “3’ adapters”, “5’ adapters”, and “5’ or 3’ adapters” described in cutadapt's user guide.
To enable primer trimming (FASTA), you must create a valid `cutadapt-config.json` file in the same folder as the FASTQ files.
The CodFreq pipeline supports trimming BED format primer locations by using ivar. `examples/ivar-trim-config.json` lists all `ivar trim` options supported by this pipeline. Please refer to ivar's manual for the usage and explanation of these options.
A BED primer file, `primers.bed`, can be supplied in the same folder as the FASTQ files (example: `examples/primers.bed`). ivar requires the BED6 format, a tab-delimited file with the following six columns (no header): reference, start, end, name, score, and strand. We have reviewed the ivar 4.1 source code and confirmed that ivar uses only four columns: start, end, name, and strand. The other two (reference and score) can be filled with any value to complete the BED6 format.
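
The six-column layout described above can be sketched in Python as follows. The helper `bed6_line`, the placeholder values `ref` and `60` for the reference and score columns, and the primer names and coordinates are all hypothetical, chosen only to show the shape of a BED6 row. In the BED format, start is 0-based and end is exclusive.

```python
def bed6_line(start, end, name, strand, reference="ref", score=60):
    """Format one tab-delimited BED6 row.

    `reference` and `score` are placeholders here: per the note above they
    only need to be present to complete the six columns.
    """
    if strand not in {"+", "-"}:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    fields = [reference, str(start), str(end), name, str(score), strand]
    return "\t".join(fields)


# Hypothetical primer locations, for illustration only
primers = [
    (10, 30, "amplicon1_LEFT", "+"),
    (380, 400, "amplicon1_RIGHT", "-"),
]
bed_text = "\n".join(bed6_line(*primer) for primer in primers) + "\n"
print(bed_text, end="")
```

Writing `bed_text` to a file named `primers.bed` next to the FASTQ files would give a file with no header line, as the format requires.
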
To enable primer trimming (BED), you must create a valid `ivar-trim-config.json` file in the same folder as the FASTQ files.
A script that uses only the Python standard library is provided to consolidate a codon frequency table (.codfreq or .codfreq.gz file) into an amino acid frequency table (.aafreq.csv file). The script merges rows of codons that translate into the same amino acid.
This script requires Python 3.9 or higher. The required Python runtime is included in the latest version of macOS and in most Linux releases. To install the latest Python version, please follow the instructions on the official website.
To use this script:

Download the script:

    sudo curl -sL https://raw.githubusercontent.com/hivdb/codfreq/main/scripts/codfreq2aafreq.py -o /usr/local/bin/codfreq2aafreq
    sudo chmod +x /usr/local/bin/codfreq2aafreq

Run the script:

    codfreq2aafreq dir/to/read/codfreqs dir/to/write/aafreqs
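
To show what merging codon rows into amino acid rows means, here is a minimal Python sketch of the idea using the standard genetic code. It is not the code of `codfreq2aafreq`: the function `merge_codons`, the tuple layout of the rows, the use of `X` for codons outside the standard table, and the sample counts are assumptions made for this example, and the real script may treat insertions, deletions and ambiguous codons differently.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def merge_codons(rows):
    """Sum codon counts per (gene, position, amino acid).

    `rows` is an iterable of (gene, position, total, codon, count) tuples.
    Codons missing from the standard table are reported as "X"
    (an assumption of this sketch).
    """
    counts = Counter()
    totals = {}
    for gene, position, total, codon, count in rows:
        aa = CODON_TABLE.get(codon.upper(), "X")
        counts[(gene, position, aa)] += count
        totals[(gene, position)] = total
    return [
        (gene, position, totals[(gene, position)], aa, n)
        for (gene, position, aa), n in sorted(counts.items())
    ]


# Hypothetical counts: GTG and GTA both translate to valine (V)
codon_rows = [
    ("RT", 184, 1000, "ATG", 880),
    ("RT", 184, 1000, "GTG", 70),
    ("RT", 184, 1000, "GTA", 50),
]
for row in merge_codons(codon_rows):
    print(row)
```
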