- Notifications
You must be signed in to change notification settings - Fork2
Nextflow pipeline dedicated to RiboMethSeq data processing
License
RibosomeCRCL/ribomethseq-nf
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
ribomethseq-nf
is anextflow pipeline dedicated to RiboMethSeq data processing. It generates quality control data and counts, which can be directly used for further analyses using the rRMSAnalyzer package.
- 1.0 : first release
Nextflow (21.04 or later) is required to run this pipeline.
WarningOlder Nextflow versions before 20.10.0 will likely fail to run RiboMethSeq-nf.
The following software program/packages are also required to run this pipeline.
- bowtie2
- samtools [>=1.16.1]
- bedtools
- Trimmomatic
- FastQC
- MultiQC
- pandoc
- R and the following libraries :
- ade4
- dplyr
- tidyr
- tibble
- rmarkdown
- pheatmap
If you do not want to install them manually, you can use either docker/singularityor conda to build the required environment (see Installation section)
Note: Earlier version of samtools may work as well but will retain a few moreSAM records due to an issue in expression handling insamtools view
prior toversion 1.15 (Seethis issue).In our case this is related to this filtering expression-d 'NM' -e '![XS]'
givento samtools. This is expected to be really minor though.
The workflow is currently designed to process RiboMethSeq data generated insingle-end mode on Illumina sequencers. For each detected fastq file (one persample), quality control (FastQC) and reads trimming (Trimmomatic) steps arelaunched. Trimmed reads are then aligned on rRNA reference sequences (human or mouse,included in the pipeline) with Bowtie2 in end-to-end mode with--sensitive -L 17
parameters.Aligned files are then coordinate sorted and filtered with samtools to obtain uniquely mapped reads.From these, we generate count (coverage) files using bedtools. A MultiQC reportgathering metrics from FastQC, Trimmomatic and Bowtie2 is provided as well as anHTML report with custom QC metrics computed on the full dataset (all detectedfastq files).
To have the workflow installed you simply need to clone this repository.
git clone https://github.com/RibosomeCRCL/riboMethseq-nf
Perhaps the easiest way to have a proper environment to run the pipeline is touse conda. Two YAML files are provided :docker/rms-processing.yml
anddocker/rms-report.yml
corresponding to two distinct environments. The formercontains requirements for the processing part of the pipeline (samtools, bowtie2, ...)and the latter is dedicated to the report generation. We chose to split the twoenvironment because solving the full conda environment could take a really long time.
You can either directly use the conda profile provided innextflow.config
and the workflow will automatically build the environment at workflowinitiation or you can build it in advance (recommended) with for instance thefollowing command:
cd dockerconda env create -f rms-processing.ymlconda env create -f rms-report.yml
If you create the environment prior to running the workflow, you will then needto adapt theprocess.conda
directives from the conda profile innextflow.config
as shown below :
conda {includeConfig "nf-config/exec.config"conda.enabled = trueprocess {withName: 'fastqc|trim|bowtie2|filter|multiqc|counts' {conda = "/path/to/your/conda/envs/rms-processing"}withName: 'split|report' {conda = "/path/to/your/conda/envs/rms-report"}}}
A docker image can also be built using the provideddocker/Dockerfile.prod
. Theimage will be built upon a debian 11 (bullseye) base image.
To build the docker image :
cd dockerdocker build -t ribomethseq-nf:1.0 -f Dockerfile.prod.
If you need to mount specific path(s) of your infrastructure, adapt the dockerprofile in the configuration file (process.containerOptions
).
More information on docker image content in thedocker README.
There is no proper singularity recipe provided at the moment, but in the meantimeyou can convert the Docker image to a singularity image.
# first, create an archive of the docker imagedocker save ribomethseq-nf:1.0| gzip> ribomethseq-nf_1.0.tar.gz# (... somewhere else ...)# second, build a sif image from the archivesingularity build [--sandbox] ribomethseq-nf_1.0.sif docker-archive://ribomethseq-nf_1.0.tar.gz
If problems occur with the loop device, the option --sandbox can be used to build the singularity image.
The resulting ribomethseq-nf_1.0.sif must be in the directory specifed bysingularity.cacheDir
in the configuration file.
If you need to mount specific path(s) of your infrastructure, adapt the Singularityprofile in the configuration file (process.containerOptions
).
By default innf-config/exec.config
, the executor is set toslurm
. You mayneed to adapt it to your own computing infrastructure (e.g.pbs
,LSF
, or other)
You can also setparams.scheduler = 'local'
if you plan to run locally on yourcomputer but then pay attention to control the queue size with--qsize
(set to20 by default) to avoid launching to many jobs depending on your hardwarecapabilities.
Some tests are provided in thetests
directory. Once you have set up yoursoftware environment :
cd tests# Test with tools installed in PATHmaketest# Test with conda profilemake test-conda# Test with docker profile on local machinemake test-docker SCHEDULER=local# Test with singularity profile on HPC with pbsmake test-singularity SCHEDULER=pbs# Test with your own created profile (e.g. `custom_profile`)maketest PROFILE=custom_profile# Clean the tests directorymake clean
Let's suppose that you have cloned the repository in the following directory:/path/to/ribomethseq-nf
To run the workflow, you will need to specify both your execution environmentand your species (human or mouse) of interest. This is done here by combiningnextflow profiles.
human
andmouse
profiles are already available. They gather for each speciesall the required reference data and the bowtie indexes to save you some time.
nextflow run /path/to/ribomethseq-nf -profile human,docker \ --fqdir$FastqDir \ --outdir$OutDir
nextflow run /path/to/ribomethseq-nf -profile mouse,conda \ --fqdir$FastqDir \ --outdir$OutDir
SeeInput parameters
section for a description of all available parameters.
Easiest way to get you started for the non-bioinformatician
- You just need nextflow and conda available on your system.
- Find out what is your job scheduler. e.g.
pbs
Then:
nextflow RibosomeCRCL/ribomethseq-nf -profile conda,human --scheduler 'pbs' --qsize 10 --fqdir '/path/to/fastq/files'
This will automatically retrieve the nextflow pipeline from GitHub, build therequired conda environment and finally process your data. Output files will belocated in current directory (default).
The following reference rRNA are used :
- https://www.ncbi.nlm.nih.gov/nuccore/NR_046235 (18S, 28S and 5.8S)
- https://www.ncbi.nlm.nih.gov/nuccore/NR_023363.1 (5S)
- https://www.ncbi.nlm.nih.gov/nuccore/NR_030686.1 (5S)
- https://www.ncbi.nlm.nih.gov/nuccore/NR_003280.2 (5.8S)
- https://www.ncbi.nlm.nih.gov/nuccore/NR_003278.3 (18S)
- https://www.ncbi.nlm.nih.gov/nuccore/NR_003279.1 (28S)
The associated fasta sequences for human and mouse organisms are stored in thedata/fasta
directory. It is possible to add other species, but it will benecessary to add a new profile innextflow.config
file for them.
Precomputedbowtie2
indexes are also already provided for both human and mousein the folderdata/bowtie
. If you need to use your own index, you can specifyit through the--bowtie_index
parameter.
+++++++++++++++++++++++++ + ribomethseq-nf help + +++++++++++++++++++++++++--fqdir DIR Fastq files location Required--fastq_pattern STR Pattern for fastq file selection Optional (.fastq.gz)--adapters FILE (Trimmomatic) Path to illumina adapters Optional ($baseDir/data/adapters/TruSeq3-SE.fa)--leading INT (Trimmomatic) LEADING parameter Optional (30)--trailing INT (Trimmomatic) TRAILING parameter Optional (30)--slidingwindow STR (Trimmomatic) SLIDINGWINDOW parameter Optional (4:15)--avgqual INT (Trimmomatic) AVGQUAL parameter Optional (30)--minlen INT (Trimmomatic) MINLEN parameter Optional (8)--bowtie_index FILE (Bowtie) Path to index Optional ($baseDir/data/bowtie/human/human_index)--bowtie_opts STR (Bowtie) additional options Optional (--sensitive -L 17)--samtools_opts STR (samtools) options to view Optional (--no-PG -h -u -d 'NM' -e '![XS]')--bowtie_threads INT Threads for bowtie Optional (7)--fastqc_threads INT Threads for fastqc Optional (2)--trimmo_threads INT Threads for trimmomatic Optional (3)--samtools_threads INT Threads for samtools Optional (4)--fastqcoutput FLAG Export fastqc logs (html and zip) Optional (false)--trimoutput FLAG Export trimmomatic logs and trimmed fastq Optional (false)--threeendcount FLAG Export 3'end read count Optional (false)--bowtieoutput FLAG Export Bowtie's logs Optional (false)--samtoolsoutput FLAG Export unique BAM files Optional (false)--split FLAG Split count files by RNA Optional (false)--scheduler STR Job scheduler Optional (slurm)--qsize INT Max number of parallel jobs Optional (20)--outdir DIR Output directory Optional (.)--logdir DIR Log directory Optional ($outdir)--help FLAG Displays this help
By default, the pipeline will only copy 5' read-end count files (SeeRead-end count files and two quality control reports in the output directory :
- multiqc_report.html : A RNA-seq report generated by MultiQC.
- rms_report.html : A RiboMethSeq report generated (SeeRiboMethSeq (RMS) Report)
Example output :
ribomethseq├── counts│ └── 5p│ ├── sample1_R1_001.5_counts.csv│ ├── sample2_R1_001.5_counts.csv│ └── sample3_R1_001.5_counts.csv├── multiqc_report.html└── rms_report.html
Other directories can be copied by the pipeline in the output directory :
- bowtie2 : BAM alignment files and logs from Bowtie 2. (
--bowtieoutput
) - counts/3p : 3' read-end count files. (
--threeendcount
) - trimmomatic : Trimmed fastq files and logs. (
--trimoutput
) - fastqc : fastqc report for each sample in both zip and html formats. (
--fastqcoutput
)
The read-end count files represent the main output from this pipeline and are stored in the counts directory.By default, only the 5'end-read counts are exported (one file per sample). 3'end-read counts can also be exported alongside, using the--threeandcount
parameter.
The files are in CSV format and have the following structure :
Ref RNA | position on ref RNA | 5/3' read en count |
---|---|---|
NR_046235.3_5.8S | 1 | 735 |
NR_046235.3_5.8S | 2 | 173 |
NR_046235.3_5.8S | 3 | 59 |
NR_046235.3_5.8S | 4 | 32 |
NR_046235.3_5.8S | 5 | 21 |
The column headers have been added in the above example for clarity, but the real output files do not have them.
The RiboMethSeq QC report is stored in rms_report.html, at the root of the output directory.
It currently contains the following analyses :
- A end-read count boxplot and RLE for each samples.
- A distance heatmap to compare coverage profiles between samples.
- A correspondence analysis of the coverage profiles.
About
Nextflow pipeline dedicated to RiboMethSeq data processing
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Releases
Packages0
Contributors3
Uh oh!
There was an error while loading.Please reload this page.