RibosomeCRCL/ribomethseq-nf

Nextflow pipeline dedicated to RiboMethSeq data processing

License

GPL-3.0 license

ribomethseq-nf

ribomethseq-nf is a Nextflow pipeline dedicated to RiboMethSeq data processing. It generates quality control data and counts, which can be directly used for further analyses with the rRMSAnalyzer package.

Versions

1.0 : first release

Software requirements

Nextflow (21.04 or later) is required to run this pipeline.

Warning: Nextflow versions older than 20.10.0 will likely fail to run ribomethseq-nf.

The following software is also required to run this pipeline :

bowtie2
samtools [>=1.16.1]
bedtools
Trimmomatic
FastQC
MultiQC
pandoc
R and the following libraries :
- ade4
- dplyr
- tidyr
- tibble
- rmarkdown
- pheatmap

If you do not want to install them manually, you can use either docker/singularity or conda to build the required environment (see Installation section).
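When the tools are installed manually, a quick check that they are visible in PATH can save a failed run. The following is a minimal sketch in Python, not part of the pipeline. The executable names are assumptions based on the list above (Trimmomatic, for instance, may only be available as a jar on your system), and the version parser assumes that `samtools --version` prints `samtools X.Y.Z` on its first line.

```python
import re
import shutil
import subprocess

# Executable names are assumptions; adjust them to your installation.
REQUIRED_TOOLS = ["bowtie2", "samtools", "bedtools", "trimmomatic",
                  "fastqc", "multiqc", "pandoc", "Rscript"]
MIN_SAMTOOLS = (1, 16, 1)  # minimum version listed in the requirements


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found in PATH."""
    return [tool for tool in tools if which(tool) is None]


def parse_samtools_version(version_output):
    """Turn the first line of `samtools --version` into a tuple of ints."""
    match = re.match(r"samtools\s+(\d+(?:\.\d+)*)", version_output)
    if match is None:
        raise ValueError("unrecognised samtools version output")
    return tuple(int(part) for part in match.group(1).split("."))


def samtools_is_recent_enough(version_output, minimum=MIN_SAMTOOLS):
    """Compare the parsed version against the minimum required one."""
    return parse_samtools_version(version_output) >= minimum


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    if shutil.which("samtools"):
        output = subprocess.run(["samtools", "--version"],
                                capture_output=True, text=True).stdout
        print("samtools recent enough:", samtools_is_recent_enough(output))
```

The `which` argument exists only so that the lookup can be replaced in tests. The R libraries are not covered by this sketch.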

Note: Earlier versions of samtools may work as well but will retain a few more SAM records, due to an issue in expression handling in samtools view prior to version 1.15 (see this issue). In our case this is related to the filtering expression -d 'NM' -e '![XS]' given to samtools. The effect is expected to be very minor.
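The intent of this filtering expression can be pictured with a few lines of Python. This is only an illustration on SAM text records, not the pipeline's code (the pipeline calls samtools). It relies on Bowtie2 behaviour: aligned reads carry an NM tag, and an XS tag is only present when a second alignment was found. The function names and example records are hypothetical.

```python
def optional_tags(sam_line):
    """Return the set of optional tag names (e.g. 'NM', 'XS') of a SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional fields look like TAG:TYPE:VALUE.
    return {field.split(":", 1)[0] for field in fields[11:]}


def keep_record(sam_line):
    """Mimic -d 'NM' -e '![XS]': NM must be present and XS must be absent."""
    tags = optional_tags(sam_line)
    return "NM" in tags and "XS" not in tags


def filter_sam(lines):
    """Yield header lines unchanged and only the alignment records that pass."""
    for line in lines:
        if line.startswith("@") or keep_record(line):
            yield line


# Hypothetical records: a unique hit, a multi-mapped read and an unmapped read.
HEADER = "@SQ\tSN:NR_046235.3_5.8S\tLN:157"
UNIQUE = "r1\t0\tNR_046235.3_5.8S\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\tNM:i:0"
MULTI = "r2\t0\tNR_046235.3_5.8S\t1\t1\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\tXS:i:0\tNM:i:0"
UNMAPPED = "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tYT:Z:UU"

if __name__ == "__main__":
    for record in filter_sam([HEADER, UNIQUE, MULTI, UNMAPPED]):
        print(record.split("\t")[0])
```

Only the header and the record without an XS tag survive, which is what "uniquely mapped" means in this context.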

General workflow description

The workflow is currently designed to process RiboMethSeq data generated in single-end mode on Illumina sequencers. For each detected fastq file (one per sample), quality control (FastQC) and read trimming (Trimmomatic) steps are launched. Trimmed reads are then aligned to rRNA reference sequences (human or mouse, included in the pipeline) with Bowtie2 in end-to-end mode with the --sensitive -L 17 parameters. Aligned files are then coordinate-sorted and filtered with samtools to obtain uniquely mapped reads. From these, we generate count (coverage) files using bedtools. A MultiQC report gathering metrics from FastQC, Trimmomatic and Bowtie2 is provided, as well as an HTML report with custom QC metrics computed on the full dataset (all detected fastq files).
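To make the notion of read-end counts concrete, here is a small Python sketch of what counting 5' (or 3') read ends per reference position means. It is not the pipeline's implementation, which relies on bedtools. The tuple layout (reference, start, stop, strand) with 1-based inclusive coordinates is an assumption made for the example.

```python
from collections import Counter


def read_end_counts(alignments, end="5p"):
    """Count read ends per (reference, position).

    `alignments` is an iterable of (reference, start, stop, strand) tuples
    with 1-based inclusive coordinates (a layout chosen for this sketch).
    """
    if end not in ("5p", "3p"):
        raise ValueError("end must be '5p' or '3p'")
    counts = Counter()
    for reference, start, stop, strand in alignments:
        # On the forward strand the 5' end is the leftmost aligned base;
        # on the reverse strand it is the rightmost one.
        five_prime = start if strand == "+" else stop
        three_prime = stop if strand == "+" else start
        position = five_prime if end == "5p" else three_prime
        counts[(reference, position)] += 1
    return counts


if __name__ == "__main__":
    reads = [("5.8S", 1, 30, "+"), ("5.8S", 1, 25, "+"), ("5.8S", 3, 30, "+")]
    print(read_end_counts(reads, "5p"))
    print(read_end_counts(reads, "3p"))
```

With the three example reads, position 1 receives two 5' ends and position 3 receives one, while the 3' ends fall on positions 30 (twice) and 25.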

Installation

To install the workflow, you simply need to clone this repository.

git clone https://github.com/RibosomeCRCL/riboMethseq-nf

Conda environment

Perhaps the easiest way to get a proper environment to run the pipeline is to use conda. Two YAML files are provided : docker/rms-processing.yml and docker/rms-report.yml, corresponding to two distinct environments. The former contains the requirements for the processing part of the pipeline (samtools, bowtie2, ...) and the latter is dedicated to report generation. We chose to split the two environments because solving the full conda environment could take a really long time.

You can either directly use the conda profile provided in nextflow.config, in which case the workflow will automatically build the environment at workflow initiation, or you can build it in advance (recommended), for instance with the following commands :

cd docker
conda env create -f rms-processing.yml
conda env create -f rms-report.yml

If you create the environments prior to running the workflow, you will then need to adapt the process.conda directives of the conda profile in nextflow.config as shown below :

conda {
    includeConfig "nf-config/exec.config"
    conda.enabled = true
    process {
        withName: 'fastqc|trim|bowtie2|filter|multiqc|counts' {
            conda = "/path/to/your/conda/envs/rms-processing"
        }
        withName: 'split|report' {
            conda = "/path/to/your/conda/envs/rms-report"
        }
    }
}

Docker image

A docker image can also be built using the provided docker/Dockerfile.prod. The image will be built upon a Debian 11 (bullseye) base image.

To build the docker image :

cd docker
docker build -t ribomethseq-nf:1.0 -f Dockerfile.prod .

If you need to mount specific path(s) of your infrastructure, adapt the docker profile in the configuration file (process.containerOptions).

More information on the docker image content can be found in the docker README.

Singularity image

There is no proper singularity recipe provided at the moment, but in the meantime you can convert the Docker image to a singularity image.

# first, create an archive of the docker image
docker save ribomethseq-nf:1.0 | gzip > ribomethseq-nf_1.0.tar.gz
# (... somewhere else ...)
# second, build a sif image from the archive
singularity build [--sandbox] ribomethseq-nf_1.0.sif docker-archive://ribomethseq-nf_1.0.tar.gz

If problems occur with the loop device, the option --sandbox can be used to build the singularity image.

The resulting ribomethseq-nf_1.0.sif must be in the directory specified by singularity.cacheDir in the configuration file.

If you need to mount specific path(s) of your infrastructure, adapt the singularity profile in the configuration file (process.containerOptions).

HPC environment

By default in nf-config/exec.config, the executor is set to slurm. You may need to adapt it to your own computing infrastructure (e.g. pbs, LSF, or other).

You can also set params.scheduler = 'local' if you plan to run locally on your computer, but then pay attention to control the queue size with --qsize (set to 20 by default) to avoid launching too many jobs for your hardware capabilities.

Tests

Some tests are provided in the tests directory. Once you have set up your software environment :

cd tests
# Test with tools installed in PATH
make test
# Test with conda profile
make test-conda
# Test with docker profile on local machine
make test-docker SCHEDULER=local
# Test with singularity profile on HPC with pbs
make test-singularity SCHEDULER=pbs
# Test with your own created profile (e.g. `custom_profile`)
make test PROFILE=custom_profile
# Clean the tests directory
make clean

Running the workflow

Let's suppose that you have cloned the repository in the following directory : /path/to/ribomethseq-nf

To run the workflow, you will need to specify both your execution environment and your species of interest (human or mouse). This is done here by combining nextflow profiles.

The human and mouse profiles are already available. For each species, they gather all the required reference data and the bowtie indexes to save you some time.

Run the workflow for human data with docker

nextflow run /path/to/ribomethseq-nf -profile human,docker \
    --fqdir $FastqDir \
    --outdir $OutDir

Run the workflow for mouse data with conda

nextflow run /path/to/ribomethseq-nf -profile mouse,conda \
    --fqdir $FastqDir \
    --outdir $OutDir

See the Input parameters section for a description of all available parameters.
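If you launch the workflow from a script, the profile combination described above can be assembled programmatically. The helper below is a hypothetical sketch in Python (the function name and the example paths are not part of the pipeline). It only builds the command line from the species and execution environment and does not run anything by itself.

```python
import shlex

SPECIES = {"human", "mouse"}  # species profiles shipped with the pipeline


def build_command(pipeline_dir, species, environment, fqdir, outdir="."):
    """Build the `nextflow run` command for a species/environment pair."""
    if species not in SPECIES:
        raise ValueError(f"no profile for species {species!r}")
    return ["nextflow", "run", pipeline_dir,
            "-profile", f"{species},{environment}",
            "--fqdir", fqdir,
            "--outdir", outdir]


if __name__ == "__main__":
    command = build_command("/path/to/ribomethseq-nf", "human", "docker",
                            "/data/fastq", "/data/results")
    # Print the command; pass the list to subprocess.run(command, check=True)
    # to actually launch it.
    print(shlex.join(command))
```

Keeping the command as a list avoids shell quoting problems when directory names contain spaces.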

Quickstart

Easiest way to get you started for the non-bioinformatician

You just need nextflow and conda available on your system.
Find out which job scheduler you use, e.g. pbs.

Then:

nextflow run RibosomeCRCL/ribomethseq-nf -profile conda,human --scheduler 'pbs' --qsize 10 --fqdir '/path/to/fastq/files'

This will automatically retrieve the nextflow pipeline from GitHub, build the required conda environment and finally process your data. Output files will be located in the current directory (default).

Reference data used by the pipeline

The following reference rRNA are used :

Human

https://www.ncbi.nlm.nih.gov/nuccore/NR_046235 (18S, 28S and 5.8S)
https://www.ncbi.nlm.nih.gov/nuccore/NR_023363.1 (5S)

Mouse

The associated fasta sequences for human and mouse organisms are stored in the data/fasta directory. It is possible to add other species, but it will be necessary to add a new profile for them in the nextflow.config file.

Precomputed bowtie2 indexes are also already provided for both human and mouse in the data/bowtie folder. If you need to use your own index, you can specify it through the --bowtie_index parameter.

Input parameters

                            +++++++++++++++++++++++++
                            +  ribomethseq-nf help  +
                            +++++++++++++++++++++++++

--fqdir              DIR    Fastq files location                       Required
--fastq_pattern      STR    Pattern for fastq file selection           Optional (.fastq.gz)
--adapters           FILE   (Trimmomatic) Path to illumina adapters    Optional ($baseDir/data/adapters/TruSeq3-SE.fa)
--leading            INT    (Trimmomatic) LEADING parameter            Optional (30)
--trailing           INT    (Trimmomatic) TRAILING parameter           Optional (30)
--slidingwindow      STR    (Trimmomatic) SLIDINGWINDOW parameter      Optional (4:15)
--avgqual            INT    (Trimmomatic) AVGQUAL parameter            Optional (30)
--minlen             INT    (Trimmomatic) MINLEN parameter             Optional (8)
--bowtie_index       FILE   (Bowtie) Path to index                     Optional ($baseDir/data/bowtie/human/human_index)
--bowtie_opts        STR    (Bowtie) additional options                Optional (--sensitive -L 17)
--samtools_opts      STR    (samtools) options to view                 Optional (--no-PG -h -u -d 'NM' -e '![XS]')
--bowtie_threads     INT    Threads for bowtie                         Optional (7)
--fastqc_threads     INT    Threads for fastqc                         Optional (2)
--trimmo_threads     INT    Threads for trimmomatic                    Optional (3)
--samtools_threads   INT    Threads for samtools                       Optional (4)
--fastqcoutput       FLAG   Export fastqc logs (html and zip)          Optional (false)
--trimoutput         FLAG   Export trimmomatic logs and trimmed fastq  Optional (false)
--threeendcount      FLAG   Export 3'end read count                    Optional (false)
--bowtieoutput       FLAG   Export Bowtie's logs                       Optional (false)
--samtoolsoutput     FLAG   Export unique BAM files                    Optional (false)
--split              FLAG   Split count files by RNA                   Optional (false)
--scheduler          STR    Job scheduler                              Optional (slurm)
--qsize              INT    Max number of parallel jobs                Optional (20)
--outdir             DIR    Output directory                           Optional (.)
--logdir             DIR    Log directory                              Optional ($outdir)
--help               FLAG   Displays this help

Output files

Overview

By default, the pipeline only copies the 5' read-end count files (see Read-end count files) and two quality control reports to the output directory :

multiqc_report.html : An RNA-seq report generated by MultiQC.
rms_report.html : A RiboMethSeq QC report (see RiboMethSeq (RMS) quality control Report).

Example output :

ribomethseq
├── counts
│   └── 5p
│       ├── sample1_R1_001.5_counts.csv
│       ├── sample2_R1_001.5_counts.csv
│       └── sample3_R1_001.5_counts.csv
├── multiqc_report.html
└── rms_report.html

Other directories can be copied by the pipeline in the output directory :

bowtie2 : BAM alignment files and logs from Bowtie 2. (--bowtieoutput)
counts/3p : 3' read-end count files. (--threeendcount)
trimmomatic : Trimmed fastq files and logs. (--trimoutput)
fastqc : fastqc report for each sample in both zip and html formats. (--fastqcoutput)

Read-end count files

The read-end count files represent the main output of this pipeline and are stored in the counts directory. By default, only the 5' read-end counts are exported (one file per sample). 3' read-end counts can also be exported alongside, using the --threeendcount parameter.

The files are in CSV format and have the following structure :

Ref RNA	position on ref RNA	5'/3' read-end count
NR_046235.3_5.8S	1	735
NR_046235.3_5.8S	2	173
NR_046235.3_5.8S	3	59
NR_046235.3_5.8S	4	32
NR_046235.3_5.8S	5	21

The column headers have been added in the above example for clarity, but the real output files do not have them.
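Since the files carry no header, a reader has to assign the three columns itself. Below is a minimal Python sketch for loading one count file into a dictionary. The field separator is an assumption: the example above suggests tab-separated values while the extension suggests commas, so the sketch accepts either. The function name is hypothetical.

```python
def parse_counts(lines):
    """Parse header-less count lines into {reference RNA: {position: count}}.

    `lines` can be an open file handle or any iterable of text lines.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Assumption: fields are separated by a tab or, failing that, a comma.
        delimiter = "\t" if "\t" in line else ","
        reference, position, count = line.split(delimiter)
        counts.setdefault(reference, {})[int(position)] = int(count)
    return counts


if __name__ == "__main__":
    example = ["NR_046235.3_5.8S\t1\t735\n", "NR_046235.3_5.8S\t2\t173\n"]
    print(parse_counts(example))
    # With a real file:
    # with open("counts/5p/sample1_R1_001.5_counts.csv") as handle:
    #     counts = parse_counts(handle)
```

Indexing by reference RNA first makes it easy to extract the coverage profile of a single rRNA across samples.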

RiboMethSeq (RMS) quality control Report

The RiboMethSeq QC report is stored in rms_report.html, at the root of the output directory.

It currently contains the following analyses :

A read-end count boxplot and RLE for each sample.
A distance heatmap to compare coverage profiles between samples.
A correspondence analysis of the coverage profiles.
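To give an idea of what comparing coverage profiles between samples involves, here is a small Python sketch that computes a pairwise distance matrix. It is an illustration only: the report is generated with R, and the distance metric it uses is not stated here, so the choice of Euclidean distance on profiles scaled to a total of 1 is an assumption. All names are hypothetical.

```python
import math


def normalise(profile):
    """Scale a list of counts so that it sums to 1 (all zeros stay zeros)."""
    total = sum(profile)
    if total == 0:
        return [0.0] * len(profile)
    return [value / total for value in profile]


def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same positions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(profiles):
    """Return (sample names, matrix) for a {sample: [counts]} dictionary."""
    names = sorted(profiles)
    scaled = {name: normalise(profiles[name]) for name in names}
    matrix = [[euclidean(scaled[a], scaled[b]) for b in names] for a in names]
    return names, matrix


if __name__ == "__main__":
    names, matrix = distance_matrix({"s1": [1, 1], "s2": [2, 2], "s3": [4, 0]})
    for name, row in zip(names, matrix):
        print(name, [round(value, 3) for value in row])
```

Scaling each profile first means that two samples with the same coverage shape but different sequencing depth end up at distance 0.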
