vibaotram/baseDmux


A Snakemake workflow for basecalling and gathering ONT reads originating from disparate runs and barcodes

Basecalling by GUPPY + demultiplexing by GUPPY and/or DEEPBINNER + MinIONQC/MultiQC + QC reports + read aggregation into bins + fastq read trimming + filtering

Requirements

  • singularity >= 2.5
  • conda >=4.3 + Mamba
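
You can quickly check that these requirements are met (a minimal sketch; the commands simply print the installed versions for comparison with the minimums above):

    singularity --version   # should report >= 2.5
    conda --version         # should report >= 4.3
    mamba --version         # Mamba must be available alongside conda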

Implemented tools

  • Snakemake
  • Guppy
  • Deepbinner
  • MinIONQC
  • MultiQC
  • Porechop
  • Filtlong

We try to update the tools regularly. See versions in the folders containing the conda environment and Singularity container recipe files.

More details about individual Snakemake rules

  • Guppy basecalling
    Run guppy_basecaller with read filtering (see the command sketch after this list), then subset fast5 reads from the passed reads list (passed_sequencing_summary.txt).

  • Guppy demultiplexing
    Run guppy_barcoder on the passed fastq, then subset fastq into classified barcode folders based on barcoding_summary.txt.

  • Multi to single fast5
    Convert passed multi-read fast5 files to single-read fast5 files, in preparation for Deepbinner.

  • Deepbinner classification
    Run deepbinner classify on the passed single-read fast5 files and output a classification file.

  • Deepbinner bin
    Bin the passed fastq reads based on the classification file, then subset fastq into barcode folders.

  • Get sequencing summary per barcode
    Subset passed_sequencing_summary.txt according to barcode IDs, in preparation for MinIONQC/MultiQC of each barcode and for subsetting fast5 reads per barcode (get multi-read fast5 per barcode).

  • MinIONQC and Multiqc
    After basecalling, MinIONQC is performed for each run, and MultiQC reports all runs collectively. After demultiplexing, MinIONQC runs for each barcode separately, then MultiQC aggregates the MinIONQC results of all barcodes.

  • Demultiplex report (optional)
    Compare demultiplexing results from different runs, and from different demultiplexers (guppy and/or deepbinner), by analyzing the information in multiqc_minionqc.txt. It is only available when demultiplexing rules are executed.

  • Get reads per genome (optional)
    Combine and concatenate fast5 and fastq barcodes per genome based on the demultiplexer program, in preparation for downstream genome assembly, following the information in the barcodeByGenome_sample.tsv tabulated file (the column names of this table must not be modified).
    Caution: if guppy or deepbinner appears in the Demultiplexer column of the barcodeByGenome table, it will be executed even if it is not specified in config['DEMULTIPLEXER'].

  • Porechop (optional)
    Find and remove adapters from reads. See here for more information.

  • Filtlong (optional)
    Filter reads by length and by quality. More details are available here. Several Filtlong runs can be executed at the same time.
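
For orientation, here is a minimal sketch of the kind of commands these rules wrap. All paths, flow cell and kit names, and thresholds are illustrative assumptions, not the exact invocations baseDmux generates; consult the workflow rules for the real ones.

    # Illustrative only: paths, flowcell/kit names and thresholds are
    # assumptions, not the exact flags used by the baseDmux rules.

    # Guppy basecalling of one run
    guppy_basecaller \
        -i indir/run_id1/fast5 \
        -s outdir/basecall/run_id1 \
        --flowcell FLO-MIN106 --kit SQK-LSK109

    # Guppy demultiplexing of the passed fastq
    guppy_barcoder \
        -i outdir/basecall/run_id1/pass \
        -s outdir/demultiplex/guppy/run_id1 \
        --barcode_kits EXP-NBD104

    # Optional adapter trimming and length/quality filtering of one barcode bin
    porechop -i barcode01.fastq.gz -o barcode01.trimmed.fastq.gz
    filtlong --min_length 1000 --keep_percent 90 barcode01.trimmed.fastq.gz \
        | gzip > barcode01.filtered.fastq.gz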

Singularity containers

Workflow jobs run inside Singularity images (see our Singularity recipe files).

The latest containers will be automatically downloaded and installed in the baseDmux environment installation directory. Alternatively, they can be downloaded manually from the IRD Drive.

Custom Singularity images can be specified by editing the ./baseDmux/data/singularity.yaml file, either right after cloning the GitHub repository or directly in your baseDmux installation location (see below).
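
As a purely hypothetical sketch (the key names and values here are assumptions; open the singularity.yaml file shipped with baseDmux for the real schema), such a file maps tools to container images:

    # Hypothetical sketch only; see baseDmux/data/singularity.yaml for the
    # actual schema.
    guppy: /path/to/custom/guppy_container.sif
    deepbinner: /path/to/custom/deepbinner_container.sif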

Conda environments

Inside the Singularity images, individual Snakemake rules use dedicated conda environment yaml files, which are located there:

  • minionqc
  • multiqc
  • rmarkdown
  • porechop
  • filtlong

Installation

Download the package:

    git clone https://github.com/vibaotram/baseDmux.git
    cd ./baseDmux

And then, install in a virtualenv...

    make install
    source venv/bin/activate

... or install in a conda environment

    conda env create -n baseDmux -f environment.yaml
    conda activate baseDmux
    pip install .

It is recommended to first run the local test below with the toy dataset to make sure everything works well. On the first invocation, this will download and install the Singularity images and set up the conda environments. This process takes time, so be patient. Note also that in the end, this setup amounts to a total of about 12 GB of files, so you need some room on the installation disk.
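
After installation, a quick sanity check is to print the version and the top-level help (the available options are shown in the Usage section below):

    baseDmux -v
    baseDmux -h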

Usage

    Run baseDmux version 1.1.0 ... See https://github.com/vibaotram/baseDmux/blob/master/README.md for more details

    positional arguments:
      {configure,run,dryrun,version_tools}
        configure           edit config file and profile
        run                 run baseDmux
        dryrun              dryrun baseDmux
        version_tools       check version for the tools of baseDmux

    options:
      -h, --help            show this help message and exit
      -v, --version         show program's version number and exit

baseDmux configuration

Because configuring Snakemake workflows can be a bit intimidating, we try to clarify below the main principles of baseDmux configuration.

This is done primarily by adjusting the parameters listed in the workflow config file profile/workflow_parameters.yaml (generated by baseDmux configure - see below). It lets you set up the input reads, output folder, tool parameters, report generation, etc.
This actually corresponds to the typical Snakemake 'config.yaml' file. You can take a look at the file that serves as a template to create profile/workflow_parameters.yaml. It is suggested to refer to the comments in this file for further details on individual parameters.

baseDmux takes as input a folder with internal ONT 'run' folders that each contain a 'fast5' folder. This is the typical file hierarchy when sequencing with a MinION. baseDmux can therefore process a virtually unlimited number of (multiplexed) sequencing runs.

You can decide whether guppy and deepbinner should run on GPU or CPU by specifying 'RESOURCE' in the config.yaml file, depending on the available computing hardware (see the sketch below). Note that Deepbinner is no longer maintained and that Deepbinner models are limited to specific 'earlier' flow cells and barcoding kits. One should therefore ensure that Deepbinner is adequate for the data at hand.
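
As a minimal sketch, the corresponding entries in the workflow config could look as follows; the exact accepted values are assumptions here, so check the comments in the generated workflow_parameters.yaml:

    # Illustrative values only; verify against the comments in the generated
    # workflow_parameters.yaml.
    RESOURCE: GPU            # or CPU, depending on the available hardware
    DEMULTIPLEXER: [guppy]   # guppy and/or deepbinner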

A typical use case for baseDmux is to prepare filtered sequencing reads in individual fastq files for genome assembly (or transcript analysis) when users have a number of genomic DNA (or RNA) preparations sequenced with the same library preparation protocol and flow cell type but over several runs with various sets of multiplex barcodes. For this, it is necessary to run the complete workflow. Note, however, that the runs currently need to share, if not identical, at least 'compatible' (in the guppy sense) library construction kits and flow cell types.

Users need to prepare a Barcode by genome file. This is a roadmap table for subsetting fastq and fast5 reads, demultiplexed with guppy and/or deepbinner and coming from disparate runs and barcodes, into bins corresponding to individual 'genomes' (or samples). It must contain at least the following columns: Demultiplexer, Run_ID, ONT_Barcode, Genome_ID. Values in the Genome_ID column correspond to the labels of the bins into which reads will eventually be grouped. Make sure that these labels do NOT contain spaces " " or other special characters like '|' '$' ':'. As separators, the safest options are "_" or "-".
Likewise, Run_ID values should not contain special characters. In addition, these values must match the names of the top folders in the input fast5 directory.
Importantly, the Barcode by genome file does not only enable grouping reads; providing such a file is also necessary for the porechop and filtlong rules to be executed. A template is provided (see the section below on configuration).
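
As a minimal illustration (the run, barcode, and genome labels below are made up; the provided barcodeByGenome_sample.tsv template is authoritative), the tab-separated table could look like:

    Demultiplexer    Run_ID     ONT_Barcode    Genome_ID
    guppy            run_id1    barcode01      sample_A
    guppy            run_id2    barcode03      sample_A
    deepbinner       run_id1    barcode02      sample_B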

Apart from the workflow parameters, there are also additional parameter files that are required to specify Snakemake invocation arguments and, when baseDmux is run with an HPC scheduler, parameters regarding how specific jobs need to be submitted. All these configuration files are gathered inside a "profile" directory that can be automatically prototyped with the commands below. This is in line with the recommended way of configuring Snakemake pipelines using profiles.

Generating template configuration files

To simplify configuration, the baseDmux configure command generates a 'template' configuration profile for general use cases. These files can subsequently be modified to fit specific situations.

    usage: baseDmux configure [-h] --mode {local,slurm,cluster,iTrop} [--barcodes_by_genome]
                              [--edit [EDITOR]]
                              dir

    positional arguments:
      dir                   path to the folder to contain config file and profile you want to create

    options:
      -h, --help            show this help message and exit
      --mode {local,slurm,cluster,iTrop}
                            choose the mode of running Snakemake, local mode or cluster mode
      --barcodes_by_genome  optional, create a tabular file containing information of barcodes for each
                            genome
      --edit [EDITOR]       optional, open files with editor (nano, vim, gedit, etc.)

These files will be created:

    dir
    └── profile
        ├── config.yaml
        ├── workflow_parameter.yaml
        ├── barcodesByGenome.tsv (if --barcodes_by_genome)
        └── ... (if mode slurm)

A 'profile' folder will be created and populated in the specified dir path. The files may be modified, and the whole folder can be moved/copied/renamed anywhere, as long as you use the correct path when you call baseDmux run and update the enclosed files, config.yaml and workflow_parameter.yaml, with the new paths of workflow_parameter.yaml and barcodesByGenome.tsv, respectively.

With the --barcodes_by_genome option, a formatted file barcodesByGenome.tsv will be created (and its path appropriately specified in workflow_parameter.yaml). One can then modify the information in the table accordingly. It is important that this table contains at least the same columns as those in the provided example barcodeByGenome_sample.tsv file.

Once you have adapted the templates for your typical use cases, there is no need to rerun baseDmux configure; just copy and adapt your existing templates.

Note: the 'iTrop' and 'cluster' modes are obsolete and will eventually be removed.

An example: preparing to run Snakemake locally (laptop, local node on an HPC)

Use this command:

baseDmux configure ./test_baseDmux --mode local --barcodes_by_genome

Then workflow_parameter.yaml and config.yaml will be created inside a profile folder within ./test_baseDmux. ./test_baseDmux/profile/config.yaml contains a set of parameters for the Snakemake command line.

An example: preparing a template to run Snakemake on an HPC with SLURM

Similarly, run the command below:

baseDmux configure ./test_baseDmux --mode slurm --barcodes_by_genome

This will create a ./test_baseDmux/profile folder (see an example here) with, in addition to the already mentioned files, the necessary file templates to run baseDmux with SLURM and the Snakemake profile for configuring Snakemake to run on the SLURM Workload Manager.

For other HPC job management systems (SGE, ...) and more information on Snakemake profiles and other utilities, refer to the Snakemake documentation and [this GitHub repository](https://github.com/Snakemake-Profiles).

Ultimately, the required files for passing HPC scheduler parameters through the dedicated Snakemake mechanism of 'profiles' need to be stored in the folder whose path is passed to the baseDmux profile_dir parameter, and they will most certainly need to be adapted to suit your specific needs. A hedged sketch of such a profile is shown below.
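
As a hedged illustration of what such a profile can contain (a generic Snakemake profile sketch, not the exact files generated by baseDmux configure --mode slurm), the profile's config.yaml maps Snakemake command-line options to default values:

    # Generic Snakemake profile sketch for SLURM; the files generated by
    # 'baseDmux configure --mode slurm' may differ. Assumes rules define
    # threads and resources.mem_mb.
    jobs: 20
    use-conda: true
    use-singularity: true
    cluster: "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}"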

Run the workflow with the created profile:

    usage: baseDmux run [-h] [--snakemake_report] profile_dir

    positional arguments:
      profile_dir         profile folder to run baseDmux

    options:
      -h, --help          show this help message and exit
      --snakemake_report  optionally, create snakemake report

Example:
You can run baseDmux dryrun ./test_baseDmux/profile as a dry run to make sure that everything is OK before actually executing the workflow.

baseDmux run ./test_baseDmux/profile

With the option --snakemake_report, a report file snakemake_report.html will be created in the report folder of the pipeline output directory once Snakemake has successfully finished the workflow.
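
For example, matching the usage line above:

    baseDmux run --snakemake_report ./test_baseDmux/profile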


Get your hands dirty and run a local test

Assuming the environment for baseDmux has been created as specified in the dedicated section on Installation, first activate either the conda or venv environment.

You can use the fast5 read files in the sample/reads_intermediate folder for testing, and generate a local profile directory.

    ## create configuration file for Snakemake and Snakemake profile,
    ## and (optional) a tsv file containing information about genomes corresponding to barcode IDs
    mkdir ./test_baseDmux
    baseDmux configure ./test_baseDmux --mode local --barcodes_by_genome

    ## copy sample reads to a test folder
    cp -r ./baseDmux/sample/reads_intermediate/ ./test_baseDmux/reads

    ## check the workflow by dryrun, then run
    baseDmux dryrun ./test_baseDmux/profile
    baseDmux run ./test_baseDmux/profile

The output will be written to ./test_baseDmux/results by default. The first run may take a long time, as the Singularity containers need to be downloaded and the conda environments installed, even when using Mamba.
On a personal computer with only a few CPUs, even with this very minimal dataset, guppy basecalling may also take several minutes... So be patient, depending on your underlying computing infrastructure capacities.

Input and Output

The input directory must follow the structure below. A 'fast5' directory containing fast5 files in each run is MANDATORY for baseDmux to identify the various Run_ID(s).

    indir/
    ├── run_id1
    │   └── fast5
    │       ├── file_1.fast5
    │       ├── ...
    │       └── file_n.fast5
    ├── ...
    └── run_idx

The output directory will be structured as follows:

    outdir/
    ├── basecall
    │   ├── run_id1
    │   │   ├── sequencing_summary.txt
    │   │   └── {MinIONQC results}
    │   ├── ...
    │   ├── run_idx
    │   └── multiqc
    │       ├── multiqc_data
    │       └── multiqc_report.html
    ├── demultiplex
    │   ├── deepbinner
    │   │   ├── run_id1
    │   │   │   ├── barcode01
    │   │   │   │   ├── barcode01.fastq.gz
    │   │   │   │   ├── fast5
    │   │   │   │   ├── sequencing_summary.txt
    │   │   │   │   └── {MinIONQC results}
    │   │   │   ├── ...
    │   │   │   ├── barcodexxx
    │   │   │   ├── classification
    │   │   │   ├── fast5_per_barcode.done
    │   │   │   ├── multiqc
    │   │   │   └── unclassified
    │   │   ├── ...
    │   │   └── run_idx
    │   └── guppy
    │       ├── run_id1
    │       │   ├── barcode01
    │       │   │   ├── barcode01.fastq.gz
    │       │   │   ├── fast5
    │       │   │   ├── sequencing_summary.txt
    │       │   │   └── {MinIONQC results}
    │       │   ├── ...
    │       │   ├── barcodexxx
    │       │   ├── barcoding_summary.txt
    │       │   ├── fast5_per_barcode.done
    │       │   ├── multiqc
    │       │   └── unclassified
    │       ├── ...
    │       └── run_idx
    ├── reads_per_genome
    │   ├── fast5
    │   ├── fastq
    │   └── reads_per_genome.csv
    ├── log
    │   ├── slurm
    │   └── snakemake
    └── report
        ├── demultiplex_report.html
        ├── demultiplex_report.RData
        └── demultiplex_report.tsv
