# LUMC/hutspot

Multisample DNAseq variant calling pipeline for use in diagnostics.
This is a multi-sample DNA variant calling pipeline based on Snakemake, bwa and the GATK HaplotypeCaller.
- Any number of samples is supported.
- Whole-genome calling, regardless of wet-lab library preparation.
- Follows modern best practices:
  - Each sample is individually called as a GVCF.
  - A VCF is then produced by genotyping the individual GVCFs separately for each sample.
- Data parallelization for the calling and genotyping steps:
  - Using the `scatter_size` setting in the configuration file, the reference genome is split into chunks, and each chunk can be processed independently. The default value of 1 billion will scatter the human reference genome into 6 chunks.
- Reasonably fast: 96 exomes in < 24 hours.
- No unnecessary jobs.
- Calculates coverage metrics if a `bedfile` is specified.
- Fully containerized rules through singularity and biocontainers. Legacy conda environments are no longer available.
This repository contains a conda environment file that you can use to install all dependencies in a conda environment:

```bash
conda env create -f environment.yml
```
We highly recommend the use of the containerized rules through singularity.
This option does require you to install singularity on your system. As this usually requires administrative privileges, singularity is not contained within our provided conda environment file.
If you want to use singularity, make sure you install version 3 or higher.
If you happen to use Debian buster, singularity 3.0.3 comes straight out of the box with a simple:

```bash
sudo apt install singularity-container
```
You can run singularity within a docker container. Please note that the container MUST run in privileged mode for this to work.
We have provided our own container that includes singularity and snakemake here.
If you don't use Debian buster and cannot run a privileged docker container, you will - unfortunately :-( - have to install singularity manually. Please see the installation instructions here on how to do that.
Hutspot was tested on Ubuntu 16.04 only. It should reasonably work on most modern Linux distributions.
For every sample you wish to analyze, we require one or more paired-end readgroups in fastq format. They must be compressed with either `gzip` or `bgzip`.
The configuration must be passed to the pipeline through a configuration file. This is a json file listing the samples and their associated readgroups, as well as the other settings to be used. An example config json can be found here, and a json schema describing the configuration file can be found here. This json schema can also be used to validate your configuration file.
The following reference files must be provided in the configuration:

- `reference`: A reference genome, in fasta format. Must be indexed with `samtools faidx`.
- `dbsnp`: A dbSNP VCF file.
- `known_sites`: One or more VCF files with known sites for base recalibration.
The following reference files may be provided:

- `targetsfile`: Bed file of the targets of the capture kit. Used to calculate coverage.
- `baitsfile`: Bed file of the baits of the capture kit. Used to calculate picard HsMetrics.
- `refflat`: A refFlat file to calculate coverage over transcripts.
- `scatter_size`: Size of the chunks to split the variant calling into.
- `female_threshold`: Fraction of reads between X and the autosomes to call a sample as female.
After installing and activating the main conda environment, as described above, the pipeline can be started with:

```bash
snakemake -s Snakefile \
    --use-singularity \
    --configfile tests/data/config/sample_config.json
```
This would start all jobs locally. Obviously this is not what one would regularly do for a normal pipeline run. How to submit jobs on a cluster is described later. Let's first move on to the necessary configuration values.
The required and optional outputs are specified in the json schema located in `config/schema.json`. Before running, the content of the `--configfile` is validated against this schema.
The following configuration values are **required**:

| configuration | description |
|---|---|
| reference | Absolute path to fasta file |
| samples | One or more samples, with associated fastq files |
| dbsnp | Path to dbSNP VCF file |
| known_sites | Path to one or more VCF files with known sites. Can be the same as the dbsnp file |
The following configuration options are **optional**:

| configuration | description |
|---|---|
| targetsfile | Bed file of the targets of the capture kit. Used to calculate coverage |
| baitsfile | Bed file of the baits of the capture kit. Used to calculate picard HsMetrics |
| female_threshold | Float between 0 and 1 that signifies the threshold of the ratio between coverage on X/overall coverage that 'calls' a sample as female. Default = 0.6 |
| scatter_size | The size of chunks to divide the reference into for parallel execution. Default = 1000000000 |
| coverage_threshold | One or more threshold coverage values. For each value, a sample-specific bed file will be created that contains the regions where the coverage is above the threshold |
| restrict_BQSR | Restrict GATK BaseRecalibration to a single chromosome. This is faster, but the recalibration is possibly less reliable |
| multisample_vcf | Create a true multisample VCF file, in addition to the regular per-sample VCF files |
To run on a cluster, snakemake needs to be called with some extra arguments. Additionally, it needs a cluster yaml file describing resources per job.
If you run on a cluster with drmaa support, an environment variable named `DRMAA_LIBRARY_PATH` must be in the executing shell environment. This variable points to the `.so` file of the DRMAA library.
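For example, on an SGE cluster the variable could be exported before invoking snakemake. The library path below is purely illustrative; the actual location of `libdrmaa.so` depends on your cluster installation:

```shell
# Point snakemake's DRMAA support at the cluster's DRMAA shared library.
# This path is an example; locate libdrmaa.so on your own system.
export DRMAA_LIBRARY_PATH=/usr/lib/gridengine-drmaa/lib/libdrmaa.so
```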
An sge-cluster.yml is bundled with this pipeline in the cluster directory. It is optimized for SGE clusters, where the default vmem limit is 4G. If you run SLURM, or any other cluster system, you will have to write your own cluster yaml file. Please see the snakemake documentation for details on how to do so. Given the provided sge-cluster.yml, activating the cluster mode can be done as follows:
```bash
snakemake -s Snakefile \
    --cluster-config cluster/sge-cluster.yml \
    --drmaa ' -pe <PE_NAME> {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot'
```
Sample names should be unique, and not overlap (such as `sample1` and `sample10`). This is due to the way output files are parsed by multiQC: when sample names overlap, the json output for picard DuplicationMetrics cannot be parsed unambiguously.
In singularity mode, snakemake binds the location of itself in the container. The current working directory is also visible directly in the container.
In many cases, this is not enough, and will result in `FileNotFoundError`s. E.g., suppose you run your pipeline in `/runs`, but your fastq files live in `/fastq` and your reference genome lives in `/genomes`. We would have to bind `/fastq` and `/genomes` in the container.
This can be accomplished with `--singularity-args`, which accepts a simple string of arguments passed to singularity. E.g. in the above example, we could do:
```bash
snakemake -s Snakefile \
    --use-singularity \
    --singularity-args ' --bind /fastq:/fastq --bind /genomes:/genomes'
```
To sum up, a full pipeline run under a cluster would be called as:

```bash
snakemake -s Snakefile \
    --use-singularity \
    --singularity-args ' --bind /some_path:/some_path' \
    --cluster-config cluster/sge-cluster.yml \
    --drmaa ' -pe <PE_NAME> {cluster.threads} -q all.q -l h_vmem={cluster.vmem} -cwd -V -N hutspot' \
    --rerun-incomplete \
    --jobs 200 \
    -w 120 \
    --max-jobs-per-second 30 \
    --restart-times 2 \
    --configfile config.json
```
Below you can see the rule graph of the pipeline. The main variant calling flow is highlighted in red. This only shows dependencies between rules, and not between jobs. The actual job graph is considerably more complex, as nearly all rules are duplicated by sample and some (the scatter jobs) additionally by chunk.
As a rough estimate of the total number of jobs in the pipeline you can use the following formula:
This gives about 12,000 jobs for a 96-sample run with 2 bed files and 100 chunks.
NOTE: the graph will only render if your markdown viewer supports plantuml. Having trouble viewing the graph? See this static SVG instead.
```plantuml
digraph snakemake_dag {
    graph [bgcolor=white, margin=0];
    node [shape=box, style=rounded, fontname=sans, fontsize=10, penwidth=2];
    edge [penwidth=2, color=grey];
    0  [label="all", color="0.30 0.6 0.85", style="rounded"];
    1  [label="multiqc", color="0.60 0.6 0.85", style="rounded"];
    2  [label="merge_stats", color="0.17 0.6 0.85", style="rounded"];
    3  [label="bai", color="0.09 0.6 0.85", style="rounded"];
    4  [label="genotype_gather\nsample: micro", color="0.06 0.6 0.85"];
    5  [label="gvcf_gather\nsample: micro", color="0.32 0.6 0.85"];
    6  [label="fastqc_raw\nsample: micro", color="0.00 0.6 0.85"];
    7  [label="fastqc_merged", color="0.11 0.6 0.85", style="rounded"];
    8  [label="fastqc_postqc", color="0.02 0.6 0.85", style="rounded"];
    9  [label="stats_tsv", color="0.45 0.6 0.85", style="rounded"];
    10 [label="collectstats", color="0.24 0.6 0.85", style="rounded"];
    11 [label="vcfstats\nsample: micro", color="0.52 0.6 0.85"];
    12 [label="markdup", color="0.47 0.6 0.85", style="rounded"];
    13 [label="scatterregions", color="0.56 0.6 0.85", style="rounded"];
    14 [label="merge_r1\nsample: micro", color="0.65 0.6 0.85"];
    15 [label="merge_r2\nsample: micro", color="0.26 0.6 0.85"];
    16 [label="cutadapt", color="0.22 0.6 0.85", style="rounded"];
    17 [label="fqcount_preqc", color="0.37 0.6 0.85", style="rounded"];
    18 [label="fqcount_postqc", color="0.58 0.6 0.85", style="rounded"];
    19 [label="mapped_reads_bases", color="0.43 0.6 0.85", style="rounded"];
    20 [label="unique_reads_bases", color="0.34 0.6 0.85", style="rounded"];
    21 [label="fastqc_stats", color="0.13 0.6 0.85", style="rounded"];
    22 [label="covstats", color="0.39 0.6 0.85", style="rounded"];
    23 [label="align", color="0.49 0.6 0.85", style="rounded"];
    24 [label="create_markdup_tmp", color="0.41 0.6 0.85", style="rounded,dashed"];
    25 [label="sickle", color="0.19 0.6 0.85", style="rounded"];
    26 [label="genome", color="0.62 0.6 0.85", style="rounded"];
    1 -> 0; 2 -> 0; 3 -> 0; 4 -> 0; 5 -> 0; 6 -> 0; 7 -> 0; 8 -> 0;
    9 -> 1;
    10 -> 2; 11 -> 2;
    12 -> 3;
    13 -> 4; 13 -> 5;
    14 -> 7; 15 -> 7;
    16 -> 8;
    2 -> 9;
    17 -> 10; 18 -> 10; 19 -> 10; 20 -> 10; 21 -> 10; 22 -> 10;
    4 -> 11;
    23 -> 12; 24 -> 12;
    25 -> 16;
    14 -> 17; 15 -> 17;
    16 -> 18;
    23 -> 19;
    12 -> 20;
    7 -> 21; 8 -> 21;
    12 -> 22; 26 -> 22;
    16 -> 23; 24 -> 23;
    14 -> 25; 15 -> 25;
}
```
License: AGPL-3.0