vezzi/NouGAT

De novo assembly is one of the most difficult tasks in genomics today. New sequencing technologies allow the research community to sequence an ever-increasing number of genomes. However, the large number of tools, the difficulty of evaluating results, and the need for substantial computational resources mean that de novo genome assembly remains a holy grail.

NouGAT, the NGI open universal Genome Assembly Toolbox, is a pipeline that automates, to a certain extent, some of the most common and complex steps in the analysis of a de novo assembly project. The pipeline aims to generate a first draft assembly that can serve as a first step towards a better assembly, or from which to draw first biological conclusions.

The pipeline is structured into three sub-pipelines, plus a number of convenience scripts intended only for NGI users running their analyses in the UPPMAX environment.

Installation

NouGAT is a Python package. In practice it is a wrapper around a number of tools used for de novo assembly analysis. We recommend creating a Conda virtual environment for installing NouGAT:

    # First we create a new virtual python environment using conda, get it here:
    # http://docs.continuum.io/anaconda/install.html#linux-install
    conda create -n DeNovoPipeline anaconda
    # Install NouGAT in the new environment
    source activate DeNovoPipeline
    git clone https://github.com/SciLifeLab/NouGAT.git
    cd NouGAT && python setup.py develop

NGI / UPPMAX users only: See the separate README on how to install the NGI-specific scripts.

Third party tools

NouGAT is a wrapper around several tools. The tools need to be installed, and it is highly recommended to have them available on the PATH. The following table shows the supported tools and their versions:

| Tool | Version | Function |
|------|---------|----------|
| FastQC | 0.11.2 | Read pre-processing |
| Trimmomatic | 0.30 | Read pre-processing |
| bwa | > 0.7.4 | Read alignment |
| picard-tools | < 1.124 | Read alignment |
| SAMtools | 1.1.19 | Read alignment |
| KmerGenie | 1.6741 | Read pre-processing |
| ABySS | 1.3.5 | Assembler |
| ALLPATHS-LG | >= 47918 | Assembler |
| CABOG | 8.1 | Assembler |
| MaSuRCA | 2.3.2 | Assembler |
| SOAPdenovo | 2.04-r240 | Assembler |
| SPAdes | >= 3.0.0 | Assembler |
| Trinity | >= 2.0.2 | Assembler |
| FRC_align | https://github.com/vezzi/FRC_align | Assembly evaluation |
| qaTools | https://github.com/vezzi/qaTools | Assembly evaluation |

Note! Not all tools need to be installed, only those that one plans to use. To use tools other than the ones in the list, a new wrapper function needs to be written.

Also note! The versions given are the ones tested by the authors; other versions might work if the command line and/or outputs have not changed significantly (e.g. Trinity pre-2.0.2).

Configuration files

The pipeline needs two configuration files to be run:

  • Global configuration: this configuration file contains a description of the sub-pipelines and links to the tools. The repository contains predefined global configurations for milou, nestor, amanita, and picea (n.b. some of the paths might still point to my home directory)
  • Sample configuration: this describes the samples to be assembled/analysed

Global configuration file

This file has two main sections: “Pipelines” and “Tools”. The former describes the implemented pipelines and lists which tools can be used in each. If pipeline A can use tools T1, T2, and T3, then only these tools can be specified when calling pipeline A; moreover, tools T1, T2, and T3 need to be properly installed.

The “Tools” section contains an entry for each tool. For each tool, the bin field must contain the path to the tool (n.b. some tools require the directory where the tool is installed, others require the binary, and others require that all the commands are present on the PATH). In options, command-line parameters can be given (where supported); see the documentation for each individual tool.

    Pipelines:
        QCcontrol: ["fastqc", "abyss", "trimmomatic", "align"]
    Tools:
        fastqc:
            bin: /sw/apps/bioinfo/fastqc/0.10.1/milou/fastqc
            options: [--threads, "16", --outdir, fastqc]
        abyss:
            ...

For a complete example of this file, see the NGI-specific configuration.
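The rule that a pipeline may only be called with the tools listed for it can be sketched as a small check. This is only an illustrative sketch, not NouGAT's own code: the function name `disallowed_tools` is hypothetical, and the dictionary is assumed to mirror the YAML layout shown above once parsed.

```python
# Hypothetical sketch: check requested tools against the "Pipelines" section
# of a parsed global configuration. Not part of NouGAT itself.

def disallowed_tools(global_config, pipeline, requested):
    """Return the requested tools that the given pipeline does not list."""
    pipelines = global_config["Pipelines"]
    if pipeline not in pipelines:
        raise ValueError(f"pipeline {pipeline!r} is not in the global configuration")
    allowed = pipelines[pipeline]
    return [tool for tool in requested if tool not in allowed]


# Assumed parsed form of the global configuration excerpt.
global_config = {
    "Pipelines": {"QCcontrol": ["fastqc", "abyss", "trimmomatic", "align"]},
    "Tools": {
        "fastqc": {
            "bin": "/sw/apps/bioinfo/fastqc/0.10.1/milou/fastqc",
            "options": ["--threads", "16", "--outdir", "fastqc"],
        },
    },
}

print(disallowed_tools(global_config, "QCcontrol", ["fastqc", "spades"]))
```

An empty result means every requested tool is permitted for that pipeline; whether each tool is actually installed still has to be verified separately.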

Sample configuration file

The sample configuration file specifies which pipeline needs to be run (one of those present in the global configuration file), which tools are run, and in which order. Keep in mind that the order in which tools are run can substantially change the results. Tools cannot be run more than once (i.e., if a tool is listed more than once, it will be run only the first time).

The sample configuration file contains a number of fields (not all mandatory), plus a section describing the library or libraries to be analysed. In more detail:
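The "run only the first time" behaviour amounts to an order-preserving de-duplication of the tools list. A minimal sketch, assuming the tools field has already been parsed into a Python list (the function name `effective_tool_order` is hypothetical and not taken from NouGAT):

```python
# Hypothetical sketch: the order in which tools would effectively run
# when a tool name appears more than once in the "tools" field.

def effective_tool_order(tools):
    """Keep the first occurrence of each tool and preserve the given order."""
    seen = set()
    ordered = []
    for tool in tools:
        if tool not in seen:
            seen.add(tool)
            ordered.append(tool)
    return ordered


print(effective_tool_order(["trimmomatic", "fastqc", "trimmomatic", "abyss"]))
# prints ['trimmomatic', 'fastqc', 'abyss']
```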

    # This field specifies which pipeline needs to be run. Only implemented pipelines can be run.
    pipeline: PIPELINE_TO_BE_RUN
    # Which tools need to be used, and in which order.
    tools: [T1,T2,T3,...]
    # Prefix to append to all output files
    output: OUTPUT_NAME
    # Minimum contig length to be considered. This parameter is used in several steps in order to
    # discard short contigs (default 1000)
    minCtgLength: MIN_CTG_LGTH
    # Expected genome size. Used by some assemblers, if unsure, give a rough estimate.
    genomeSize: EXP_GENOME_SIZE
    # Number of threads to be used in parallel steps.
    threads: NUM_THREADS
    # In case a tool needs a predefined kmer size (Rule-of-thumb, try 61 first)
    kmer: KMER
    # This field is mandatory when an alignment needs to be executed, e.g, the tools in the evaluation
    # pipeline use the .bam alignments to work.
    reference: PATH_TO_REFERENCE

The next section of the sample configuration file is 'libraries', which contains a description of the libraries and the paths to the fastq files. Each entry (lib1, lib2, etc.) contains the following mandatory fields:

    libN:
        pair1: PATH_TO_PAIR_1          # Path to first pair
        pair2: PATH_TO_PAIR_2          # Path to second pair. Leave this blank in case of single-ended lib.
        orientation: PAIR_ORIENTATION  # Pair read orientation (innie or outtie)
        insert: INSERT_SIZE            # insert size (expected)
        std: STANDARD_DEVIATION        # standard deviation of the insert size (expected)
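A library entry can be sanity-checked before a run. The sketch below is an assumption about how such a check might look and is not NouGAT's own validation; the names `MANDATORY_FIELDS`, `VALID_ORIENTATIONS` and `library_errors` are hypothetical. The field names and orientation values are the ones listed above, and pair2 counts as present even when left blank.

```python
# Hypothetical sketch: report problems in one parsed library entry.

MANDATORY_FIELDS = ("pair1", "pair2", "orientation", "insert", "std")
VALID_ORIENTATIONS = ("innie", "outtie")


def library_errors(name, entry):
    """Return a list of human-readable problems found in a library entry."""
    errors = [
        f"{name}: missing field '{field}'"
        for field in MANDATORY_FIELDS
        if field not in entry
    ]
    orientation = entry.get("orientation")
    if orientation is not None and orientation not in VALID_ORIENTATIONS:
        errors.append(f"{name}: unknown orientation {orientation!r}")
    return errors


# pair2 is None here, as it would be for a blank value in a single-ended library.
entry = {"pair1": "read_1.fastq.gz", "pair2": None,
         "orientation": "innie", "insert": 180}
print(library_errors("lib1", entry))
# prints ["lib1: missing field 'std'"]
```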

It is important to note that (despite the name) lib1, lib2, … identify different sequencing runs. In reality the concept of “library” is represented by the insert size, i.e., library entries libi and libj with i not equal to j are considered to be part of the same library if and only if the insert size is the same.
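The insert-size rule can be illustrated by grouping entries on their insert field. This is a sketch under the assumption that the libraries section has been parsed into a dictionary; `group_by_insert_size` is a hypothetical helper, not a NouGAT function.

```python
# Hypothetical sketch: entries with the same insert size form one library.
from collections import defaultdict


def group_by_insert_size(libraries):
    """Map each insert size to the names of the entries that share it."""
    groups = defaultdict(list)
    for name, fields in libraries.items():
        groups[fields["insert"]].append(name)
    return dict(groups)


libraries = {
    "lib1": {"insert": 180, "orientation": "innie"},
    "lib2": {"insert": 180, "orientation": "innie"},
    "lib3": {"insert": 3000, "orientation": "outtie"},
}
print(group_by_insert_size(libraries))
# prints {180: ['lib1', 'lib2'], 3000: ['lib3']}
```

Here lib1 and lib2 are two sequencing runs of the same library, while lib3 is a separate library.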

Example file

Sample configuration for the assemble pipeline. Note that no reference is specified, as it is not needed. This is an example of a typical NGI Stockholm de novo project (J.Dohe) that is split into two projects (J.Dohe_14_01 and J.Dohe_14_02), the former being the paired-end library and the latter the mate pair library. This is often called the “allpaths recipe”, as it is the assembly strategy suggested by BROAD for their ALLPATHS-LG assembler.

    pipeline: assemble
    tools: [allpaths]
    output: J.Dohe
    minCtgLength: 2000
    genomeSize: 450000000
    threads: 16
    kmer: 54
    reference:
    libraries:
        lib1:
            pair1: /proj/a2010002/INBOX/J.Dohe_14_01/P101/HISEQ_RUN/read_1.fastq.gz
            pair2: /proj/a2010002/INBOX/J.Dohe_14_01/P101/HISEQ_RUN/read_2.fastq.gz
            orientation: innie
            insert: 180
            std: 30
        lib2:
            pair1: /proj/a2010002/INBOX/J.Dohe_14_02/P101/HISEQ_RUN/read_1.fastq.gz
            pair2: /proj/a2010002/INBOX/J.Dohe_14_02/P101/HISEQ_RUN/read_2.fastq.gz
            orientation: outtie
            insert: 3000
            std: 500

The pipelines

NouGAT currently implements 3 different pipelines named:

  • QCcontrol
  • assemble
  • evaluate

There is also a fourth pipeline, align, which is a sort of dummy pipeline that aligns reads against a reference. The align pipeline can be run as part of other pipelines, which makes it easier to specify custom align pipelines, e.g. one using something other than BWA.

Quality control

This pipeline pre-processes the sequenced reads, i.e. it assesses the quality of the sequencing experiment and informs the decisions to be made for the assembly step. An example would be to run the following tools in succession:

  • Trimmomatic, to remove adapter read-through and low quality ends.
  • FastQC, to give easy to read quality checks for Q-values, read lengths, etc.
  • ABySS, to generate k-mer counts from the reads. The pipeline will automatically plot these.

It is also useful to run align if a reference assembly for the organism exists. The pipeline will pull useful statistics from the aligned .bam files using SAMtools and Picard, e.g. estimated insert size and duplication rate metrics.

Assemble

The assembly pipeline will sequentially execute the assembly tools specified in the sample config file. Building on our earlier J.Dohe example, suppose we additionally want to run ABySS and SOAPdenovo before ALLPATHS-LG:

    pipeline: assemble
    tools: [abyss, soapdenovo, allpaths]
    output: J.Dohe
    ...

Evaluate

This pipeline is for evaluating the relative quality of the assemblies produced. This includes the standard contiguity metrics (N50, N80, max contig length, etc.) and estimates of the level of mis-assembly. It is recommended to run these tools in the following order:

  • Align
  • qaTools, to visualise ctg. length vs. coverage vs. GC content, etc.
  • FRC, feature response curves to estimate the relative levels of mis-assembly.

Note! FRC only accepts two libraries, so we recommend choosing one (overlapping) paired-end library and one long-insert (mate pair) library.
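For readers unfamiliar with the contiguity metrics mentioned for this pipeline, N50 and N80 follow the standard definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches 50% or 80% of the total assembly length. The sketch below implements that general definition and is not taken from NouGAT's evaluation code; the function name `nx` is hypothetical, and the `min_length` argument is an assumption meant to mirror the minCtgLength setting.

```python
# Hypothetical sketch of the standard Nx contiguity metric (N50, N80, ...).

def nx(contig_lengths, percent=50, min_length=0):
    """Return the Nx value of a list of contig lengths, or 0 if none qualify."""
    lengths = sorted(
        (length for length in contig_lengths if length >= min_length),
        reverse=True,
    )
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return 0


contigs = [100, 200, 300, 400, 500]
print(nx(contigs, 50))  # prints 400
print(nx(contigs, 80))  # prints 300
```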

NGI specific scripts

Described here.

How to get started

Coming soon.
