ksahlin/isONclust

De novo clustering of long transcript reads into genes

isONclust3 is now available and is much faster and typically more accurate than isONclust. We recommend using isONclust3 instead, particularly if you want to cluster more than 10 million reads.

isONclust

isONclust is a tool for clustering either PacBio Iso-Seq reads or Oxford Nanopore reads into clusters, where each cluster represents all reads that came from a gene. Output is a tsv file with each read assigned to a cluster ID. Detailed information is available in the paper.

isONclust is distributed as a Python package supported on Linux / OSX. As of version 0.0.2, it requires Python >= 3.4 (due to updates in Python's multiprocessing library).

INSTALLATION

Using conda

Conda is the preferred way to install isONclust.

Create and activate a new environment called isonclust

conda create -n isonclust python=3 pip
source activate isonclust

Install isONclust

pip install isONclust

You should now have 'isONclust' installed; try it:

isONclust --help

Each time you start a new session on your server or computer, activate the conda environment "isonclust" before running isONclust:

source activate isonclust

Using pip

To install isONclust, run:

pip install isONclust

pip will install the dependencies automatically for you. pip is Python's official package installer and is included in most Python versions. If you do not have pip, it can be easily installed from here and upgraded with pip install --upgrade pip.

Downloading source from GitHub

Dependencies

Make sure the dependencies listed below are installed. The versions in parentheses are suggested because isONclust has not been tested with earlier versions of these libraries; it may nevertheless work with earlier versions.

parasail
pysam (>= v0.11)

In addition, please make sure you use python version >=3.4. isONclust will not work with python 2.

With these dependencies installed, run:

git clone https://github.com/ksahlin/isONclust.git
cd isONclust
./isONclust

Testing installation

You can verify successful installation by running isONclust on this small dataset. Simply download the test dataset and run:

isONclust --fastq [test/sample_alz_2k.fastq] --outfolder [output path]

USAGE

isONclust can be used with either Iso-Seq or ONT reads. It takes either a fastq file or a ccs.bam file.

Oxford Nanopore reads

isONclust needs a fastq file generated by an Oxford Nanopore basecaller.

isONclust --ont --fastq [reads.fastq] --outfolder [/path/to/output]

The argument --ont simply means --k 13 --w 20. These arguments can be set manually without the --ont flag. Specify the number of cores with --t.
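Before starting a long clustering run, it can help to confirm that the input really is a FASTQ file with a quality string for every read. The following is a minimal sketch, not part of isONclust: the function name check_fastq is hypothetical, and it assumes the common four-lines-per-record FASTQ layout (no line-wrapped sequences).

```python
from itertools import islice


def check_fastq(handle, max_records=1000):
    """Check up to max_records four-line FASTQ records from an open text handle.

    Hypothetical helper, not part of isONclust. Returns the number of records
    checked and raises ValueError on the first malformed record.
    """
    checked = 0
    while checked < max_records:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        # Stop at end of file (also tolerates trailing blank lines).
        if not record or all(not line for line in record):
            break
        if len(record) < 4:
            raise ValueError(f"record {checked + 1} is truncated")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"record {checked + 1} is not in four-line FASTQ layout")
        if len(seq) != len(qual):
            raise ValueError(f"record {checked + 1} has mismatched sequence/quality lengths")
        checked += 1
    return checked
```

Typical use would be `with open("reads.fastq") as fh: check_fastq(fh)`, where reads.fastq stands for your own input file.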

Iso-Seq reads

isONclust works with full-length non-chimeric (flnc) reads that have quality values assigned to bases. The flnc reads with quality values can be generated as follows:

Make sure quality values are output when running the circular consensus calling step (CCS), by running ccs with the parameter --polish.
Run steps 2 and 3 of PacBio's Iso-Seq pipeline (primer removal and extraction of flnc reads) with isoseq3.

Flnc reads can be submitted as either a fastq file or a bam file. A fastq file can be created from a BAM file by running, e.g., bamtools convert -format fastq -in flnc.bam -out flnc.fastq. isONclust is called as follows:

isONclust --isoseq --fastq [reads.fastq] --outfolder [/path/to/output]

isONclust also supports older versions of the isoseq3 pipeline by taking the ccs.bam file together with the flnc.bam file. In this case, isONclust can be run as follows:

isONclust --isoseq --ccs [ccs.bam] --flnc [flnc.bam] --outfolder [/path/to/output]

Here, [ccs.bam] is the file generated from ccs and [flnc.bam] is the file generated from isoseq3 cluster. The argument --isoseq simply means --k 15 --w 50. These arguments can be set manually without the --isoseq flag. Specify the number of cores with --t.
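When isONclust is called from a script, the presets and their manual equivalents can be kept in one place. The sketch below only assembles an argument list from the flags described above (--ont, --isoseq, --k, --w, --t, --fastq, --outfolder); it does not run anything. The helper name build_command and the PRESETS table are hypothetical, and treating a preset and explicit --k/--w values as mutually exclusive is a choice of this sketch, not a documented rule of the tool.

```python
import shlex

# Preset flag -> (k, w), as stated in this README.
PRESETS = {"ont": (13, 20), "isoseq": (15, 50)}


def build_command(fastq, outfolder, preset=None, k=None, w=None, cores=None):
    """Assemble an isONclust argument list (hypothetical helper)."""
    cmd = ["isONclust"]
    if preset is not None:
        if preset not in PRESETS:
            raise ValueError(f"unknown preset: {preset}")
        if k is not None or w is not None:
            raise ValueError("this sketch uses either a preset or explicit k/w, not both")
        cmd.append("--" + preset)
    else:
        if k is not None:
            cmd += ["--k", str(k)]
        if w is not None:
            cmd += ["--w", str(w)]
    cmd += ["--fastq", fastq, "--outfolder", outfolder]
    if cores is not None:
        cmd += ["--t", str(cores)]
    return cmd


# The preset form and its manual equivalent:
print(shlex.join(build_command("reads.fastq", "out", preset="ont", cores=4)))
# isONclust --ont --fastq reads.fastq --outfolder out --t 4
k, w = PRESETS["ont"]
print(shlex.join(build_command("reads.fastq", "out", k=k, w=w)))
# isONclust --k 13 --w 20 --fastq reads.fastq --outfolder out
```

The returned list can be passed to subprocess.run directly, which avoids shell quoting problems with paths that contain spaces.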

Output

Clustering information

The output consists of a tsv file, final_clusters.tsv, in the specified output folder. In this file, the first column is the cluster ID and the second column is the read accession. For example:

0 read_X_acc
0 read_Y_acc
...
n read_Z_acc

If there are n reads, there will be n rows. Some reads might be singletons. The rows are ordered with respect to the size of the cluster (largest first).
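For downstream analysis, the two-column file can be loaded into a mapping from cluster ID to read accessions. This is a minimal sketch, assuming the columns are tab-separated as the tsv extension suggests; the function name read_clusters and the example accessions are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Map each cluster ID to the list of read accessions assigned to it."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        cluster_id, read_acc = row[0], row[1]
        clusters[cluster_id].append(read_acc)
    return dict(clusters)


# Small in-memory example standing in for final_clusters.tsv.
example = io.StringIO("0\tread_a\n0\tread_b\n1\tread_c\n")
clusters = read_clusters(example)
sizes = {cid: len(reads) for cid, reads in clusters.items()}
singletons = [cid for cid, n in sizes.items() if n == 1]
print(sizes)       # {'0': 2, '1': 1}
print(singletons)  # ['1']
```

With a real run, replace the in-memory example with `open("final_clusters.tsv", newline="")` pointing into your output folder.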

Cluster fastq files

You can obtain separate cluster fastq files from the clustering by running

isONclust write_fastq --clusters [/path/to/output/]final_clusters.tsv --fastq [reads.fastq] --outfolder [/path/to/fastq_output] --N 1

CREDITS

Please cite [1] when using isONclust.

[1] Kristoffer Sahlin, Paul Medvedev. De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality Value-Based Algorithm. Journal of Computational Biology 2020, 27:4, 472-484. https://doi.org/10.1089/cmb.2019.0299

Here is an open access version of the paper: bioRxiv link.

Bib record

@article{sahlin2020a,
  author = {Sahlin, Kristoffer and Medvedev, Paul},
  title = {De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality Value-Based Algorithm},
  journal = {Journal of Computational Biology},
  volume = {27},
  number = {4},
  pages = {472-484},
  year = {2020},
  doi = {10.1089/cmb.2019.0299},
  note = {PMID: 32181688},
  URL = {https://doi.org/10.1089/cmb.2019.0299},
  eprint = {https://doi.org/10.1089/cmb.2019.0299},
  abstract = { Long-read sequencing of transcripts with Pacific Biosciences (PacBio) Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (to scale) and makes use of quality values (to handle variable error rates). We test isONclust on three simulated and five biological data sets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches, both in terms of overall accuracy and/or scalability to large data sets. }
}

LICENCE

GPL v3.0, see LICENSE.txt.
