Ananke: Clustering marker gene data by time-series patterns

beiko-lab/ananke

Ananke is a tool to cluster your marker gene data (such as 16S) by time-series abundance patterns. This creates groups of sequences that have shown the same changes in response to the environment. This tool facilitates the generation of the clusters, and the companion ananke-ui application aids in the exploration of the results.

Installation

The following Python packages are required:

  • h5py >= 2.3.1
  • numpy >= 1.6
  • scipy >= 0.16.1
  • scikit-learn >= 0.16
  • pandas >= 0.17

To install Ananke, download the source code, and in the extracted directory run:

python setup.py install

This can be run as root if you want to install Ananke globally.

To install locally (to your user directory, or into an active Conda environment) use the --user flag:

python setup.py install --user

Tutorials

Visit the Wiki for tutorials and examples:https://github.com/beiko-lab/ananke/wiki

Running ananke

Required input data

Ananke requires a minimal set of input data: a FASTA sequence file and a metadata mapping file. Sequences should be quality filtered/trimmed/denoised using your preferred quality control methods before processing with Ananke. The sequence file contains all sequences with the sample ID in the FASTA header, followed by a sequence count. Sequences must be on a single line. For example:

>Sample1_0
ATGCGCATGCTATGCAT
>Sample1_1
ATCGAGCATCGATCGAC
>Sample2_0
AGCGATCGATCGATCGAT
...
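Writing data into this format can be sketched with plain Python. This is a hypothetical helper, not part of Ananke; in particular, the ">SampleID_index" header scheme is an assumption based on the example above, so check it against your own data before relying on it.

```python
# Hypothetical helper (not part of Ananke): write per-sample sequences
# in the single-line FASTA format shown above. The ">SampleID_index"
# header scheme is an assumption inferred from the example.
def write_ananke_fasta(samples, path):
    """samples: dict mapping sample ID -> list of sequence strings."""
    with open(path, "w") as out:
        for sample_id, seqs in samples.items():
            for i, seq in enumerate(seqs):
                out.write(">%s_%d\n%s\n" % (sample_id, i, seq))

# 'ananke_input.fasta' is just an illustrative file name.
write_ananke_fasta(
    {"Sample1": ["ATGCGCATGCTATGCAT", "ATCGAGCATCGATCGAC"],
     "Sample2": ["AGCGATCGATCGATCGAT"]},
    "ananke_input.fasta")
```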

If your FASTA file has sequences split across multiple lines, the SWARM developers provide a small awk script to convert it to the required format:

awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {printf "\n"}' amplicons.fasta > amplicons_linearized.fasta

The metadata file is a tab-separated sheet whose first column is labelled "#SampleID"; one of the remaining columns must contain the time offset. The metadata file from a QIIME analysis is compatible:

#SampleID	time_points
Sample1	0
Sample2	1
Sample3	5
Sample4	10
...

Note that there can be only one sample per time point.
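This constraint can be checked before running Ananke with a few lines of standard-library Python. This is a sketch, not part of Ananke; the file and column names are just the ones from the example above.

```python
import csv

def check_one_sample_per_timepoint(mapping_path, time_column):
    """Raise ValueError if two samples share a time point in a
    QIIME-style tab-separated mapping file."""
    seen = {}  # time offset -> sample ID
    with open(mapping_path) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            t = row[time_column]
            if t in seen:
                raise ValueError("time point %s used by both %s and %s"
                                 % (t, seen[t], row["#SampleID"]))
            seen[t] = row["#SampleID"]
    return seen
```

For example, `check_one_sample_per_timepoint("sequence_metadata.tsv", "time_points")` returns a time-point-to-sample mapping, or raises if two samples collide on a time point.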

Processing pipeline

There are several individual steps in the pipeline, listed below with the Ananke subcommand in brackets.

  1. Tabulate unique sequences, creating an m x n time-series matrix, where m = number of unique sequences and n = number of time points (tabulate)
  2. Filter out low information sequences (filter)
  3. Cluster sequences using short time-series distance and DBSCAN (cluster)
  4. (Optional) Add taxonomic information for the unique sequences (add taxonomy)
  5. (Optional) Add sequence clustering (i.e., 97% OTU) information for the unique sequences (add sequence_clusters)

Step 1: Tabulating sequences

ananke tabulate -i sequence_input -o hdf5_output -f unique_sequence_output -m mapping_file -t column_name:

  • -i: input FASTA file location
  • -o: output HDF5 database file location
  • -f: output unique sequence FASTA file location
  • -m: input metadata file location (tab-separated)
  • -t: name of column that describes time-point offset

Example:

ananke tabulate -i sequences.fasta -o ananke_db.h5 -f seq.unique.fasta -m sequence_metadata.tsv -t time_points

Step 2: Filtering sequences

Sequences can be filtered in one of three ways:

  • proportion: if a sequence accounts for less than threshold % of the data, filter it
  • abundance: if a sequence was seen less than threshold times in total, filter it
  • presence: if a sequence was seen in fewer than threshold % of time-points, filter it
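The three criteria can be illustrated on a toy count matrix (rows = unique sequences, columns = time points). This is plain Python for intuition, not Ananke's actual implementation:

```python
# Toy counts: 3 unique sequences across 4 time points.
counts = [
    [50, 60, 55, 40],   # abundant, present at every time point
    [0,  1,  0,  0],    # rare sequence, seen once
    [5,  0,  6,  0],    # present in half the time points
]
total = sum(sum(row) for row in counts)

def keep_proportion(row, threshold_pct):
    # proportion: keep if the sequence is >= threshold % of all counts
    return 100.0 * sum(row) / total >= threshold_pct

def keep_abundance(row, threshold):
    # abundance: keep if seen at least `threshold` times in total
    return sum(row) >= threshold

def keep_presence(row, threshold_pct):
    # presence: keep if seen in >= threshold % of time points
    present = sum(1 for c in row if c > 0)
    return 100.0 * present / len(row) >= threshold_pct

print([keep_abundance(r, 100) for r in counts])  # → [True, False, False]
print([keep_presence(r, 50) for r in counts])    # → [True, False, True]
```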

Because high-throughput sequencing data sets can be very large, it is a good idea to filter down to fewer than 20,000 unique sequences. Since the software currently calculates a complete m x m distance matrix between sequences, this can take a lot of RAM for large m.

ananke filter -i hdf5_input -o hdf5_output -f filter_type -t filter_threshold:

  • -i: input HDF5 database file location
  • -o: output HDF5 filtered database file location
  • -f: filter type (one of proportion, abundance, or presence (default))
  • -t: threshold value

Example:

# Filter all unique sequences with < 100 occurrences
ananke filter -i ananke_db.h5 -o ananke_db_filtered.h5 -f abundance -t 100

Step 3: Cluster sequences

Sequences are clustered by DBSCAN and the results are placed into the existing HDF5 time-series database. Each data set is unique, and therefore will require clustering over a different range of epsilon parameters. A good bet is to sweep from 0.1 to 100 in steps of 0.1. You should find that the number of clusters (printed to screen during computation) increases to a point, then begins decreasing. This should occur well before eps=100, but the clustering will halt once eps is large enough that everything is clustered into one group. Please note that small step sizes take up a lot of disk space, as the results for every step are written to disk.

ananke cluster -i hdf5_input -n num_threads -l eps_lower_limit -u eps_upper_limit -s eps_step_size:

  • -i: input HDF5 database file location
  • -n: number of threads available for computing short time-series distances
  • -l: eps clustering parameter minimum value (must be strictly greater than 0)
  • -u: eps clustering parameter maximum value (clustering will stop before this if the data merge into a single cluster)
  • -s: eps clustering parameter step size
  • -d: distance measure. The default is short time-series (sts). Other options are euclidean, cityblock, cosine, l1, l2, and manhattan. Note: sts is the only measure that is computed with multithreading; other measures may be slower to compute.
  • --clr: boolean flag. Indicates whether to use the centred log-ratio transform with count-zero multiplicative zero imputation within time-points/samples. If this flag is not given, the counts within time-points are converted to simple proportions.
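The default sts measure compares the slopes between consecutive time points rather than the raw values, so two sequences with the same temporal shape are close even if their magnitudes differ. A minimal pure-Python sketch, assuming the standard short time-series distance definition (Ananke's own implementation is multithreaded and operates on the whole matrix):

```python
import math

def sts_distance(x, y, t):
    """Short time-series distance: root sum of squared differences
    between the slopes of x and y over each time interval.

    x, y: abundance series; t: time offsets (all equal length).
    """
    total = 0.0
    for k in range(len(t) - 1):
        dt = t[k + 1] - t[k]
        slope_x = (x[k + 1] - x[k]) / dt
        slope_y = (y[k + 1] - y[k]) / dt
        total += (slope_x - slope_y) ** 2
    return math.sqrt(total)

# Same shape, different baseline: slopes match, so the distance is 0.
print(sts_distance([0, 1, 2], [5, 6, 7], [0, 1, 2]))  # → 0.0
```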

Example:

ananke cluster -i ananke_db_filtered.h5 -n 8 -l 0.1 -u 100 -s 0.1
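The --clr option can be illustrated as follows. This is a sketch assuming the standard multiplicative zero replacement and centred log-ratio definitions; the pseudocount choice (`delta`) is illustrative and Ananke's exact value may differ.

```python
import math

def clr_with_zero_imputation(counts, delta=0.5):
    """CLR-transform one time point's counts, imputing zeros
    multiplicatively so logarithms are defined everywhere."""
    total = sum(counts)
    props = [c / total for c in counts]
    zeros = sum(1 for p in props if p == 0)
    # Multiplicative replacement: zeros become a small pseudo-proportion,
    # and non-zero parts shrink so the composition still sums to 1.
    d = delta / total
    adjusted = [d if p == 0 else p * (1 - d * zeros) for p in props]
    log_vals = [math.log(a) for a in adjusted]
    mean_log = sum(log_vals) / len(log_vals)
    # CLR: subtract the mean log, so the result sums to zero.
    return [v - mean_log for v in log_vals]
```

By construction the transformed values of a sample sum to zero, which is the usual sanity check for a CLR implementation.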

(Optional) Step 4 and 5: Add additional information to database

You can add taxonomic classification and sequence identity cluster information to the time-series database file, and these will appear in the ananke-ui interface for contrasting with the time-series clusters.

You can use the unique sequences file from the tabulate step as the basis for this information. The labels in this file will match the labels used in the time-series database, allowing the information to be merged properly. The unique sequence file has size information in the FASTA header to enable it to work with clustering software like USEARCH/UPARSE. This must be removed from the resulting taxonomy and OTU cluster files, e.g. with sed:

sed -i 's/;size=[0-9]*;//g' taxonomy.txt

A more thorough example of how to do this is given in the wiki.

ananke add taxonomy -i hdf5_input -d taxonomy_data
ananke add sequence_clusters -i hdf5_input -d sequence_cluster_data:

  • -i: input HDF5 database file location
  • -d: data file (e.g., tax_assignments.txt for taxonomy, or seq_otus.txt for sequence clusters)

Examples:

ananke add taxonomy -i ananke_db_filtered.h5 -d seq.unique_tax_assignments.txt
ananke add sequence_clusters -i ananke_db_filtered.h5 -d seq_otus.txt

If you already have an OTU clustering scheme you would like to use, it must be translated so that the labels are consistent. For this, you need the sequence FASTA file that you used as input to your clustering pipeline, and the resulting OTU cluster file (in QIIME's tab-separated format; see the Output section at http://qiime.org/scripts/pick_otus.html). The output of this script can be used as the input to ananke add sequence_clusters.

ananke translate_clusters -i fasta_input -c clusters_input -o clusters_output

  • -i: input FASTA file used to generate your clusters
  • -c: input clustering scheme in tab-separated format (same format as QIIME outputs)
  • -o: desired output cluster file name

Example:

ananke translate_clusters -i seq.fasta -c seq_otus.txt -o seq_otus_translated.txt
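Conceptually, the translation maps each label in your cluster file to the label Ananke assigned to the identical sequence, by matching on the sequence content itself. The following is a sketch of that idea only (hypothetical helper, not Ananke's actual code; it assumes both FASTA files have already been parsed into label-to-sequence dicts):

```python
# Hypothetical sketch of the idea behind translate_clusters: re-label a
# cluster file by matching identical sequences between two FASTA files.
def translate_clusters(your_records, ananke_records, clusters):
    """
    your_records / ananke_records: dicts of FASTA label -> sequence.
    clusters: dict of cluster ID -> list of your labels.
    Returns clusters re-labelled with Ananke's unique-sequence labels.
    """
    seq_to_ananke = {seq: label for label, seq in ananke_records.items()}
    translated = {}
    for cluster_id, labels in clusters.items():
        new_labels = []
        for label in labels:
            seq = your_records[label]
            if seq in seq_to_ananke:  # sequence survived dereplication/filtering
                new_labels.append(seq_to_ananke[seq])
        translated[cluster_id] = new_labels
    return translated
```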
