ggonnella/fastsubtreesPublic

NotificationsYou must be signed in to change notification settings
Fork0
Star4

Python library and command line script , for fast extraction of subtrees of fairly large trees, consisting of millions of nodes, such as the NCBI taxonomy tree.

License

ISC license

4 stars 0 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 483 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
bin		bin
data		data
docker		docker
docs		docs
fastsubtrees		fastsubtrees
genomes_attributes_viewer		genomes_attributes_viewer
ntdownload		ntdownload
ntmirror		ntmirror
ntsubtree		ntsubtree
tests		tests
.flake8		.flake8
.gitignore		.gitignore
.readthedocs.yml		.readthedocs.yml
AUTHORS.txt		AUTHORS.txt
CHANGELOG.txt		CHANGELOG.txt
CITATION.cff		CITATION.cff
CONTRIBUTORS.txt		CONTRIBUTORS.txt
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Repository files navigation

Fastsubtrees

Fastsubtrees is a Python library and a command line script, for handling fairlylarge trees (in the order of magnitude of millions nodes), in particularallowing the fast extraction of any subtree.The main application domain offastsubtrees is working with the NCBI taxonomytree, however the code is implemented in a generic way, so that otherapplications are possible.

The library functionality can be accessed both from inside Python codeand from the provided command line toolfastsubtrees.

Introduction

For the use offastsubtrees, nodes must be uniquely identified by non-negative IDs.Furthermore, the space of the IDs must be compact (i.e. the maximum ID should not bemuch larger than the number of IDs).

The first step when usingfastsubtrees is to construct a tree representation.The operation requires a source of IDs of elements and their parents, which can bea tabular file, or any Python function yielding the IDs.

This operation just takes a few seconds, for a tree with million nodes, such as the NCBI taxonomy tree.It must be done only once, if a tree does not change, since the resulting datais stored to file.

The IDs of the NCBI taxonomy tree fullfill the conditions stated above. However, the librarycan be used for any tree. A way to use the library with IDs which do not fullfill the conditions,it to map them to an ID space which does, and store the original IDs as an attribute.

Besides the IDs, a tree can contain further information, e.g. integers, floats or otherdata, here called attributes, associated to the nodes. Each node can contain zero, one or more valuesfor an attribute. To add values for an attribute, a tabular file or another datasource (a Python function) is selected.

The data for any subtree can then be easily and efficently queried; thereby the node IDs and/otherselected attributes can be retrieved.

The tree representation is dynamic, i.e. both the tree topology and the attribute values can beedited and changed.

Working with the library

Installation

The package can be installed usingpip install fastsubtrees.

Command line interface

The command line toolfastsubtrees allows constructing and modifying a tree(subcommandtree), adding and editing attributes (subcommandattribute)and performing a subtree query (subcommandquery).

The command line interface is further described in theCLI manual.

CLI example: working with the NCBI taxonomy tree

The example below uses thefastsubtrees command, as well as thentdownload library(installed as a dependency, bypip) for obtaining the NCBI taxonomy data.

ntdownload ntdumps                                     # download NCBI taxonomy datafastsubtrees tree nt.tree --ncbi ntdumps/nodes.dmp -f  # create the treefastsubtrees query nt.tree 562                         # query node 562# attributesATTRTAB=data/accession_taxid_attribute.tsv.gz          # data fileTAXID=2; GENOME_SIZE=3; GC_CONTENT=4                   # column numbers, 1-basedfastsubtrees attribute nt.tree genome_size $ATTRTAB -e $TAXID -v $GENOME_SIZE -t intfastsubtrees attribute nt.tree GC_content $ATTRTAB -e $TAXID -v $GC_CONTENT -t floatfastsubtrees query nt.tree 562 genome_size GC_content  # query including attributes# taxonomy namesntnames ntdumps >| names.tsv                           # prepare data from names dumpfastsubtrees attribute nt.tree taxname names.tsv       # add names as attributefastsubtrees query nt.tree 562 taxname genome_size     # query including taxa names

Using NtSubtree

The packagentsubtree (installable bypip) simplifies working with the NCBI taxonomy even more.Tree and the taxonomic names tables are automatically created and stored in a central location.

# first run after installing automatically downloads and constructs the treentsubtree query 562               # taxonomic names displayed alongside the IDsntsubtree query -n "Escherichia"  # Query by taxonomic name# attributesATTRTAB=data/accession_taxid_attribute.tsv.gz          # data fileTAXID=2; GENOME_SIZE=3; GC_CONTENT=4                   # column numbersntsubtree attribute genome_size $ATTRTAB -e $TAXID -v $GENOME_SIZEntsubtree attribute GC_content $ATTRTAB -e $TAXID -v $GC_CONTENTntsubtree query -n "Escherichia" genome_size GC_content# check if a newer version of the taxonomy data is available# and update the tree if necessary, keeping the attribute values:ntsubtree update

API

The library functionality can be also directly accessed in Python code usingthe API, which is documented in theAPI manual.

API example: working with the NCBI taxonomy tree

The example below uses thefastsubtrees command, as well as thentdownload library(installed as a dependency, bypip) for obtaining the NCBI taxonomy data.

# download the NCBI taxonomy datafromntdownloadimportDownloaderd=Downloader("ntdumpsdir")has_downloaded=d.run()fromfastsubtreesimportTreeinfile="ntdumpsdir/nodes.dmp"tree=Tree.construct_from_ncbi_dump(infile)# create the treeresults=tree.subtree_ids(562)# retrieve subtree IDsattrtab="data/accession_taxid_attribute.tsv.gz"# data filetaxid_col=1;genome_size_col=2;gc_content_col=3# column numbers, 0-basedtree.to_file("nt.tree")tree.create_attribute_from_tabular("genome_size",attrtab,elem_field_num=taxid_col,attr_field_num=genome_size_col,casting_fn=int)tree.create_attribute_from_tabular("GC_content",attrtab,elem_field_num=taxid_col,attr_field_num=gc_content_col,casting_fn=float)results=tree.subtree_info(562, ["genome_size","GC_content"])# taxonomy namesfromntdownloadimportyield_scientific_names_from_dumpasgeneratortree.create_attribute("taxname",generator("ntdumpsdir"))results=tree.subtree_info(562, ["taxname","genome_size"])

Using NtSubtree

The packagentsubtree (installable bypip) simplifies working with the NCBI taxonomy even more.Tree and the taxonomic names tables are automatically created and stored in a central location.The first time the library is included these operations are done automatically.

importntsubtreetree=ntsubtree.get_tree()results=tree.subtree_ids(562)taxid=ntsubtree.search_name("Escherichia")results=tree.subtree_info(taxid, ["taxname"])attrtab="data/accession_taxid_attribute.tsv.gz"# data filetaxid_col=1;genome_size_col=2;gc_content_col=3# column numbers, 0-basedtree.create_attribute_from_tabular("genome_size",attrtab,elem_field_num=taxid_col,attr_field_num=genome_size_col,casting_fn=int)tree.create_attribute_from_tabular("GC_content",attrtab,elem_field_num=taxid_col,attr_field_num=gc_content_col,casting_fn=float)results=tree.subtree_info(562, ["genome_size","GC_content"])# check if a newer version of the taxonomy data is available# and update the tree if necessary, keeping the attribute values:ntsubtree.update()

Docker

To try or test the package, it is possible to usefastsubtreesby employing the Docker image defined inDockerfile.This does not require any external database installation and configuration.

Example of the Docker command line:

# create a Docker imagedocker build --tag "fastsubtrees" .# create a container and run itdocker run -p 8050:8050 --detach --name fastsubtreesC fastsubtrees# or, if it was already created and stopped, restart it using:# docker start fastsubtreesC# run the testsdocker exec fastsubtreesC tests# run benchmarksdocker exec fastsubtreesC benchmarks# run the example applicationdocker exec fastsubtreesC start-example-app# now open it in the browser at https://0.0.0.0:8050

Tests

To run the test suite, you can usepytest (ormake tests).The tests include tests offastsubtrees and of the sub-packagentmirror.The latter are partly dependent on a database installation and configurationwhich must be given inntmirror/tests/config.yaml;database-dependent tests are skipped if this configuration file is not provided.

The entire test suite can be also run from the Docker container,without further configuration, see above theDocker section.

Benchmarks

Benchmarks can be run using the shell scripts provided underbenchmarks.These require data, which is downloaded from NCBI taxonomy andsome pre-computed example data which is provided in thedata subdirectory(genome sizes and GC content).

The benchmarks can be convienently run from the Docker container, withoutrequiring a database installation and setup, see above theDocker section.

Example application: Genome attributes viewer

An interactive web application based onfastsubtrees was developed usingdash. It allows to graphically display the distribution of values ofattributes in subtrees of the NCBI taxonomic tree.It is a separate Python package, which canbe installed usingpip, and depends onfastsubtrees.

It can also be installed using the Docker image offastsubtrees (see above in theDocker section).

For more information see also thegenomes-attributes-viewer/README.md file.

Local installation and startup

To application can be installed usingpip install genomes_attributes_vieweror from thegenomes_attributes_viewer directory of thefastsubtreesrepository.

To start the application, use thegenomes-attributes-viewer.The first time this command is run, the application data are downloaded andprepared, taking a few seconds. Startup on subsequentstarts does not require these operations and is thus faster.

Other subpackages

NtSubtree

NtSubtree is a library which automatically downloads the NCBI taxonomydump and constructs thefastsubtrees data for it. It allows to easilykeep the data up-to-date. It is a separate Python package, which canbe installed usingpip, and depends onfastsubtrees.

Thequery command of the NtSubtree CLI tool automaticallydisplay also taxonomic names, alongside the IDs in query and allow toperform queries by taxonomic name.

For more information see also thentsubtree/README.md file.

ntdownload

When working with the NCBI taxonomy database, a local copy of the NCBI taxonomydump can be obtained and kept up-to-date using thentdownload package, whichis located in the directoryntdownload. It is a separatePython package, which can be installed usingpip, independentlyfromfastsubtrees.

Please refer to the user manual ofntdownload located underntdownload/README.mdfor more information.

ntmirror

A downloaded NCBI taxonomy database dump can be loaded toa local SQL database, using the packagentmirror, which is locatedin the directoryntmirror.It is a separate Python package, which canbe installed usingpip, independently fromfastsubtrees.

It contains also a script to extract subtreesfrom the local database mirror using hierarchical SQL queries.

Please refer to the user manual ofntmirror located underntmirror/README.mdfor more information.

Internals

For achieving an efficient running time and memory use, the nodes of the treeare represented compactly in deep-first traversal order.Subtrees are then extracted in O(s) time, where s is the size of the extractedsubtree (i.e. not depending on the size of the whole tree).

The IDs must notnecessarily be all consecutive (i.e. some "holes" may be present), but thelargest node ID (idmax) should not be much larger than the total number ofnodes, because the memory consumption is inO(idmax).

For each attribute defined in a tree, a file is created, where the attributevalues are stored. The attributes are also stored in the same deep-first traversalorder as the tree IDs.

Tree construction algorithm

The tree construction algorithm used here is the following,where the input data consists of 2-tuples(element_id, parent_id)and the maximum node ID m is not much larger than the number of IDs n.

iteration over the input data to construct a tableP of parents by ID,i.e.P[element_id]=parent_id ifelement_id is in the tree,andP[element_id]=UNDEF if not, where UNDEF is a special value.This requiresO(n) stepsfor reading the IDs andO(m) steps for writing either the ID or theUNDEFvalue toP; sincem>=n, the total time is inO(m).
iteration over tableP to construct a tableS of subtree sizesby ID; for each element the tree is climbed to the root, to add theelement to the counts of each ancestor. Thisoperation requiresO(n*d) time, whered is the height of the node,which is in average case much lower thanm andd=m is the worst case.
iteration over each node ID to construct the listD, consisting of the depth first order ofthe nodes, and the tableC of the coordinates of all nodes in the tree data, by id.For this operation, first the root is added toD andC, thenfor each other nodex inP, the tree is climbed and nodes added to a stack until the next not-yet-addedancestor is found. The position where to add it this node is computed by the nextfree position in the subtree of its parent (which must have been already added,by definition, thus the next free position in its subtree is known). After this,the next stack element is added, untilx is added.Although this operation also requires climbing the tree, it takes in totalO(n) time,since each node is added only once.

Parallelizing the tree construction

Currently the slowest step of the construction, detailed in the previoussection, is the second, i.e. the computation ofS.Since each node must be count in the subtree size of all its ancestors,there is no easy way to reduce the time fromO(n*h).

To parallelize this step, one divides the parents table intot slices,and assign each to a different sub-process (not thread, because of the GIL).Each sub-process would then count the subtree sizes in the slice only.A version implemented with a shared table and a lock was too slow,since access to the table was concurrent among the sub-processes.In the current version, instead, each sub-process makes a own subtree sizesS' table. The sub-processesS' tables are summed up after completion forobtaining theS table.

This option can be activated in the CLI using the--processes P option,or in the API setting thenprocesses argument ofTree.construct andrelated methods. Benchmarks show that the parallel version did not significantlyimprove the performance on constructing the NCBI taxonomy tree, likelybecause of the overhead of process starting, arrayS' initializationand summing up of allS' toS after completion.

Community guidelines

Contributions to the software are welcome. Please clone this repositoryand send a pull request on Github, to let the changes be integrated inthe original repository.

In case of bugs and issues, please report them through the Github Issues pageof the repository.

Documentation

The complete documentation of Fastsubtrees is available on ReadTheDocsathttps://fastsubtrees.readthedocs.io/ in website andPDF format.

Licence

All code of Fastsubtrees is released under the ISC license.(see LICENSE file).It is functionally equivalent to a two-term BSD copyright withlanguage removed that is made unnecessary by the Berne convention.Seehttp://openbsd.org/policy.html for more information on copyrights.

Acknowledgements

This software has been originally created in context of the DFG project GO 3192/1-1“Automated characterization of microbial genomes and metagenomes by collection and verification of association rules”.The funders had no role in study design, data collection and analysis.

About

Python library and command line script , for fast extraction of subtrees of fairly large trees, consisting of millions of nodes, such as the NCBI taxonomy tree.

Releases

31tags

Packages

No packages published

Movatterモバイル変換

License

ggonnella/fastsubtrees

Folders and files

Latest commit

History

Repository files navigation

Fastsubtrees

Introduction

Working with the library

Installation

Command line interface

CLI example: working with the NCBI taxonomy tree

Using NtSubtree

API

API example: working with the NCBI taxonomy tree

Using NtSubtree

Docker

Tests

Benchmarks

Example application: Genome attributes viewer

Local installation and startup

Other subpackages

NtSubtree

ntdownload

ntmirror

Internals

Tree construction algorithm

Parallelizing the tree construction

Community guidelines

Documentation

Licence

Acknowledgements

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Languages

Packages