DessimozLab/read2tree


a tool for inferring species tree from sequencing reads

License

MIT license


read2tree

read2tree is a software tool for obtaining alignment matrices for tree inference. For this purpose it makes use of the OMA database and a set of reads. Its strength is that it bypasses several standard steps of a conventional analysis: read filtering, assembly, gene prediction, gene annotation, all-vs-all comparison, orthology prediction, alignment and concatenation.

read2tree works in linux with

New release

We are now releasing Read2Tree v2.0.0 with improved speed and logging. The read aligner is now minimap2, and MAFFT and IQ-TREE now use multiple threads. We suggest running read2tree with --debug, which makes later troubleshooting easier. Please note that the arguments have changed slightly in this release (see below for details).

Read2Tree Talk:

You can watch David Dylus's presentation on Read2Tree as part of the SIB in silico talks.

Read2Tree publication

You can cite Read2Tree, published in Nature Biotechnology:

David Dylus, Adrian Altenhoff, Sina Majidian, Fritz J. Sedlazeck & Christophe Dessimoz. Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01753-4.

Installation

There are three ways to install read2tree; you can choose any one of them.

1) Installation from source

To set up read2tree on your local machine from source please follow the instructions below.

First, we need to create a fresh conda environment:

conda create -n r2t python=3.10.8

Prerequisites

The following Python packages are needed: numpy, scipy, cython, lxml, tqdm, pysam, pyparsing, requests, filelock, natsort, pyyaml, biopython, ete3, dendropy.

You can install all of them using:

conda install -c conda-forge biopython numpy Cython ete3 lxml tqdm scipy pyparsing requests natsort pyyaml filelock libdeflate libcurl
conda install -c bioconda dendropy pysam

In addition, you need the software packages mafft (multiple sequence aligner), iqtree (phylogenomic inference), minimap2 (long and short read mapper), and samtools, all of which can be installed using conda. In this version, the --read_type argument accepts any minimap2 options string that defines how reads are aligned to the reference, for example -ax sr, -ax map-hifi or -ax map-ont. You can also pass --threads 40 to set the number of threads used by minimap2.

conda install -c bioconda mafft iqtree minimap2 samtools

Then, you can install the read2tree package after downloading it from this GitHub repo using:

git clone https://github.com/DessimozLab/read2tree.git -b minimap2
cd read2tree
python setup.py install

Run

To run read2tree two things are required as input:

The DNA sequencing reads as FASTQ file(s).
A set of reference orthologous groups, i.e. marker genes. Our wiki page explains how to obtain the marker genes using the OMA browser. You can set the value of Maximum nr of markers to 200 or 400. Once you have downloaded the tgz file, run the following:

tar xvzf marker_genes_*.tgz
ls marker_genes/*.fna | wc -l
cat marker_genes/*.fna > dna_ref.fa
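The concatenation step can also be scripted. The following is a minimal Python sketch, not part of read2tree: the function name `build_dna_reference` is hypothetical, and it assumes the archive has already been extracted into `marker_genes/`. Unlike plain `cat`, it adds a newline after any file that lacks a trailing one, so that FASTA records from consecutive files do not fuse.

```python
from pathlib import Path


def build_dna_reference(marker_dir, out_fasta):
    """Concatenate all *.fna files in marker_dir into out_fasta.

    Hypothetical helper (not a read2tree function).
    Returns the number of .fna files that were concatenated.
    """
    fna_files = sorted(Path(marker_dir).glob("*.fna"))
    with open(out_fasta, "w") as out:
        for fna in fna_files:
            text = fna.read_text()
            out.write(text)
            # Guard against a missing final newline so records do not fuse.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(fna_files)
```

Calling `build_dna_reference("marker_genes", "dna_ref.fa")` returns the number of marker files, which should match the count printed by the `ls ... | wc -l` command.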

output

The output of Read2Tree is the concatenated alignment as a FASTA file in which each record corresponds to one species. We also provide the option --tree for inferring the species tree, using IQ-TREE by default.

Single species mode

read2tree --tree --standalone_path marker_genes/ --reads read_1.fastq read_2.fastq  --output_path output --dna_reference  dna_ref.fa

Multiple species mode

step1

read2tree  --step 1marker  --standalone_path marker_genes  --dna_reference dna_ref.fa --output_path output  --debug

step2

The following could be run in parallel.

read2tree --step 2map --standalone_path marker_genes  --dna_reference dna_ref.fa --reads species1_R1.fastq species1_R2.fastq  --output_path output --debug
read2tree --step 2map --standalone_path marker_genes  --dna_reference dna_ref.fa --reads species2_R1.fastq species2_R2.fastq  --output_path output --debug
read2tree --step 2map --standalone_path marker_genes  --dna_reference dna_ref.fa --reads species3_R1.fastq species3_R2.fastq  --output_path output  --debug
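Because the step 2 commands differ only in their read files, they can be generated rather than typed by hand. The sketch below is an illustration under assumptions: `build_step2_commands` and the `samples` dictionary are hypothetical names, and only the read2tree flags shown in this README are used.

```python
import shlex


def build_step2_commands(samples, marker_dir="marker_genes",
                         dna_ref="dna_ref.fa", output_path="output"):
    """Return one read2tree '--step 2map' argument list per sample.

    samples maps a sample label to its list of FASTQ files.
    Hypothetical helper (not a read2tree function).
    """
    commands = []
    for label in sorted(samples):
        cmd = [
            "read2tree", "--step", "2map",
            "--standalone_path", marker_dir,
            "--dna_reference", dna_ref,
            "--reads", *samples[label],
            "--output_path", output_path,
            "--debug",
        ]
        commands.append(cmd)
    return commands


# Example input; the file names are placeholders.
samples = {
    "species1": ["species1_R1.fastq", "species1_R2.fastq"],
    "species2": ["species2_R1.fastq", "species2_R2.fastq"],
    "species3": ["species3_R1.fastq", "species3_R2.fastq"],
}

# Print each command as a shell-safe line, e.g. for a cluster job array.
for cmd in build_step2_commands(samples):
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run` or written to a job file, so that the mapping jobs run independently of each other.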

step3

read2tree  --step 3combine --standalone_path marker_genes  --dna_reference dna_ref.fa  --output_path output  --tree --debug

Bootstrapping

To obtain bootstrap values, a measure of support for the internal nodes, you can run the following:

thread=20
iqtree -T ${thread} -s output/concat_*_aa.phy  -bb 1000

The .phy file is either concat_sample_aa.phy or concat_merge_aa.phy, corresponding to single- or multi-species mode.
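When the bootstrap run is launched from a script, it can help to resolve the alignment file explicitly instead of relying on the shell wildcard. The following sketch assumes exactly one amino-acid alignment exists in the output folder; the function name `iqtree_bootstrap_command` is hypothetical, and the IQ-TREE flags are the ones used in the command above.

```python
import glob
import os


def iqtree_bootstrap_command(output_path="output", threads=20, replicates=1000):
    """Build the iqtree argument list for the amino-acid alignment.

    Hypothetical helper. Raises ValueError unless exactly one
    concat_*_aa.phy file is found in output_path.
    """
    pattern = os.path.join(glob.escape(output_path), "concat_*_aa.phy")
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise ValueError(
            f"expected one concat_*_aa.phy in {output_path!r}, found {len(matches)}"
        )
    return ["iqtree", "-T", str(threads), "-s", matches[0],
            "-bb", str(replicates)]
```

The returned list can be handed to `subprocess.run`. Failing early when no alignment or several alignments are present avoids running IQ-TREE on an unintended file.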

It is also possible to use trimal for trimming the MSA:

trimal -in <inputfile> -out <outputfile> -automated1

DNA-mode tree inference

For closely related species, the user can infer the tree from the MSA of nucleotide sequences.

thread=20
iqtree -T ${thread} -s output/concat_*_dna.phy

Test example

The goal of this test example is to infer the species tree for Mus musculus using its sequencing reads. You can download the full read data from SRR5171076 using sra-tools. Alternatively, a small read dataset is provided in the tests folder. For this example, we use five species as the reference: Mnemiopsis leidyi, Xenopus laevis, Homo sapiens, Gorilla gorilla, and Rattus norvegicus. Using the OMA browser, we downloaded 20 marker genes of these five species as the reference orthologous groups, located in the folder tests/marker_genes.

cd tests
read2tree --tree --standalone_path marker_genes/ --reads sample_1.fastq sample_2.fastq  --output_path output --dna_reference  dna_ref.fa

Run test example using docker

(to be updated)

docker run --rm -i -v $PWD/tests:/input -v $PWD/tests/:/reads -v $PWD/outside_docker_out:/inside_docker_out -v $PWD/run:/run ${{ env.TEST_TAG }} --tree --standalone_path /input/marker_genes --dna_reference /input/dna_ref.fa --reads /reads/sample_1.fastq --output_path /inside_docker_out/output --debug --threads 1

output files

You can check the inferred species tree for the sample and five reference species in Newick format:

$ cat output/tree_sample_1.nwk
(sample_1:0.0106979811,((HUMAN:0.0041202790,GORGO:0.0272785216):0.0433094119,(XENLA:0.1715052824,MNELE:0.9177670816):0.1141311779):0.0613339433,RATNO:0.0123413734);
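For a quick look at such a tree without additional tooling, the leaf names and their terminal branch lengths can be pulled out with a regular expression. This is a minimal sketch, not a Newick parser: the function name `leaf_branch_lengths` is hypothetical, and it assumes unquoted labels and a tree in which every leaf has a branch length, as in the output above. A full parser such as ete3 or dendropy (both listed in the prerequisites) is more robust.

```python
import re

# A leaf is a label that directly follows "(" or "," and is followed by ":length".
LEAF_PATTERN = re.compile(r"[(,]([^():,;]+):([0-9.eE+-]+)")


def leaf_branch_lengths(newick):
    """Return {leaf name: terminal branch length} for a Newick string.

    Hypothetical helper; internal branches are ignored because they
    follow a closing parenthesis rather than "(" or ",".
    """
    return {name: float(length) for name, length in LEAF_PATTERN.findall(newick)}


tree = ("(sample_1:0.0106979811,((HUMAN:0.0041202790,GORGO:0.0272785216)"
        ":0.0433094119,(XENLA:0.1715052824,MNELE:0.9177670816):0.1141311779)"
        ":0.0613339433,RATNO:0.0123413734);")
lengths = leaf_branch_lengths(tree)
print(sorted(lengths))  # six leaves: the sample plus five reference species
```

In this example the sample and the five reference species are recovered as leaves, and the longest terminal branch belongs to MNELE.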

For the full description of output files please check our wikipage.

Note that we use 5-letter codes as species names, e.g. XENLA = Xenopus laevis. If you want to rerun your analysis from scratch, make sure that you have moved or deleted the output files of the previous run. Otherwise, read2tree continues from the progress of the previous analysis.

For running on clusters, you can run the first step of read2tree such that folders 01, 02 and 03 are computed (this allows for mapping). This can be done using the '--reference' option. Since read2tree re-orders the OGs into the included species, it is possible to split the mapping step per species using multiple threads for the mapper. For this the '--single_mapping' option is available.

Hint: As read2tree exploits the progress package, the user can benefit from continuing unfinished runs. However, if you want to conduct a new analysis with different inputs, you need to remove the output of previous runs or change the output_path.
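One way to follow this hint without deleting earlier results is to rename the old output folder before starting a new analysis. The sketch below is an assumption about a convenient workflow, not a read2tree feature; `set_aside_previous_output` is a hypothetical helper name.

```python
import shutil
import time
from pathlib import Path


def set_aside_previous_output(output_path):
    """Rename an existing output folder so a new analysis starts fresh.

    Hypothetical helper. Returns the new location as a Path,
    or None if there was nothing to move.
    """
    out = Path(output_path)
    if not out.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = out.with_name(f"{out.name}_{stamp}")
    counter = 1
    # Avoid clobbering a backup made within the same second.
    while backup.exists():
        backup = out.with_name(f"{out.name}_{stamp}_{counter}")
        counter += 1
    shutil.move(str(out), str(backup))
    return backup
```

Calling `set_aside_previous_output("output")` before read2tree leaves the previous results under a timestamped name and lets the new run create `output` again.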

Details of arguments

To see the details of the arguments, please take a look at our wiki page.

Possible issues

It seems that older minimap2 versions such as 2.1 do not have the preset option -x sr for short reads. In that case, you might get:

Shell err: b"[E::main] unknown preset 'sr'\n"

Updating minimap2 to version 2.30 should fix the issue.
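The installed version can be checked before a run. The following sketch assumes that `minimap2 --version` prints a string such as `2.24-r1122` on standard output; the function names are hypothetical. Versions are compared as integer tuples rather than floats, so that 2.3 and 2.30 are told apart.

```python
import re
import subprocess


def parse_minimap2_version(text):
    """Turn a version string such as '2.24-r1122' into a tuple like (2, 24)."""
    match = re.match(r"\s*(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised minimap2 version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def installed_minimap2_version(executable="minimap2"):
    """Ask the minimap2 binary on PATH for its version (hypothetical helper)."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True, text=True, check=True,
    )
    return parse_minimap2_version(result.stdout)
```

A script could then compare `installed_minimap2_version()` against a minimum such as `(2, 30)` and stop with a clear message instead of the preset error shown above.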

Installing on macOS sometimes raises this error:

raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8

This can be mitigated using:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Change log

version 2.0.1:
- fixing a few bugs
version 2.0.0:
- pre-release: improve logging and make minimap2 as default aligner
version 1.5:
- using minimap2 as the read mapper in the minimap2 branch
version 0.1.5:
- fix issue with UnknownSeq being removed in Biopython>1.80
- removing unused modeltester wrappers
version 0.1.4:
- allow reference folders not named marker_genes (#12)
- update environment.yml file to contain all dependencies (#16)
- documentation improvements
- CI/CD pipeline
version 0.1.3:
- improvements of documentation
- adding support for docker
- small bugfixes
version 0.1.2: packaging
version 0.1.0: Adding covid analysis
version 0.0: Initial work