DessimozLab/foldtree2
FoldTree2 is a Python package and toolkit for inferring phylogenetic trees from protein structures using maximum likelihood methods. It provides tools for converting protein structure files (PDBs) into graph representations, deriving structural alignments, and building phylogenetic trees based on structural data.
- PDB to Graph Conversion: Convert protein structures into graph-based representations suitable for machine learning and phylogenetic analysis.
- Custom Substitution Matrices: Generate and use structure-based substitution matrices for alignments.
- Maximum Likelihood Tree Inference: Build phylogenetic trees from structural alignments using maximum likelihood approaches.
- Flexible Pipeline: Modular scripts for each step: graph creation, encoding, alignment, and tree inference.
First, create the conda environment:
    conda env create --name foldtree2 --file=foldtree2.yml
    conda activate foldtree2
Then install the project with pip:
    pip install .
This will install all required dependencies as specified in pyproject.toml and setup.py.
FoldTree2 provides several command-line tools that are automatically installed and available system-wide:
- foldtree2 / ft2treebuilder: Main phylogenetic tree inference pipeline
- pdbs-to-graphs: Convert PDB files to graph representations
- makesubmat: Generate structure-based substitution matrices
- raxml-ng: Maximum likelihood phylogenetic inference (bundled RAxML-NG)
- mad: Minimal Ancestor Deviation tree rooting
- hex2maffttext / maffttext2hex: MAFFT format conversion utilities
All tools include help documentation accessible with the --help flag.
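After installation, it can be useful to confirm that the tools listed above are actually reachable from the active environment. The following is a minimal sketch using only the Python standard library; the tool names are copied from the list above, while `FOLDTREE2_TOOLS` and `find_tools` are hypothetical helper names and not part of FoldTree2.

```python
import shutil

# Tool names as listed in this README. Whether each one resolves depends on
# your environment (for example, whether the conda env is activated).
FOLDTREE2_TOOLS = [
    "foldtree2", "ft2treebuilder", "pdbs-to-graphs", "makesubmat",
    "raxml-ng", "mad", "hex2maffttext", "maffttext2hex",
]


def find_tools(names=FOLDTREE2_TOOLS, which=shutil.which):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:16s} {path or 'NOT FOUND'}")
```

Any tool reported as not found usually points to an inactive environment or an incomplete install rather than a problem with the tool itself.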
For most users, FoldTree2 provides pretrained models that can be used directly to infer phylogenetic trees from protein structures.
Build phylogenetic trees from a folder of PDB structures using pretrained models:
    foldtree2 --model mergeddecoder_foldtree2_test \
      --structures <YOURSTRUCTUREFOLDER> \
      --outdir <RESULTSFOLDER>
This single command will:
- Convert PDB files to graph representations
- Use pretrained models to encode structural features
- Generate structure-based substitution matrices
- Create structural alignments
- Infer a maximum likelihood phylogenetic tree
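For scripted runs, the command above can be wrapped in a few lines of Python. This is a sketch under stated assumptions: it uses only the three flags shown in this README (`--model`, `--structures`, `--outdir`), it assumes input structures carry a `.pdb` extension, and it creates the output folder up front, which the pipeline may or may not require. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_foldtree2_command(model, structures, outdir):
    """Return the argv list for the foldtree2 command shown in this README."""
    return ["foldtree2", "--model", str(model),
            "--structures", str(structures), "--outdir", str(outdir)]


def count_pdbs(structures):
    """Count files ending in .pdb (case-insensitive) directly inside the folder."""
    return sum(1 for p in Path(structures).iterdir()
               if p.is_file() and p.suffix.lower() == ".pdb")


def run_foldtree2(model, structures, outdir):
    """Run the pipeline, refusing to start on a folder with no .pdb files."""
    if count_pdbs(structures) == 0:
        raise ValueError(f"no .pdb files found in {structures}")
    # Assumption: creating the output folder beforehand is harmless.
    Path(outdir).mkdir(parents=True, exist_ok=True)
    cmd = build_foldtree2_command(model, structures, outdir)
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with folder names that contain spaces.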
- mergeddecoder_foldtree2_test: General-purpose model for diverse protein structures
- small: Lightweight model for smaller datasets
- Additional models may be available in the models/ directory
The pipeline generates several output files in your results directory:
- Phylogenetic tree: .tre files in Newick format
- Alignments: .aln files showing structural alignments
- Substitution matrices: Custom matrices based on structural similarity
- Log files: Detailed information about the inference process
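The outputs listed above can be gathered programmatically once a run finishes. The sketch below relies only on the `.tre` and `.aln` extensions named in this README and searches the results folder recursively, since the exact layout inside it is not specified here. `collect_outputs` and `looks_like_newick` are hypothetical helpers, and the Newick check is deliberately crude: it would be fooled by quoted labels that contain parentheses.

```python
from pathlib import Path


def collect_outputs(results_dir):
    """Group result files by the extensions named in this README."""
    root = Path(results_dir)
    return {
        "trees": sorted(root.rglob("*.tre")),
        "alignments": sorted(root.rglob("*.aln")),
    }


def looks_like_newick(text):
    """Cheap sanity check: trailing semicolon and equal counts of ( and )."""
    text = text.strip()
    return text.endswith(";") and text.count("(") == text.count(")")


if __name__ == "__main__":
    outputs = collect_outputs("results")  # hypothetical results folder
    for tree_path in outputs["trees"]:
        ok = looks_like_newick(tree_path.read_text())
        print(tree_path, "looks like Newick" if ok else "unexpected content")
```

For real analysis, a dedicated phylogenetics library is a better choice for parsing the tree than a string check like this one.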
For advanced users who want to train their own models or work with specialized datasets, FoldTree2 provides a complete training pipeline.
Convert your PDB files to a graph HDF5 dataset suitable for training:
    pdbs-to-graphs <input_pdb_dir> <training_graphs.h5> --aapropcsv config/aaindex1.csv
FoldTree2 provides several training scripts with different features:
    python learn_monodecoder.py \
      --dataset <training_graphs.h5> \
      --modelname <my_custom_model> \
      --epochs 100 \
      --batch-size 20 \
      --hidden-size 256 \
      --embedding-dim 128 \
      --outdir ./models/
See the complete list of options with --help.
For advanced features like distributed training, automatic checkpointing, and logging:
    python learn_lightning.py \
      --dataset <training_graphs.h5> \
      --modelname <my_lightning_model> \
      --epochs 100 \
      --batch-size 20 \
      --learning-rate 1e-4 \
      --outdir ./models/ \
      --clip-grad
See the complete list of options with --help.
- --dataset: Path to your HDF5 graph dataset
- --modelname: Name for your trained model
- --epochs: Number of training epochs (default: 100)
- --batch-size: Training batch size (default: 20)
- --hidden-size: Hidden layer dimensions (default: 256)
- --embedding-dim: Embedding dimensions (default: 128)
- --learning-rate: Learning rate (default: 1e-4)
- --clip-grad: Enable gradient clipping for stability
Create structure-based substitution matrices using your trained model:
    makesubmat \
      --modelname <my_custom_model> \
      --modeldir ./models/ \
      --datadir <data_dir> \
      --outdir_base <results_dir> \
      --dataset <input_graphs.h5> \
      --encode_alns
This script has utilities to download structures from the AFDB cluster database, align clusters as reference alignments using Foldseek, encode structures and derive substitution matrices.
See the complete list of options with --help.
Once trained, use your custom model in the main pipeline:
    foldtree2 --model <my_custom_model> \
      --structures <YOURSTRUCTUREFOLDER> \
      --outdir <RESULTSFOLDER>
- GPU Acceleration: Training is significantly faster with CUDA-enabled GPUs
- Dataset Size: Larger, more diverse datasets generally produce better models
- Hyperparameter Tuning: Experiment with different learning rates, batch sizes, and architectures
- Monitoring: Use TensorBoard logs to monitor training progress
- Checkpointing: Save model checkpoints regularly to resume training if interrupted
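As one way to act on the hyperparameter tuning tip, the sketch below generates one `learn_lightning.py` command per combination of learning rate and batch size. It uses only the flags shown in this README; the `sweep_lr..._bs...` model naming scheme and the `sweep_commands` helper are assumptions made for illustration, and the commands are printed rather than executed.

```python
import itertools
import shlex


def sweep_commands(dataset, learning_rates, batch_sizes,
                   epochs=100, outdir="./models/"):
    """Yield one learn_lightning.py argv per (learning rate, batch size) pair."""
    for lr, bs in itertools.product(learning_rates, batch_sizes):
        modelname = f"sweep_lr{lr:g}_bs{bs}"  # hypothetical naming scheme
        yield ["python", "learn_lightning.py",
               "--dataset", str(dataset),
               "--modelname", modelname,
               "--epochs", str(epochs),
               "--batch-size", str(bs),
               "--learning-rate", f"{lr:g}",
               "--outdir", outdir,
               "--clip-grad"]


if __name__ == "__main__":
    for argv in sweep_commands("training_graphs.h5", [1e-4, 5e-4], [10, 20]):
        # Print a shell-safe command line for review or for a job scheduler.
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Giving each run a distinct model name keeps the trained models from overwriting one another in the shared output folder.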
- Python 3.7+
- See pyproject.toml or setup.py for a full list of dependencies.
MIT License (see LICENSE.txt)
Dave Moi (dmoi@unil.ch)
For more details, see the source code and scripts in the repository.
