bioinfodlsu/phage-host-prediction

Published in PLOS ONE. Phage-host interaction prediction tool that uses protein language models to represent the receptor-binding proteins of phages. It presents improvements over using handcrafted sequence properties and eliminates the need to manually extract and select features from phage sequences

License

MIT license


PHIEmbed (Phage-Host Interaction Prediction with Protein Embeddings)

PHIEmbed is a phage-host interaction prediction tool that uses protein language models to represent the receptor-binding proteins of phages. It presents improvements over using handcrafted (manually feature-engineered) sequence properties and eliminates the need to manually extract and select features from phage sequences.

Paper: https://doi.org/10.1371/journal.pone.0289030

If you find our work useful, please consider citing:

@article{10.1371/journal.pone.0289030,
    doi = {10.1371/journal.pone.0289030},
    author = {Gonzales, Mark Edward M. AND Ureta, Jennifer C. AND Shrestha, Anish M. S.},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {Protein embeddings improve phage-host interaction prediction},
    year = {2023},
    month = {07},
    volume = {18},
    url = {https://doi.org/10.1371/journal.pone.0289030},
    pages = {1-22},
    number = {7}
}

You can also find PHIEmbed on bio.tools.

📰 News

13 Jan 2025 - We released a new phage-host interaction prediction tool, PHIStruct (published in Bioinformatics), which incorporates protein structure information and works especially well for phages with receptor-binding proteins that have low sequence similarity to those of known phages. Learn more about it here.
24 Apr 2024 - We added scripts to simplify running and training our tool. Instructions here.
23 Feb 2024 - We presented our work at the eAsia AMR Workshop 2024 held virtually and in person in Tokyo, Japan, and attended by antimicrobial resistance (AMR) researchers from Thailand, USA, Australia, Japan, and the Philippines. Slides here.
01 Dec 2023 - Presenting this work, the lead author (Mark Edward M. Gonzales) won 2nd Prize at the 2023 Magsaysay Future Engineers/Technologists Award. This award is conferred by the National Academy of Science and Technology, the highest recognition and scientific advisory body of the Philippines, to recognize outstanding research outputs on engineering and technology at the collegiate level. Presentation here (29:35–39:51) and slides here.
24 Jul 2023 - Our paper is now published in PLOS ONE.


🚀 Installation & Usage

Operating System: Windows, Linux, or macOS

Clone the repository:

git clone https://github.com/bioinfodlsu/phage-host-prediction
cd phage-host-prediction

Create a virtual environment with all the necessary dependencies installed via Conda (we recommend using Miniconda):

conda env create -f environment.yaml

Activate this environment by running:

conda activate PHIEmbed

Running PHIEmbed

python3 phiembed.py --input <input_fasta> --model <model_joblib> --output <results_dir>

Replace <input_fasta> with the path to the FASTA file containing the receptor-binding protein sequences. A sample FASTA file is provided here.
Replace <model_joblib> with the path to the trained model (recognized format: joblib or compressed joblib, framework: scikit-learn). Download our trained model from this link. There is no need to uncompress it, but doing so speeds up loading the model at the cost of additional storage. Refer to this guide for the list of accepted compressed formats.
Replace <results_dir> with the path to the directory to which the results of running PHIEmbed will be written. The results of running PHIEmbed on the sample FASTA file are provided here.
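
Before running the tool, it can help to confirm that the input FASTA file parses as expected. The following is a minimal, standard-library-only sketch; the function name `read_fasta` and the sample records are hypothetical and not part of PHIEmbed, which may parse its input differently.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}.

    Illustrative helper only; not part of phiembed.py.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            # Use the first whitespace-delimited token of the header as the ID
            current_id = header[0]
            if current_id in records:
                raise ValueError(f"duplicate record ID: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequences may span several lines; concatenate them
            records[current_id] += line
    return records


# Hypothetical two-record input (the sequences are made up)
example = [">rbp_1 putative tail fiber", "MKTAYIAK", "QRQISFVK", ">rbp_2", "MSDNE"]
proteins = read_fasta(example)
```

To check a real file, pass an open file handle instead of the list, for example `read_fasta(open("sample.fasta"))`.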

The results for each protein are written to a CSV file (without a header row). Each row contains two comma-separated values: a host genus and the corresponding prediction score (class probability). The rows are sorted in order of decreasing prediction score. Hence, the first row pertains to the top-ranked prediction.
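
As an illustration of this output format, the sketch below reads one header-less results file and picks out the top-ranked host genus. The helper name `read_predictions` and the genus names and scores are made up for the example; only the two-column, sorted layout comes from the description above.

```python
import csv
import io
from typing import List, Tuple


def read_predictions(handle) -> List[Tuple[str, float]]:
    """Read (host_genus, score) rows from a header-less results CSV."""
    return [(row[0], float(row[1])) for row in csv.reader(handle) if row]


# Made-up contents standing in for one protein's results file
sample = io.StringIO("Escherichia,0.62\nSalmonella,0.21\nKlebsiella,0.17\n")
predictions = read_predictions(sample)

# The first row is the top-ranked prediction
top_genus, top_score = predictions[0]

# Sanity check: rows should already be in order of decreasing score
is_sorted = all(a[1] >= b[1] for a, b in zip(predictions, predictions[1:]))
```

For a file on disk, replace the `io.StringIO` object with `open(path, newline="")`.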

Under the hood, this script first converts each sequence into a protein embedding using ProtT5 (the top-performing protein language model based on our experiments) and then passes the embedding to a random forest classifier trained on our entire dataset. If your machine has a GPU, it will automatically be used to accelerate the protein embedding generation step.

Note: Running this script for the first time may take a few extra minutes since it involves downloading a model (ProtT5, around 2 GB) from Hugging Face.

Training PHIEmbed

python3 train.py --input <training_dataset>

Replace <training_dataset> with the path to the training dataset. A sample can be downloaded here.
The number of threads to be used for training can be specified using --threads. By default, it is set to -1 (that is, all threads are to be used).

The training dataset should be formatted as a CSV file (without a header row) where each row corresponds to a training sample. The first column is for the protein IDs, the second column is for the host genera, and the next 1,024 columns are for the components of the ProtT5 embeddings.
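
A quick way to catch formatting problems before training is to check each row against this layout. The sketch below is an illustration only: `parse_training_row` and `EMBEDDING_DIM` are hypothetical names, and the row is fabricated placeholder data rather than a real embedding.

```python
import csv
import io
from typing import List, Tuple

EMBEDDING_DIM = 1024  # number of ProtT5 embedding components per row


def parse_training_row(row: List[str]) -> Tuple[str, str, List[float]]:
    """Split one CSV row into (protein_id, host_genus, embedding)."""
    expected = 2 + EMBEDDING_DIM
    if len(row) != expected:
        raise ValueError(f"expected {expected} columns, found {len(row)}")
    protein_id, host_genus = row[0], row[1]
    # float() raises ValueError if a component is not numeric
    embedding = [float(value) for value in row[2:]]
    return protein_id, host_genus, embedding


# Fabricated row: an ID, a genus, and 1,024 placeholder zeros
fake_row = ["protein_001", "Escherichia"] + ["0.0"] * EMBEDDING_DIM
handle = io.StringIO(",".join(fake_row) + "\n")
protein_id, host_genus, embedding = parse_training_row(next(csv.reader(handle)))
```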

This script will output a gzip-compressed, serialized version of the trained model with filename phiembed_trained.joblib.gz.


📚 Description

Motivation: With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model.

Method: In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information.

Results: We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
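
For readers unfamiliar with the metric, the sketch below computes a support-weighted F1 score in plain Python and shows one way a confidence threshold could be applied. It is not the evaluation code used in the paper (see the notebooks in the experiments folder for that). The `apply_threshold` rule, which relabels low-confidence predictions with a catch-all `"others"` label, is an assumption made for illustration, and the labels are toy data.

```python
from collections import Counter
from typing import Dict, List


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Average of per-class F1 scores, weighted by each class's true support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (count / total) * f1
    return score


def apply_threshold(probabilities: Dict[str, float], threshold: float,
                    fallback: str = "others") -> str:
    """Hypothetical rule: keep the top class only if its score meets the threshold."""
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] >= threshold else fallback


# Toy labels (genus names replaced with letters)
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]
f1 = weighted_f1(y_true, y_pred)
```

In this toy case, class A has an F1 of 2/3 and class B an F1 of 4/5, each with half of the samples, so the weighted score is 11/15.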


🧪 Reproducing Our Results

Project Structure

The experiments folder contains the files and scripts for reproducing our results. Note that additional (large) files have to be downloaded (or generated) following the instructions in the Jupyter notebooks.


Directories

| Directory | Description |
| --- | --- |
| `inphared` | Contains the list of phage-host pairs in TSV format. The GenBank and FASTA files with the genomic and protein sequences of the phages, the embeddings of the receptor-binding proteins, and the phage-host-features CSV files should also be saved in this folder |
| `preprocessing` | Contains text files related to the preprocessing of host information and the selection of annotated RBPs |
| `rbp_prediction` | Contains the JSON file of the trained XGBoost model proposed by Boeckaerts et al. (2022) for the computational prediction of receptor-binding proteins. Downloaded from this repository (under the MIT License) |
| `temp` | Contains intermediate output files during preprocessing and performance evaluation |


Jupyter Notebooks

Each notebook provides detailed instructions related to the required and output files, including the download links and where to save them.

| Notebook | Description | Required Files | Output Files |
| --- | --- | --- | --- |
| `1. Sequence Preprocessing.ipynb` | Preprocessing of host information and selection of annotated receptor-binding proteins | GenomesDB (partial; complete populating it following the instructions in the notebook), GenBank file of phage genomes and/or proteomes | FASTA files of genomic and protein sequences |
| `2. Exploratory Data Analysis.ipynb` | Exploratory data analysis | Protein embeddings (Part 1 and Part 2), Phage-host-features CSV files | – |
| `3. RBP Computational Prediction.ipynb` | Computational prediction of receptor-binding proteins | Protein embeddings (Part 1 and Part 2) | Protein embeddings (Part 1 and Part 2) |
| `3.1. RBP FASTA Generation.ipynb` | Generation of the FASTA files containing the RBP protein sequences | Protein embeddings (Part 1 and Part 2) | FASTA files of genomic and protein sequences |
| `4. Protein Embedding Generation.ipynb` | Generation of protein embeddings | FASTA files of genomic and protein sequences | Protein embeddings (Part 1 and Part 2) |
| `5. Data Consolidation.ipynb` | Generation of phage-host-features CSV files | FASTA files of genomic and protein sequences, Protein embeddings (Part 1 and Part 2) | Phage-host-features CSV files |
| `6. Classifier Building & Evaluation.ipynb` | Construction of phage-host interaction model and performance evaluation | Phage-host-features CSV files | Trained models |
| `6.1. Additional Model Evaluation (Specificity + PR Curve).ipynb` | Addition of metrics for model evaluation | Phage-host-features CSV files | – |
| `7. Visualization.ipynb` | Plotting of t-SNE and UMAP projections | Phage-host-features CSV files | – |


Python Scripts

| Script | Description |
| --- | --- |
| `ClassificationUtil.py` | Contains the utility functions for the generation of the phage-host-features CSV files, construction of the phage-host interaction model, and performance evaluation |
| `ConstantsUtil.py` | Contains the constants used in the notebooks and scripts |
| `EDAUtil.py` | Contains the utility functions for exploratory data analysis |
| `RBPPredictionUtil.py` | Contains the utility functions for the computational prediction of receptor-binding proteins |
| `SequenceParsingUtil.py` | Contains the utility functions for preprocessing host information and selecting annotated receptor-binding proteins |
| `boeckaerts.py` | Contains the utility functions written by Boeckaerts et al. (2021) for running their phage-host interaction prediction tool (with which we benchmarked our model). Downloaded from this repository (under the MIT License) |


Folder Structure

Once you have cloned this repository and finished downloading (or generating) all the additional required files following the instructions in the Jupyter notebooks, your folder structure should be similar to the one below:

phage-host-prediction (root)
- datasets
  - inphared
    - inphared
      - GenomesDB (Download partial. Complete populating following the instructions here)
        - AB002632
        - ...
- experiments
  - inphared
    - data (Download)
      - rbp.csv
      - rbp_embeddings_esm.csv
      - ...
    - embeddings (Download Part 1 and Part 2)
      - esm
      - esm1b
      - ...
    - fasta (Download)
      - complete
      - hypothetical
      - nucleotide
      - rbp
    - 16Sep2022_data_excluding_refseq.tsv
    - 16Sep2022_phages_downloaded_from_genbank.gb (Download)
  - models (Download)
    - boeckaerts.joblib
    - esm.joblib
    - ...
  - preprocessing
  - rbp_prediction
  - temp
  - 1. Sequence Preprocessing.ipynb
  - ...
  - ClassificationUtil.py
  - ...
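
Once the downloads are in place, a short script can confirm that the main directories from the layout above exist. This is an optional sketch, not part of the repository: `missing_dirs` and `EXPECTED_DIRS` are hypothetical names, and the list covers only the `experiments` subdirectories shown in the tree.

```python
from pathlib import Path
from typing import List

# Subset of the layout listed above, relative to the repository root
EXPECTED_DIRS = [
    "experiments/inphared/data",
    "experiments/inphared/embeddings",
    "experiments/inphared/fasta",
    "experiments/models",
    "experiments/preprocessing",
    "experiments/rbp_prediction",
    "experiments/temp",
]


def missing_dirs(root: Path, expected: List[str] = EXPECTED_DIRS) -> List[str]:
    """Return the expected directories that do not exist under root."""
    return [rel for rel in expected if not (root / rel).is_dir()]


# Run from the repository root; an empty list means nothing is missing
missing = missing_dirs(Path("."))
```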


Dependencies

Operating System: Windows, Linux, or macOS

Create a virtual environment with all the necessary dependencies installed via Conda (we recommend using Miniconda):

conda env create -f environment_experiments.yaml

Activate this environment by running:

conda activate PHIEmbed-experiments

Thanks to Dr. Paul K. Yu for sharing his environment configuration.

Note on running the notebook for protein embedding generation:

The notebook 4. Protein Embedding Generation.ipynb has a dependency (bio_embeddings) that requires it to be run on Unix or a Unix-like operating system. If you are using Windows, consider using Windows Subsystem for Linux (WSL) or a virtual machine. We did not include bio_embeddings in environment_experiments.yaml to maintain cross-platform compatibility; you have to install it following the instructions here.

Moreover, generating protein embeddings should ideally be done on a machine with a GPU. The largest (and best-performing) protein language model that we used, ProtT5, consumes 5.9 GB of GPU memory. If your local machine does not have a GPU or if its GPU has insufficient memory, we recommend using a cloud GPU platform.

UPDATE (12 Jun 2023): In May 2023, Google Colab upgraded its Python runtime, resulting in compatibility issues with bio_embeddings. An alternative cloud GPU platform is Paperspace, which provides a PyTorch 1.12 runtime that is compatible with bio_embeddings.

Complete list of Python libraries and modules used in this project (excluding those that are part of the Python Standard Library):

| Library/Module | Description | License |
| --- | --- | --- |
| `pyyaml` | Supports standard YAML tags and provides Python-specific tags that allow to represent an arbitrary Python object | MIT License |
| `jsonnet` | Domain-specific language for JSON | Apache License 2.0 |
| `protobuf` | Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data | BSD 3-Clause "New" or "Revised" License |
| `regex` | Provides additional functionality over the standard `re` module while maintaining backwards-compatibility | Apache License 2.0 |
| `nltk` | Provides interfaces to corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning | Apache License 2.0 |
| `biopython` | Provides tools for computational molecular biology | Biopython License Agreement, BSD 3-Clause License |
| `ete3` | Provides functions for automated manipulation, analysis, and visualization of phylogenetic trees | GNU General Public License v3.0 |
| `pandas` | Provides functions for data analysis and manipulation | BSD 3-Clause "New" or "Revised" License |
| `numpy` | Provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays | BSD 3-Clause "New" or "Revised" License |
| `scipy` | Provides efficient numerical routines, such as those for numerical integration, interpolation, optimization, linear algebra, and statistics | BSD 3-Clause "New" or "Revised" License |
| `scikit-learn` | Provides efficient tools for predictive data analysis | BSD 3-Clause "New" or "Revised" License |
| `xgboost` | Implements machine learning algorithms under the gradient boosting framework | Apache License 2.0 |
| `imbalanced-learn` | Provides tools when dealing with classification with imbalanced classes | MIT License |
| `joblib` | Provides tools for lightweight pipelining in Python | BSD 3-Clause "New" or "Revised" License |
| `cudatoolkit` | Parallel computing platform and programming model for general computing on GPUs | NVIDIA Software License |
| `bio_embeddings` | Provides an interface for the use of language model-based biological sequence representations for transfer-learning | MIT License |
| `torch` | Optimized tensor library for deep learning using GPUs and CPUs | BSD 3-Clause "New" or "Revised" License |
| `transformers` | Provides pretrained models to perform tasks on different modalities such as text, vision, and audio | Apache License 2.0 |
| `sentencepiece` | Unsupervised text tokenizer and detokenizer mainly for neural network-based text generation systems | Apache License 2.0 |
| `matplotlib` | Provides functions for creating static, animated, and interactive visualizations | Matplotlib License (BSD-Compatible) |
| `umap-learn` | Implements uniform manifold approximation and projection, a dimensionality reduction technique | BSD 3-Clause "New" or "Revised" License |

The descriptions are taken from their respective websites.


💻 Authors

Mark Edward M. Gonzales
gonzales.markedward@gmail.com
Ms. Jennifer C. Ureta
jennifer.ureta@gmail.com
Dr. Anish M.S. Shrestha
anish.shrestha@dlsu.edu.ph

This is a research project under the Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Philippines.

This research was partly funded by the Department of Science and Technology – Philippine Council for Health Research and Development (DOST-PCHRD) under the e-Asia JRP 2021 Alternative therapeutics to tackle AMR pathogens (ATTACK-AMR) program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
