ProCyon: A multimodal foundation model for protein phenotypes
ProCyon is an open-source model for predicting protein phenotypes across scales. This repository provides the official implementation of the model as described in our overview page and our paper. Our associated HuggingFace collection containing model weights and datasets can be found at the following links (a download sketch follows the list):
- Dataset: ProCyon-Instruct
- Full model: ProCyon-Full
- Benchmarking model: ProCyon-Split
- Binding prediction model: ProCyon-Bind
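
If you prefer a Python API to the `git clone` commands used in the setup section below, the same artifacts can be fetched with `huggingface_hub`. This is a minimal sketch, not part of the official setup: the repo IDs come from the collection links above, and the `local_dir` paths are placeholders you should adjust.

```python
# Minimal sketch: fetch ProCyon artifacts via the Hugging Face Hub API.
# Repo IDs follow the collection links above; local_dir paths are placeholders.
from huggingface_hub import snapshot_download

# ProCyon-Instruct dataset (large download; check your available disk space)
snapshot_download(
    repo_id="mims-harvard/ProCyon-Instruct",
    repo_type="dataset",
    local_dir="/path/to/data/ProCyon-Instruct",
)

# ProCyon-Full model weights (using them also requires LLaMA-3 access; see Requirements)
snapshot_download(
    repo_id="mims-harvard/ProCyon-Full",
    local_dir="/path/to/models/ProCyon-Full",
)
```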
Requirements:
- CUDA toolkit, particularly `nvcc`
- Sign up for Hugging Face permissions for LLaMA-3 at the Meta-Llama-3-8B model page (https://huggingface.co/meta-llama/Meta-Llama-3-8B). You'll need this to use ProCyon-Full and ProCyon-Bind.
We recommend installing with `uv`, but installation can also be done via `pip` alone. The `procyon` package, used to interact with pre-trained models or train new models, can be installed via:
```bash
cd /path/to/ProCyon

# RECOMMENDED: use uv to install. Two options depending on whether
# you want to use the default .venv virtual env that uv will create

# OPTION 1: let uv create and manage the virtual environment, requires
# uv to already be installed
uv sync --extra build
uv sync --extra build --extra compile
uv pip install -e .
source .venv/bin/activate

# OPTION 2: create virtual environment with choice of name and path
python3 -m venv ./procyon_venv
source ./procyon_venv/bin/activate
python3 -m pip install uv
uv pip install -r pyproject.toml --extra build
uv pip install -r pyproject.toml --extra build --extra compile
uv pip install -e .

# OR if omitting uv
python3 -m pip install -e .
```
Installation with `uv` should take less than 10 minutes, depending on the speed of your internet connection for downloading packages.
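
To confirm the editable install worked, a quick import check from the activated environment is enough:

```python
# Quick post-install check: `procyon` should import without errors from the
# active virtual environment.
import procyon

print("procyon imported from:", procyon.__file__)
```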
In addition to the package code, ProCyon also requires pre-trained weights for associated models (e.g. Llama-3, ESM2) as well as access to the ProCyon-Instruct dataset. You'll need to request access to the LLaMA-3 model through its model page at https://huggingface.co/meta-llama/Meta-Llama-3-8B. These dependencies will all be stored in a single directory, which we denote `DATA_DIR`.
```bash
DATA_DIR=/path/to/data
mkdir $DATA_DIR
cd $DATA_DIR

# Clone ProCyon-Instruct dataset from HuggingFace
git clone git@hf.co:datasets/mims-harvard/ProCyon-Instruct

# Clone model weights for associated Llama models from HuggingFace
# Llama-3-8b for ProCyon-Full
cd /path/to/llama3/
# Ensure you've signed up for LLaMA-3 access
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B

# Llama-2-7b for ProCyon-Split
cd ../llama-2-7b-hf
git clone git@hf.co:meta-llama/Llama-2-7b-hf

# Add a `.env` file which the `procyon` package will use to find `DATA_DIR`
# and the Llama weights
cd /path/to/ProCyon
echo "DATA_DIR=\"$DATA_DIR\"" > .env
echo "HOME_DIR=\"$(pwd)\"" >> .env
echo "LLAMA3_PATH=/path/to/llama3/Meta-Llama-3-8B" >> .env
```
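
As an optional sanity check (a sketch, not part of the official setup), you can confirm that the values written to `.env` resolve to existing directories; this assumes `python-dotenv` is installed and that you run it from the repository root where `.env` was created.

```python
# Optional sanity check: confirm the .env entries point at existing paths.
# Assumes python-dotenv is installed and that this is run from the ProCyon
# repository root, where the .env file was written.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

data_dir = os.environ["DATA_DIR"]
print("DATA_DIR:", data_dir)
print("ProCyon-Instruct cloned:",
      os.path.isdir(os.path.join(data_dir, "ProCyon-Instruct")))

llama3_path = os.environ.get("LLAMA3_PATH")
print("LLAMA3_PATH present:",
      llama3_path is not None and os.path.isdir(llama3_path))
```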
Version note: We are aware of a bug where having `transformers` > 4.31.0 changes generated model outputs. Please ensure your `transformers` version is set to 4.31.0 (as in the environment requirements) for ProCyon inference.
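
A quick check like the following (a simple sketch, not part of the package) catches an incompatible `transformers` version before you run inference:

```python
# Guard against the known generation bug: ProCyon inference expects
# transformers==4.31.0.
import transformers

assert transformers.__version__ == "4.31.0", (
    f"expected transformers 4.31.0, found {transformers.__version__}"
)
```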
For the core capabilities of ProCyon models, please see the provided demo notebooks. Both examples should run in less than 5 minutes, depending on the speed of your GPU.
To see how to perform benchmarking runs comparing the performance of ProCyon models to various other baselines and models, please see the example configs and scripts or the evaluation README.
For details on how to reproduce the various experiments and results in our manuscript, please see the reproducibility README.
For details on training a ProCyon model and example scripts, please see the training README.
If you use ProCyon in your work, please cite our preprint:

```bibtex
@article{Queen2024.12.10.627665,
  author = {Queen, Owen and Huang, Yepeng and Calef, Robert and Giunchiglia, Valentina and Chen, Tianlong and Dasoulas, George and Tai, LeAnn and Ektefaie, Yasha and Noori, Ayush and Brown, Joseph and Cobley, Tom and Hrovatin, Karin and Hartvigsen, Tom and Theis, Fabian and Pentelute, Bradley L. and Khurana, Vikram and Kellis, Manolis and Zitnik, Marinka},
  title = {ProCyon: A multimodal foundation model for protein phenotypes},
  elocation-id = {2024.12.10.627665},
  year = {2024},
  doi = {10.1101/2024.12.10.627665},
  URL = {https://www.biorxiv.org/content/early/2024/12/15/2024.12.10.627665},
  eprint = {https://www.biorxiv.org/content/early/2024/12/15/2024.12.10.627665.full.pdf},
  journal = {bioRxiv}
}
```