# gLM: Genomic language model predicts protein co-regulation and function
This repository contains the training and inference code for gLM, described in the preprint "Genomic language model predicts protein co-regulation and function".
Our model and accompanying scripts are distributed for academic and non-commercial use only. Please refer to the LICENSE attached to this repo and reach out if you have any questions.
© President and Fellows of Harvard College 2023.
## Set up

```
conda env create -f environment.yml python==3.10.8
conda activate glm-env
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 -f https://download.pytorch.org/whl/torch_stable.html
```

This setup was tested using Python 3.10.8.
## Model checkpoint

The latest checkpoint of our model used for the preprint is available for download from https://zenodo.org/record/7855545

```
mkdir model
cd model
wget https://zenodo.org/record/7855545/files/glm.bin
```
## Compute gLM embeddings

gLM embeddings can be computed using the following steps:
### 1. Prepare input files

a) FASTA file of your proteins (amino acid sequences) in your contig:

```
>prot_1
MNYSHDNWSAILAHIGKPEELDTSARNAGALTRRREIRDAATLLRLGLAYGPGGMSLREVTAWAQLHDVATLSDVALLKRLRNAADWFGILAAQTLAVRAAVTGCTSGKRLRLVDGTAISAPGGGSAEWRLHMGYDPHTCQFTDFELTDSRDAERLDRFAQTADEIRIADRGFGSRPECIRSLAFGEADYIVRVHWRGLRWLTAEGMRFDMMGFLRGLDCGKNGETTVMIGNSGNKKAGAPFPARLIAVSLPPEKALISKTRLLSENRRKGRVVQAETLEAAGHVLLLTSLPEDEYSAEQVADCYRLRWQIELAFKRLKSLLHLDALRAKEPELAKAWIFANLLAAFLIDDIIQPSLDFPPRSAGSEKKN
>prot_2
MAKQDYYEILGVSKTAEEREIRKAYKRLAMKYHPDRNQGDKEAEAKFKEIKEAYEVLTDSQKRAAYDQYGHAAFEQGGMGGGGFGGGADFSDIFGDVFGDIFGGGRGRQRAARGADLRYNMELTLEEAVRGVTKEIRIPTLEECDVCHGSGAKPGTQPQTCPTCHGSGQVQMRQGFFAVQQTCPHCQGRGTLIKDPCNKCHGHGRVERSKTLSVKIPAGVDTGDRIRLAGEGEAGEHGAPAGDLYVQVQVKQHPIFEREGNNLYCEVPINFAMAALGGEIEVPTLDGRVKLKVPGETQTGKLFRMRGKGVKSVRGGAQGDLLCRVVVETPVGLNERQKQLLQELQESFGGPTGEHNSPRSKSFFDGVKKFFDDLTR
```
b) subcontig-to-protein mapping with orientation, in the following format, where '-' refers to the reverse direction and '+' refers to the forward direction relative to the rest of the contig. Make sure the number of proteins in each subcontig does not exceed max_seq_length = 30.
```
contig_0	+prot_1;-prot_2;-prot_3;-prot_4;-prot_5;+prot_6;-prot_7;+prot_8;+prot_9;+prot_10;-prot_11;-prot_12;-prot_13;-prot_14;-prot_15;-prot_16;
contig_1	+prot_17;-prot_18;-prot_19;-prot_20;-prot_21;+prot_22;-prot_23;+prot_24;+prot_25;+prot_26;-prot_27;
```
See contig_to_prots.tsv and test.fa in example_data as examples.
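If you are building the mapping file programmatically, here is a minimal sketch of one way to do it. The `contigs` variable and the `_partN` subcontig naming scheme are hypothetical illustrations, not part of this repo; the only constraints taken from this README are the `+`/`-` orientation prefixes, the `;`-separated format, and the 30-protein limit.

```python
# Hypothetical sketch: write a contig-to-proteins TSV in the format above.
# `contigs` maps each contig ID to its ordered (orientation, protein ID) pairs.
MAX_SEQ_LENGTH = 30  # gLM subcontigs may contain at most 30 proteins

contigs = {
    "contig_0": [("+", "prot_1"), ("-", "prot_2"), ("-", "prot_3")],
    "contig_1": [("+", "prot_17"), ("-", "prot_18")],
}

with open("contig_to_prots.tsv", "w") as out:
    for contig_id, prots in contigs.items():
        # Split long contigs into subcontigs of at most MAX_SEQ_LENGTH proteins.
        # The "_partN" suffix is an assumed naming scheme, not prescribed by the repo.
        for n, start in enumerate(range(0, len(prots), MAX_SEQ_LENGTH)):
            chunk = prots[start : start + MAX_SEQ_LENGTH]
            name = contig_id if len(prots) <= MAX_SEQ_LENGTH else f"{contig_id}_part{n}"
            mapping = "".join(f"{strand}{pid};" for strand, pid in chunk)
            out.write(f"{name}\t{mapping}\n")
```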
### 2. Embed proteins with ESM2

We use esm2 to embed proteins.
```
cd data
python plm_embed.py example_data/inference_example/test.fa example_data/inference_example/test.esm.embs.pkl
```
We provide the expected output (example_data/inference_example/test.esm.embs.pkl) for your reference; on an A100 GPU, this test example took less than 2 minutes to complete.
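plm_embed.py performs this step for you. Purely for orientation, here is a hedged sketch of what ESM2 protein embedding typically looks like with the fair-esm package; the model size, pooling strategy, and pickle layout that plm_embed.py actually uses may differ, so treat this as illustrative rather than a reimplementation of the script.

```python
import pickle
import torch
import esm  # fair-esm package

# Illustrative only: plm_embed.py's actual model choice and pooling may differ.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Truncated toy sequences; in practice, read these from your FASTA file.
data = [("prot_1", "MNYSHDNWSAILAHIGK"), ("prot_2", "MAKQDYYEILGVSKTAEE")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]

# Mean-pool over residues (dropping BOS/EOS tokens) to get one vector per protein.
embs = {}
for i, (name, seq) in enumerate(data):
    embs[name] = reps[i, 1 : len(seq) + 1].mean(0)

with open("test.esm.embs.pkl", "wb") as f:
    pickle.dump(embs, f)
```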
### 3. Batch your data

```
cd data
# make output directory
mkdir batched_data
python batch_data.py example_data/inference_example/test.esm.embs.pkl example_data/inference_example/contig_to_prots.tsv example_data/batched_data
```
The output directory (batched_data) now contains two files: batch.pkl and prot_index_dict.pkl. The former contains your batched input embeddings, and the latter contains the dictionary mapping from protein index to protein ID.
We provide the expected output (data/example_data/batched_data/) for your reference; this particular test example took us less than 1 minute to run.
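To sanity-check this step, you can open the two output pickles directly. A minimal sketch, assuming standard pickle serialization; the internal structure of batch.pkl is not documented in this README, so inspect whatever comes back rather than relying on any particular shape.

```python
import pickle

# Load the batched inputs and the protein-index mapping for a quick sanity check.
with open("example_data/batched_data/batch.pkl", "rb") as f:
    batch = pickle.load(f)
with open("example_data/batched_data/prot_index_dict.pkl", "rb") as f:
    prot_index_dict = pickle.load(f)

print(type(batch))                        # container of batched input embeddings
print(list(prot_index_dict.items())[:5])  # first few (protein index, protein ID) pairs
```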
### 4. Compute gLM embeddings

```
cd data
python ../gLM/glm_embed.py -d example_data/batched_data -m ../model/glm.bin -b 100 -o test_results
```
If you come across GPU memory errors, try reducing batch size (-b).
gLM embeddings will be saved as a *.glm.embs.pkl file in the output directory.
You can output all inference results (plm_embs/glm_embs/prot_ids/outputs/output_probabilities) by adding the --all_results/-a flag. These will be saved as a *.results.pkl file in the output directory.
You can also output attention matrices by adding the --attention flag. Attention matrices will be saved for post-processing in your output directory as *.attention.pkl files.
We provide the expected output in data/test_results/results/batch.pkl.glm.embs.pkl; the expected runtime for this on an A100 is ~2 minutes.
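After inference, the pickled embeddings can be loaded back for downstream analysis. A minimal sketch, assuming the pickle holds a mapping from protein IDs to embedding vectors; verify the actual structure of your *.glm.embs.pkl (and *.results.pkl, if you used -a) locally, since it may differ.

```python
import pickle

# Load gLM embeddings produced by glm_embed.py (structure assumed; verify locally).
with open("test_results/results/batch.pkl.glm.embs.pkl", "rb") as f:
    glm_embs = pickle.load(f)

print(type(glm_embs))
# If it is a dict keyed by protein ID, peek at one embedding's shape:
if isinstance(glm_embs, dict):
    prot_id, emb = next(iter(glm_embs.items()))
    print(prot_id, getattr(emb, "shape", None))
```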
We are working on making the inference code available as a Colab notebook, so stay tuned.
## Training gLM

We provide the training script so you can train gLM on your custom dataset.
```
cd data
python ../gLM/train.py -d example_data/training_example -o test_train_dir
```
The data directory (data/example_data/training_example) contains batched training data, which can be generated using batch_data.py (see sections 1-3 in "Compute gLM embeddings" above). Make sure pkl files containing training data start with "train" and pkl files containing eval data start with "eval". For example:
```
ls example_data/training_example
eval.0.PC_100.pkl  train.0.PC_100.pkl  train.1.PC_100.pkl
```
Running python train.py -h will show the many hyperparameter flags that can be tweaked to suit your training.
Training log files, checkpoints, and pretrained models are stored in the output directory.
Note: When there are checkpoints already saved in the specified output directory, the script will automatically load the latest checkpoint and continue training from there.
## Downstream analyses

We include scripts used for downstream analyses and visualizations (e.g. EC number analysis and operon prediction) in the gLM directory.
## Citation

If you find gLM useful in your work, please cite our paper:
Hwang, Y., Cornman, A. L., Kellogg, E. H., Ovchinnikov, S., and Girguis, P. R. (2023) "Genomic language model predicts protein co-regulation and function", bioRxiv. doi: 10.1101/2023.04.07.536042
```
@article{Hwang2023.04.07.536042,
  author = {Yunha Hwang and Andre L. Cornman and Elizabeth H. Kellogg and Sergey Ovchinnikov and Peter R. Girguis},
  title = {Genomic language model predicts protein co-regulation and function},
  elocation-id = {2023.04.07.536042},
  year = {2023},
  doi = {10.1101/2023.04.07.536042},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2023/10/15/2023.04.07.536042},
  eprint = {https://www.biorxiv.org/content/early/2023/10/15/2023.04.07.536042.full.pdf},
  journal = {bioRxiv}
}
```