# gLM: Genomic language model predicts protein co-regulation and function
This repository contains the training and inference code for gLM, described in the preprint "Genomic language model predicts protein co-regulation and function".
Our model and accompanying scripts are distributed for academic and non-commercial use only. Please refer to the LICENSE attached to this repo and reach out if you have any questions.
© President and Fellows of Harvard College 2023.
## Set up

```
conda env create -f environment.yml python==3.10.8
conda activate glm-env
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 -f https://download.pytorch.org/whl/torch_stable.html
```

This setup was tested using Python 3.10.8.
## Model checkpoint

The latest checkpoint of our model used for the preprint is available for download from https://zenodo.org/record/7855545

```
mkdir model
cd model
wget https://zenodo.org/record/7855545/files/glm.bin
```
## Compute gLM embeddings

gLM embeddings can be computed using the following steps:
### 1. Prepare input files

a) FASTA file of your proteins (amino acid sequences) in your contig:

```
>prot_1
MNYSHDNWSAILAHIGKPEELDTSARNAGALTRRREIRDAATLLRLGLAYGPGGMSLREVTAWAQLHDVATLSDVALLKRLRNAADWFGILAAQTLAVRAAVTGCTSGKRLRLVDGTAISAPGGGSAEWRLHMGYDPHTCQFTDFELTDSRDAERLDRFAQTADEIRIADRGFGSRPECIRSLAFGEADYIVRVHWRGLRWLTAEGMRFDMMGFLRGLDCGKNGETTVMIGNSGNKKAGAPFPARLIAVSLPPEKALISKTRLLSENRRKGRVVQAETLEAAGHVLLLTSLPEDEYSAEQVADCYRLRWQIELAFKRLKSLLHLDALRAKEPELAKAWIFANLLAAFLIDDIIQPSLDFPPRSAGSEKKN
>prot_2
MAKQDYYEILGVSKTAEEREIRKAYKRLAMKYHPDRNQGDKEAEAKFKEIKEAYEVLTDSQKRAAYDQYGHAAFEQGGMGGGGFGGGADFSDIFGDVFGDIFGGGRGRQRAARGADLRYNMELTLEEAVRGVTKEIRIPTLEECDVCHGSGAKPGTQPQTCPTCHGSGQVQMRQGFFAVQQTCPHCQGRGTLIKDPCNKCHGHGRVERSKTLSVKIPAGVDTGDRIRLAGEGEAGEHGAPAGDLYVQVQVKQHPIFEREGNNLYCEVPINFAMAALGGEIEVPTLDGRVKLKVPGETQTGKLFRMRGKGVKSVRGGAQGDLLCRVVVETPVGLNERQKQLLQELQESFGGPTGEHNSPRSKSFFDGVKKFFDDLTR
```
b) subcontig-to-protein mapping with orientation, in the following format, where '-' refers to the reverse direction and '+' refers to the forward direction relative to the rest of the contig. Make sure the number of proteins in each subcontig does not exceed max_seq_length = 30.
```
contig_0	+prot_1;-prot_2;-prot_3;-prot_4;-prot_5;+prot_6;-prot_7;+prot_8;+prot_9;+prot_10;-prot_11;-prot_12;-prot_13;-prot_14;-prot_15;-prot_16;
contig_1	+prot_17;-prot_18;-prot_19;-prot_20;-prot_21;+prot_22;-prot_23;+prot_24;+prot_25;+prot_26;-prot_27;
```
See contig_to_prots.tsv and test.fa in example_data as examples.
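If you are building the mapping file programmatically, here is a minimal sketch of one way to do it. The `contigs` variable and the `_partN` subcontig naming scheme are hypothetical illustrations, not part of this repo; the only constraints taken from this README are the `+`/`-` orientation prefixes, the `;`-separated format, and the 30-protein limit.

```python
# Hypothetical sketch: write a contig-to-proteins TSV in the format above.
# `contigs` maps each contig ID to its ordered (orientation, protein ID) pairs.
MAX_SEQ_LENGTH = 30  # gLM subcontigs may contain at most 30 proteins

contigs = {
    "contig_0": [("+", "prot_1"), ("-", "prot_2"), ("-", "prot_3")],
    "contig_1": [("+", "prot_17"), ("-", "prot_18")],
}

with open("contig_to_prots.tsv", "w") as out:
    for contig_id, prots in contigs.items():
        # Split long contigs into subcontigs of at most MAX_SEQ_LENGTH proteins.
        # The "_partN" suffix is an assumed naming scheme, not prescribed by the repo.
        for n, start in enumerate(range(0, len(prots), MAX_SEQ_LENGTH)):
            chunk = prots[start : start + MAX_SEQ_LENGTH]
            name = contig_id if len(prots) <= MAX_SEQ_LENGTH else f"{contig_id}_part{n}"
            mapping = "".join(f"{strand}{pid};" for strand, pid in chunk)
            out.write(f"{name}\t{mapping}\n")
```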
### 2. Embed proteins with ESM2

We use esm2 to embed proteins.
```
cd data
python plm_embed.py example_data/inference_example/test.fa example_data/inference_example/test.esm.embs.pkl
```
We provide the expected output (example_data/inference_example/test.esm.embs.pkl) for your reference; on an A100 GPU, this test example took less than 2 minutes to complete.
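plm_embed.py performs this step for you. Purely for orientation, here is a hedged sketch of what ESM2 protein embedding typically looks like with the fair-esm package; the model size, pooling strategy, and pickle layout that plm_embed.py actually uses may differ, so treat this as illustrative rather than a reimplementation of the script.

```python
import pickle
import torch
import esm  # fair-esm package

# Illustrative only: plm_embed.py's actual model choice and pooling may differ.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Truncated toy sequences; in practice, read these from your FASTA file.
data = [("prot_1", "MNYSHDNWSAILAHIGK"), ("prot_2", "MAKQDYYEILGVSKTAEE")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]

# Mean-pool over residues (dropping BOS/EOS tokens) to get one vector per protein.
embs = {}
for i, (name, seq) in enumerate(data):
    embs[name] = reps[i, 1 : len(seq) + 1].mean(0)

with open("test.esm.embs.pkl", "wb") as f:
    pickle.dump(embs, f)
```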
### 3. Batch your data

```
cd data
# make output directory
mkdir batched_data
python batch_data.py example_data/inference_example/test.esm.embs.pkl example_data/inference_example/contig_to_prots.tsv example_data/batched_data
```
The output directory (batched_data) now contains two files: batch.pkl and prot_index_dict.pkl. The former contains your batched input embeddings, and the latter contains the dictionary mapping from protein index to protein ID.
We provide the expected output (data/example_data/batched_data/) for your reference; this particular test example took us less than 1 minute to run.
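To sanity-check this step, you can open the two output pickles directly. A minimal sketch, assuming standard pickle serialization; the internal structure of batch.pkl is not documented in this README, so inspect whatever comes back rather than relying on any particular shape.

```python
import pickle

# Load the batched inputs and the protein-index mapping for a quick sanity check.
with open("example_data/batched_data/batch.pkl", "rb") as f:
    batch = pickle.load(f)
with open("example_data/batched_data/prot_index_dict.pkl", "rb") as f:
    prot_index_dict = pickle.load(f)

print(type(batch))                        # container of batched input embeddings
print(list(prot_index_dict.items())[:5])  # first few (protein index, protein ID) pairs
```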
### 4. Compute gLM embeddings

```
cd data
python ../gLM/glm_embed.py -d example_data/batched_data -m ../model/glm.bin -b 100 -o test_results
```
If you come across GPU memory errors, try reducing batch size (-b).
gLM embeddings will be saved as a *.glm.embs.pkl file in the output directory.
You can output all inference results (plm_embs/glm_embs/prot_ids/outputs/output_probabilities) by adding the --all_results/-a flag. These will be saved as a *.results.pkl file in the output directory.
You can also output attention matrices by adding the --attention flag. Attention matrices will be saved for post-processing in your output directory as *.attention.pkl files.
We provide the expected output in data/test_results/results/batch.pkl.glm.embs.pkl; the expected runtime for this on an A100 is ~2 minutes.
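After inference, the pickled embeddings can be loaded back for downstream analysis. A minimal sketch, assuming the pickle holds a mapping from protein IDs to embedding vectors; verify the actual structure of your *.glm.embs.pkl (and *.results.pkl, if you used -a) locally, since it may differ.

```python
import pickle

# Load gLM embeddings produced by glm_embed.py (structure assumed; verify locally).
with open("test_results/results/batch.pkl.glm.embs.pkl", "rb") as f:
    glm_embs = pickle.load(f)

print(type(glm_embs))
# If it is a dict keyed by protein ID, peek at one embedding's shape:
if isinstance(glm_embs, dict):
    prot_id, emb = next(iter(glm_embs.items()))
    print(prot_id, getattr(emb, "shape", None))
```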
We are working on making the inference code available as a Colab notebook, so stay tuned.
## Training gLM

We provide the training script so you can train gLM on your custom dataset.
```
cd data
python ../gLM/train.py -d example_data/training_example -o test_train_dir
```
The data directory (data/example_data/training_example) contains batched training data, which can be generated using batch_data.py (see sections 1-3 in "Compute gLM embeddings" above). Make sure pkl files containing training data start with "train" and pkl files containing eval data start with "eval". For example:
```
ls example_data/training_example
eval.0.PC_100.pkl  train.0.PC_100.pkl  train.1.PC_100.pkl
```
Running python train.py -h will show the many hyperparameter flags that can be tweaked to suit your training.
Training log files, checkpoints, and pretrained models are stored in the output directory.
Note: When there are checkpoints already saved in the specified output directory, the script will automatically load the latest checkpoint and continue training from there.
## Downstream analyses

We include scripts used for downstream analyses and visualizations (e.g. EC number analysis and operon prediction) in the gLM directory.
## Citation

If you find gLM useful in your work, please cite our paper:
Hwang, Y., Cornman, A. L., Kellogg, E. H., Ovchinnikov, S., and Girguis, P. R. (2023) "Genomic language model predicts protein co-regulation and function", bioRxiv. doi: 10.1101/2023.04.07.536042
```
@article{Hwang2023.04.07.536042,
  author = {Yunha Hwang and Andre L. Cornman and Elizabeth H. Kellogg and Sergey Ovchinnikov and Peter R. Girguis},
  title = {Genomic language model predicts protein co-regulation and function},
  elocation-id = {2023.04.07.536042},
  year = {2023},
  doi = {10.1101/2023.04.07.536042},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2023/10/15/2023.04.07.536042},
  eprint = {https://www.biorxiv.org/content/early/2023/10/15/2023.04.07.536042.full.pdf},
  journal = {bioRxiv}
}
```