
Nature Methods: RNA foundation model (together with RhoFold)


ml4bio/RNA-FM


Update March 2024:

  1. Tutorials for RNA family clustering and RNA type classification, and a tutorial video (in Chinese).
  2. mRNA-FM, a foundation model pre-trained on coding sequences (CDS) in mRNA, is now released! The model takes CDSs as input and represents them with contextual embeddings, benefiting mRNA and protein related tasks.

This repository contains code and pre-trained models for the RNA foundation model (RNA-FM). RNA-FM outperforms all tested single-sequence RNA language models across a variety of structure prediction tasks as well as several function-related tasks. You can find more details about RNA-FM in our paper, "Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions" (Chen et al., 2022).


Citation
    @article{chen2022interpretable,
      title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},
      author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
      journal={arXiv preprint arXiv:2204.00300},
      year={2022}
    }

Create Environment with Conda

First, download the repository and create the environment.

    git clone https://github.com/ml4bio/RNA-FM.git
    cd ./RNA-FM
    conda env create -f environment.yml

Then, activate the "RNA-FM" environment and change into the workspace directory.

    conda activate RNA-FM
    cd ./redevelop

Access pre-trained models.

Download the pre-trained models from this gdrive link and place the .pth files into the pretrained folder.

Apply RNA-FM with Existing Scripts.

1. Embedding Extraction.

    python launch/predict.py --config="pretrained/extract_embedding.yml" \
    --data_path="./data/examples/example.fasta" --save_dir="./results" \
    --save_frequency 1 --save_embeddings

RNA-FM embeddings with shape (L, 640), where L is the sequence length, will be saved in $save_dir/representations.

For mRNA-FM, pass the extra argument MODEL.BACKBONE_NAME:

    python launch/predict.py --config="pretrained/extract_embedding.yml" \
    --data_path="./data/examples/example.fasta" --save_dir="./results" \
    --save_frequency 1 --save_embeddings --save_embeddings_format raw MODEL.BACKBONE_NAME mrna-fm
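The commands above read their input from a FASTA file passed via --data_path. As a minimal sketch of preparing such a file for your own sequences, the helpers below convert between (name, sequence) pairs and FASTA text. `format_fasta` and `parse_fasta` are hypothetical names used for illustration and are not part of this repository; the sketch assumes plain FASTA with a ">" header line per record.

```python
def format_fasta(records):
    """Render (name, sequence) pairs as FASTA text, one header and one sequence line each."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Parse FASTA text into (name, sequence) pairs; wrapped sequence lines are joined."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Illustrative records only; replace with your own sequences.
records = [("RNA1", "GGGUGCGAUCAUACC"), ("RNA2", "GGGUGUCGCUCAGUU")]
fasta_text = format_fasta(records)
```

Writing `fasta_text` to a file and pointing --data_path at it should then work in the same way as the bundled example.fasta, assuming the script accepts standard FASTA.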

2. Downstream Prediction - RNA secondary structure.

    python launch/predict.py --config="pretrained/ss_prediction.yml" \
    --data_path="./data/examples/example.fasta" --save_dir="./results" \
    --save_frequency 1

The predicted probability maps will be saved as .npy files, and the post-processed binary predictions will be saved as .ct files. You can find them in $save_dir/r-ss.
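To inspect a predicted structure, the .ct output can be converted to dot-bracket notation. The following is a minimal sketch that assumes the files follow the common connectivity-table layout (a header line starting with the number of bases, then one line per base with the index in column 1, the base in column 2, and the 1-based pairing partner in column 5, with 0 meaning unpaired). `parse_ct` and `to_dot_bracket` are hypothetical helpers, not part of this repository, and the example content is made up for illustration.

```python
def parse_ct(text):
    """Return (sequence, pairs) from connectivity-table text.

    pairs is a list of 1-based (i, j) tuples with i < j.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_bases = int(lines[0].split()[0])  # header: "<n_bases> <title>"
    seq, pairs = [], []
    for ln in lines[1:n_bases + 1]:
        fields = ln.split()
        i, base, j = int(fields[0]), fields[1], int(fields[4])
        seq.append(base)
        if j > i:  # record each pair once; j == 0 means unpaired
            pairs.append((i, j))
    return "".join(seq), pairs


def to_dot_bracket(length, pairs):
    """Render pairs as dot-bracket; pseudoknots are not distinguished here."""
    chars = ["."] * length
    for i, j in pairs:
        chars[i - 1] = "("
        chars[j - 1] = ")"
    return "".join(chars)


# Made-up 7-nt hairpin used only to demonstrate the format.
EXAMPLE_CT = """7 example
1 G 0 2 7 1
2 C 1 3 6 2
3 A 2 4 0 3
4 A 3 5 0 4
5 A 4 6 0 5
6 G 5 7 2 6
7 C 6 0 1 7
"""

sequence, pairs = parse_ct(EXAMPLE_CT)
structure = to_dot_bracket(len(sequence), pairs)  # "((...))"
```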

3. Online Version - RNA-FM server.

If you have any trouble deploying RNA-FM locally, you can use its online version, the RNA-FM server. You can submit jobs on the server and download the results afterwards, without setting up an environment or using your own computational resources.

Quick Start for Further Development.

Python 3.8 (or possibly a later version) and PyTorch are prerequisites for using this repository. If you just want to use the pre-trained language model, you can install rna-fm in your own environment with pip. You can either install rna-fm from PyPI:

pip install rna-fm

or install rna-fm from GitHub:

    cd ./RNA-FM
    pip install .

After installation, you can load the RNA-FM and extract its embeddings with the following code:

    import torch
    import fm

    # Load RNA-FM model
    model, alphabet = fm.pretrained.rna_fm_t12()
    batch_converter = alphabet.get_batch_converter()
    model.eval()  # disables dropout for deterministic results

    # Prepare data
    data = [
        ("RNA1", "GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCU"),
        ("RNA2", "GGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
        ("RNA3", "CGAUUCNCGUUCCC--CCGCCUCCA"),
    ]
    batch_labels, batch_strs, batch_tokens = batch_converter(data)

    # Extract embeddings (on CPU)
    with torch.no_grad():
        results = model(batch_tokens, repr_layers=[12])
    token_embeddings = results["representations"][12]
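Many downstream tasks need one vector per sequence rather than one per token. A minimal sketch of mean pooling over the per-token embeddings follows. It assumes that a single special start token precedes the nucleotides in each row and that padding follows them, as in ESM-style batch converters; this is an assumption and should be checked against the alphabet you load. `mean_pool` is a hypothetical helper, not part of the fm package.

```python
import torch


def mean_pool(token_embeddings, lengths, n_prefix=1):
    """Average per-token embeddings into one vector per sequence.

    token_embeddings: tensor of shape (batch, padded_len, dim)
    lengths: number of nucleotides in each sequence
    n_prefix: special tokens before the first nucleotide (assumed to be 1)
    """
    pooled = []
    for i, n in enumerate(lengths):
        # Slice out only the real nucleotide positions, skipping
        # the start token and any trailing end/padding tokens.
        pooled.append(token_embeddings[i, n_prefix:n_prefix + n].mean(dim=0))
    return torch.stack(pooled)
```

With the variables from the snippet above, a call might look like `mean_pool(token_embeddings, [len(s) for s in batch_strs])`, giving a tensor of shape (batch, 640) under the stated assumption.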

More tutorials can be found at https://ml4bio.github.io/RNA-FM/. The related notebooks are stored in the tutorials folder.

For mRNA-FM, the above code needs a slight revision. Note that the length of each input RNA sequence must be a multiple of 3 so that the sequence can be tokenized into a series of codons (3-mers).

    import torch
    import fm

    # Load mRNA-FM model
    model, alphabet = fm.pretrained.mrna_fm_t12()
    batch_converter = alphabet.get_batch_converter()
    model.eval()  # disables dropout for deterministic results

    # Prepare data
    data = [
        ("CDS1", "AUGGGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCUA"),
        ("CDS2", "AUGGGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
        ("CDS3", "AUGCGAUUCNCGUUCCC--CCGCCUCC"),
    ]
    batch_labels, batch_strs, batch_tokens = batch_converter(data)

    # Extract embeddings (on CPU)
    with torch.no_grad():
        results = model(batch_tokens, repr_layers=[12])
    token_embeddings = results["representations"][12]
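The multiple-of-3 requirement can be checked before sequences reach the batch converter. The sketch below validates the length and shows the codon view of a sequence. `split_codons` is a hypothetical helper for illustration only; the actual mRNA-FM tokenizer may differ in how it treats ambiguous or gap characters.

```python
def split_codons(cds):
    """Split a coding sequence into consecutive 3-mers.

    Raises ValueError if the length is not a multiple of 3.
    """
    if len(cds) % 3 != 0:
        raise ValueError(
            f"CDS length {len(cds)} is not a multiple of 3; "
            "trim or pad the sequence before tokenization"
        )
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


codons = split_codons("AUGGGGUGC")  # ["AUG", "GGG", "UGC"]
```

Running such a check over all inputs first makes length problems surface as a clear error message instead of a failure inside the model call.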

Related RNA Language Models (BERT-style)

| Shorthand | Code | Subject | Layers | Embed Dim | Max Length | Input | Token | Dataset | Description | Year | Publisher |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RNA-FM | Yes | ncRNA | 12 | 640 | 1024 | Seq | base | RNAcentral 19 (23 million samples) | The first RNA language model for general purpose | 2022.04 | arxiv/bioRxiv |
| RNABERT | Yes | ncRNA | 6 | 120 | 440 | Seq | base | RNAcentral (762370) & Rfam 14.3 dataset (trained with partial MSA) | Specialized in structural alignment and clustering | 2022.02 | NAR Genomics and Bioinformatics |
| UNI-RNA | No | RNA | 24 | 1280 | $\infty$ | Seq | base | RNAcentral & nt & GWH (1 billion) | A general model with larger scale and datasets than RNA-FM | 2023.07 | bioRxiv |
| RNA-MSM | Yes | ncRNA | 12 | 768 | 1024 | MSA | base | 4069 RNA families from Rfam 14.7 | A model that utilizes evolutionary information from MSA directly | 2023.03 | NAR |
| SpliceBERT | Yes | pre-mRNA | 6 | 1024 | 512 | Seq | base | 2 million precursor messenger RNA (pre-mRNA) | Specialized in RNA splicing of pre-mRNA | 2023.05 | bioRxiv |
| CodonBERT | No | mRNA CDS | 12 | 768 | 512*2 | Seq | codon (3mer) | 10 million mRNAs from NCBI | Only focuses on CDS of mRNA without UTRs | 2023.09 | bioRxiv |
| UTR-LM | Yes | mRNA 5'UTR | 6 | 128 | $\infty$ | Seq | base | 700K 5'UTRs from Ensembl & eGFP & mCherry & Cao | Used for 5'UTR and mRNA expression related tasks | 2023.10 | bioRxiv |
| 3UTRBERT | Yes | mRNA 3'UTR | 12 | 768 | 512 | Seq | k-mer | 20,362 3'UTRs | Used for 3'UTR mediated gene regulation tasks | 2023.09 | bioRxiv |
| BigRNA | No | DNA | - | - | - | Seq | - | thousands of genome-matched datasets | tissue-specific RNA expression, splicing, microRNA sites, and RNA binding protein | 2023.09 | bioRxiv |

Citations

If you find the models useful in your research, we ask that you cite the relevant paper:

    @article{shen2024accurate,
      title={Accurate RNA 3D structure prediction using a language model-based deep learning approach},
      author={Shen, Tao and Hu, Zhihang and Sun, Siqi and Liu, Di and Wong, Felix and Wang, Jiuming and Chen, Jiayang and Wang, Yixuan and Hong, Liang and Xiao, Jin and others},
      journal={Nature Methods},
      pages={1--12},
      year={2024},
      publisher={Nature Publishing Group US New York}
    }

For more details:

    @article{chen2022interpretable,
      title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},
      author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
      journal={arXiv preprint arXiv:2204.00300},
      year={2022}
    }

The model in this code builds on the esm sequence modeling framework, and we use the fairseq sequence modeling framework to train our RNA language model. We greatly appreciate these two excellent works!

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.
