ml4bio/RNA-FM

Nature Methods: RNA foundation model (together with RhoFold)

RNA-FM: The RNA Foundation Model

Introduction

RNA-FM (RNA Foundation Model) is a state-of-the-art pretrained language model for RNA sequences, serving as the cornerstone of an integrated RNA research ecosystem. Trained on 23+ million non-coding RNA (ncRNA) sequences via self-supervised learning, RNA-FM extracts structural and functional information from RNA sequences without relying on experimental labels. Consequently, RNA-FM generates general-purpose RNA embeddings suitable for a broad range of downstream tasks, including but not limited to secondary and tertiary structure prediction, RNA family clustering, and functional RNA analysis.

mRNA-FM is a direct extension of RNA-FM, trained exclusively on 45 million mRNA coding sequences (CDS). It is specifically designed to capture information unique to mRNA and has demonstrated excellent performance in related tasks.

Originally introduced in Nature Methods as a foundation model for RNA biology, RNA-FM outperforms all evaluated single-sequence RNA language models across a wide range of structure and function benchmarks. Building upon this foundation, our team developed an integrated RNA pipeline that includes:

RhoFold – High-accuracy RNA tertiary structure prediction (sequence → structure).
RiboDiffusion – Diffusion-based inverse folding for RNA 3D design (structure → sequence).
RhoDesign – Geometric deep learning approach to RNA design (structure → sequence).

These tools work alongside RNA-FM to predict RNA structures from sequence, design new RNA sequences that could fold into desired 3D structures, and analyze functional properties. Our integrated ecosystem is built to advance the development of RNA therapeutics, drive innovation in synthetic biology, and deepen our understanding of RNA structure-function relationships.

References

@article{chen2022interpretable,
  title={Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions},
  author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
  journal={arXiv preprint arXiv:2204.00300},
  year={2022}
}

Foundation Models and Extended Ecosystem

RNA-FM Ecosystem Components: Our platform comprises four integrated tools, each addressing a critical step in the RNA analysis and design pipeline:

Model	Task	Description	Code	Paper
RNA-FM	Foundation Model (Representation)	Pretrained transformer (BERT-style) for ncRNA sequences (for RNA-FM) and messenger RNA sequences (for mRNA-FM); extracts embeddings and predicts base-pairing probabilities	GitHub	Nature Methods
RhoFold	3D Structure Prediction	RNA-FM-powered model for sequence-to-structure prediction (3D coordinates + secondary structure)	GitHub	Nature Methods
RiboDiffusion	Inverse Folding	Generative diffusion model for structure-to-sequence RNA design	GitHub	ISMB'2024
RhoDesign	Inverse Folding	Geometric deep learning model (GVP+Transformer) for structure-to-sequence design	GitHub	Nature Computational Science

Foundation Models

Model	Training Corpus	# Sequences	Layers / Hidden	Params	Typical Use-cases
RNA-FM	non-coding RNAs	23.7 M	12 / 640	99 M	ncRNA structure & function, aptamer design
mRNA-FM	messenger RNAs	45 M	12 / 1280	239 M	mRNA expression modelling, CDS analysis

RNA-FM

RNA-FM is a 12-layer BERT encoder pre-trained with masked-token prediction on 23.7 M non-coding RNA sequences (RNAcentral100). It yields 640-d embeddings that already encode secondary-structure, 3-D proximity and even evolutionary signals, making it the representation backbone for every downstream tool in the ecosystem.
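To make the masked-token objective concrete, here is a minimal sketch of how one training example could be built. It is an illustration only, not the RNA-FM training code: the `<mask>` symbol, the 15% masking rate, and the function name `mask_sequence` are assumptions, and real BERT-style pipelines usually also replace some selected positions with random or unchanged tokens.

```python
import random

# Assumed mask symbol, for illustration only.
MASK = "<mask>"


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets) for a BERT-style masked-token objective.

    targets maps each masked position to the original nucleotide that the
    model would be trained to recover.
    """
    rng = random.Random(seed)
    tokens = list(seq)
    # Mask at least one position so short sequences still yield a target.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets


sequence = "GGGUGCGAUCAUACCAGCACUAAUGCCCUCC"
masked, targets = mask_sequence(sequence)
print(" ".join(masked))
print(targets)
```

The model sees only `masked` and is scored on how well it predicts the nucleotides stored in `targets`, which is why no experimental labels are needed.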
- RNA-FM for Secondary Structure Prediction:
  - Outperforms classic physics-based and machine learning methods (e.g., LinearFold, SPOT-RNA, UFold) by up to 20–30% in F1-score on challenging datasets.
  - Performance gains are especially notable for long RNAs (>150 nucleotides) and low-homology families.
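The F1-score quoted above is computed over predicted base pairs. A minimal sketch of that metric, assuming pairs are given as 0-based index tuples (the function name `base_pair_f1` is hypothetical, and benchmark scripts may additionally tolerate one-position shifts, which this version does not):

```python
def base_pair_f1(predicted, reference):
    """F1-score between two collections of base pairs (i, j)."""
    # Normalise so that (i, j) and (j, i) count as the same pair.
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = [(0, 9), (1, 8), (2, 7)]
predicted = [(0, 9), (1, 8), (3, 6)]
print(f"F1 = {base_pair_f1(predicted, reference):.3f}")
```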

mRNA-FM

mRNA-FM, an extension of RNA-FM, is exclusively trained on 45 million mRNA coding sequences (CDS). Purpose-built to model mRNA-specific features, it achieves state-of-the-art performance in mRNA-related tasks.

Downstream Tools

RhoFold (Tertiary Structure Prediction)

RhoFold (Tertiary Structure Prediction) – An RNA-FM–powered predictor for RNA 3D structures. Given an RNA sequence, RhoFold rapidly predicts its tertiary structure (3D coordinates in PDB format) along with the secondary structure (CT file) and per-residue confidence scores. It achieves high accuracy on RNA 3D benchmarks by combining RNA-FM embeddings with a structure prediction network, significantly outperforming prior methods in the RNA-Puzzles challenge.
RhoFold leverages the powerful embeddings from RNA-FM to revolutionize RNA tertiary structure prediction. By combining deep learning with structural biology principles, RhoFold translates RNA sequences directly into accurate 3D coordinates. The model employs a multi-stage architecture that first converts RNA-FM's contextual representations into distance maps and torsion angles, then assembles these into complete three-dimensional structures. Unlike previous approaches that often struggle with RNA's complex folding landscapes, RhoFold's foundation model approach captures subtle sequence-structure relationships, enabling state-of-the-art performance on challenging benchmarks like RNA-Puzzles. The system works in both single-sequence mode for rapid predictions and can incorporate multiple sequence alignments (MSA) when higher accuracy is needed, making it versatile for various research applications from small RNAs to complex ribozymes and riboswitches.
- RhoFold for Tertiary Structure:
  - Delivers top accuracy on RNA-Puzzles / CASP-type tasks.
  - Predicts 3D structures within seconds (single-sequence mode) and integrates MSA for further accuracy gains.
  - Validated in Nature Methods, generalizing to novel RNA families.

RiboDiffusion (Inverse Folding – Diffusion)

RiboDiffusion (Inverse Folding – Diffusion) – A diffusion-based inverse folding model for RNA design. Starting from a target 3D backbone structure, RiboDiffusion iteratively generates RNA sequences that fold into that shape. This generative approach yields higher sequence recovery (≈11–16% improvement) than previous inverse folding algorithms, while offering tunable diversity in the designed sequences.
RiboDiffusion represents a breakthrough in RNA inverse folding through diffusion-based generative modeling. While traditional RNA design methods often struggle with the vast sequence space, RiboDiffusion employs a novel approach inspired by recent advances in generative AI. Starting with random noise, the model iteratively refines RNA sequences to conform to target 3D backbones through a carefully controlled diffusion process. This approach allows RiboDiffusion to explore diverse sequence solutions while maintaining structural fidelity, a critical balance in biomolecular design. The diffusion framework inherently provides sequence diversity, enabling researchers to generate and test multiple candidate designs that all satisfy structural constraints. Published benchmarks demonstrate that RiboDiffusion achieves superior sequence recovery rates compared to previous methods, making it particularly valuable for designing functional RNAs like riboswitches, aptamers, and other structured elements where sequence-structure relationships are crucial.
- RiboDiffusion for Inverse Folding:
  - A diffusion-based generative approach that surpasses prior methods by ~11–16% in sequence recovery rate.
  - Provides tunable diversity in design, exploring multiple valid sequences for a single target shape.
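Sequence recovery, the metric quoted above, is the fraction of positions at which a designed sequence reproduces the native one. A minimal sketch, assuming both sequences are already aligned and of equal length (the function name and the toy sequences are made up for illustration):

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed base equals the native base."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        return 0.0
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


native = "GGGAAACUCC"
designed = "GGCAAACUCG"
print(f"recovery = {sequence_recovery(designed, native):.2f}")
```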

RhoDesign (Inverse Folding – Deterministic)

RhoDesign (Inverse Folding – Deterministic) – A deterministic geometric deep learning model for RNA design. RhoDesign uses graph neural networks (GVP) and Transformers to directly decode sequences for a given 3D structure (optionally incorporating secondary structure constraints). It achieves state-of-the-art accuracy in matching target structures, with sequence recovery rates exceeding 50% on standard benchmarks (nearly double traditional methods) and the highest structural fidelity (TM-scores) among current solutions.
RhoDesign introduces a deterministic approach to RNA inverse folding using geometric deep learning. Unlike diffusion-based methods, RhoDesign directly translates 3D structural information into RNA sequences through a specialized architecture combining Graph Vector Perceptrons (GVP) and Transformer networks. This architecture effectively captures both local geometric constraints and global structural patterns in RNA backbones. RhoDesign can incorporate optional secondary structure constraints, allowing researchers to specify certain base-pairing patterns while letting the model optimize the remaining sequence. Benchmark tests demonstrate that RhoDesign achieves remarkable sequence recovery rates exceeding 50% on standard datasets—nearly double the performance of traditional methods. Moreover, the designed sequences exhibit the highest structural fidelity (as measured by TM-score) among current approaches. This combination of accuracy and efficiency makes RhoDesign particularly suitable for precision RNA engineering applications where structural integrity is paramount.
- RhoDesign for Inverse Folding:
  - A deterministic GVP + Transformer model with >50% sequence recovery on standard 3D design benchmarks, nearly double that of older algorithms.
  - Achieves highest structural fidelity (TM-score) among tested methods, validated in Nature Computational Science.

Unified Workflow: These tools operate in concert to enable end-to-end RNA engineering. For any RNA sequence of interest, one can predict its structure (secondary and tertiary) using RNA-FM and RhoFold. Conversely, given a desired RNA structure, one can design candidate sequences using RiboDiffusion or RhoDesign (or both for cross-validation). Designed sequences can then be validated by folding them back with RhoFold, closing the loop. This forward-and-inverse design cycle, all powered by RNA-FM embeddings, creates a powerful closed-loop workflow for exploring RNA structure-function space. By seamlessly integrating prediction and design, the RNA-FM ecosystem accelerates the design-build-test paradigm in RNA science, laying the groundwork for breakthroughs in RNA therapeutics, synthetic biology constructs, and our understanding of RNA biology.
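The design-then-validate loop described above can be sketched as a small driver function. Everything here is hypothetical scaffolding: `design_fn`, `fold_fn`, and `score_fn` are placeholder callables that the caller would wire to an inverse-folding tool, a structure predictor, and a structure-similarity metric. None of them is an actual API of RhoFold, RiboDiffusion, or RhoDesign.

```python
def design_and_validate(target, design_fn, fold_fn, score_fn,
                        n_candidates=5, min_score=0.5):
    """Design candidates for `target`, fold each back, keep those that score well.

    design_fn(target, n)   -> list of candidate sequences (inverse folding)
    fold_fn(sequence)      -> predicted structure         (forward folding)
    score_fn(pred, target) -> similarity in [0, 1]        (structure agreement)
    All three callables are placeholders supplied by the caller.
    """
    accepted = []
    for seq in design_fn(target, n_candidates):
        predicted = fold_fn(seq)
        score = score_fn(predicted, target)
        if score >= min_score:
            accepted.append((score, seq))
    # Best-scoring designs first.
    return sorted(accepted, reverse=True)


# Toy stand-ins (dot-bracket strings) so the sketch runs without any model.
toy_folds = {"GGAACC": "((..))", "GAAAAC": "(....)", "AAAAAA": "......"}
target = "((..))"
ranked = design_and_validate(
    target,
    design_fn=lambda t, n: list(toy_folds)[:n],
    fold_fn=toy_folds.get,
    score_fn=lambda p, t: sum(a == b for a, b in zip(p, t)) / len(t),
)
print(ranked)
```

Passing the tools in as callables keeps the loop independent of any one implementation, so the two inverse-folding tools can be swapped or combined without changing the driver.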

Applications

RNA 3D Structure Prediction

Accurate RNA 3D structure prediction using a language-model–based deep learning approach – introduces RhoFold+, which couples RNA-FM embeddings with a geometry module to reach SOTA accuracy on CASP/RNA-Puzzles benchmarks (PAPER, CODE)
NuFold: end-to-end RNA tertiary-structure prediction – integrates RNA-FM features into a U-former backbone, achieving accuracy competitive with state-of-the-art fold predictors (PAPER, CODE)
TorRNA – improved backbone-torsion prediction by leveraging large language models – uses RNA-FM as sequence encoder and cuts median torsion-angle error by 2–16% versus previous methods (PAPER)

RNA Design & Inverse Folding

Deep generative design of RNA aptamers using structural predictions – employs RhoDesign to create Mango aptamer variants with >3-fold fluorescence gain (wet-lab verified) (PAPER, CODE)
RiboDiffusion: tertiary-structure-based RNA inverse folding with generative diffusion models – diffusion sampler trained on RhoFold-generated data; boosts native-sequence recovery by 11–16% over secondary-structure baselines (PAPER, CODE)
gRNAde: geometric deep learning for 3-D RNA inverse design – validates every design by forward-folding with RhoFold, achieving 56% native-sequence recovery vs 45% for Rosetta (PAPER, CODE)
RILLIE framework – integrates a 1.6 B-parameter RNA LM with RhoDesign for in-silico directed evolution of Broccoli/Pepper aptamers (CODE)

Functional Annotation & Subcellular Localisation

RNALoc-LM: RNA subcellular localisation prediction with a pre-trained RNA language model – replaces one-hot inputs with RNA-FM embeddings, raising MCC by 4–8% for lncRNA, circRNA and miRNA localisation (PAPER, CODE)
PlantRNA-FM: an interpretable RNA foundation model for plant transcripts – adapts the RNA-FM architecture to >25 M plant RNAs; discovers translation-related structural motifs and attains F1 = 0.97 on genic-region annotation (PAPER, CODE)

RNA–Protein Interaction

ZHMolGraph: network-guided deep learning for RNA–protein interaction prediction – combines RNA-FM (for RNAs) and ProtTrans (for proteins) embeddings within a GNN, boosting AUROC by up to 28% on unseen RNA–protein pairs (PAPER, CODE)

Take-away: Across structure prediction, de novo sequence design, functional annotation and interaction modelling, the community is steadily adopting RNA-FM and its RhoFold/RiboDiffusion/RhoDesign toolkit as reliable building blocks, demonstrating the ecosystem’s versatility and real-world impact.

Setup and Usage

Setup Environment with Conda

Below, we outline how to set up the environment for RNA-FM and its extended pipeline (e.g., RhoFold) locally.
(If you prefer not to install locally, refer to the Online Server section below.)

Clone the repository and create the Conda environment:

git clone https://github.com/ml4bio/RNA-FM.git
cd RNA-FM
conda env create -f environment.yml

Activate and enter the workspace:

conda activate RNA-FM
cd ./redevelop

Download pre-trained models from our Hugging Face repo and place the .pth files into the pretrained folder.
For mRNA-FM, ensure that your input RNA sequences have lengths that are multiples of 3 (whole codons), and place the mRNA-FM weights in the same pretrained folder.

Quick Start Usage

Once the environment is ready and weights are downloaded, you can perform common tasks as follows:

1. Embedding Generation

Use RNA-FM to extract nucleotide-level embeddings for input sequences:

python launch/predict.py \
    --config="pretrained/extract_embedding.yml" \
    --data_path="./data/examples/example.fasta" \
    --save_dir="./results" \
    --save_frequency 1 \
    --save_embeddings

This command processes the sequences in example.fasta and saves 640-dimensional embeddings per nucleotide to ./results/representations/.
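Tasks such as family clustering usually need one fixed-length vector per sequence rather than one vector per nucleotide. A common way to get there is mean pooling. The sketch below shows that step on a toy nested list; it makes no assumption about the on-disk format of the saved files, and the function name `mean_pool` is hypothetical.

```python
def mean_pool(per_nucleotide):
    """Average an L x D list of per-nucleotide vectors into one D-vector."""
    if not per_nucleotide:
        raise ValueError("need at least one nucleotide embedding")
    length = len(per_nucleotide)
    dim = len(per_nucleotide[0])
    return [sum(vec[d] for vec in per_nucleotide) / length for d in range(dim)]


# Toy stand-in: 3 nucleotides with 4-dimensional embeddings (RNA-FM uses 640).
toy = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = mean_pool(toy)
print(pooled)
```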

Using mRNA-FM: To use the mRNA-FM variant instead of the default ncRNA model, add the model name argument and ensure input sequences are codon-aligned:
```
python launch/predict.py \
    --config="pretrained/extract_embedding.yml" \
    --data_path="./data/examples/example.fasta" \
    --save_dir="./results" \
    --save_frequency 1 \
    --save_embeddings \
    --save_embeddings_format raw \
    MODEL.BACKBONE_NAME mrna-fm
```
The extra argument MODEL.BACKBONE_NAME selects the mRNA-FM backbone. Remember that mRNA-FM uses codon tokenization, so each sequence must have a length divisible by 3.
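Because of the divisible-by-3 requirement, it can help to check coding sequences before running the model. The helper below is a minimal sketch of such a check. `split_codons` is a hypothetical name, not part of the `fm` package, and it only illustrates the constraint rather than reproducing the model's own tokenizer.

```python
def split_codons(seq):
    """Split a coding sequence into codons, rejecting lengths not divisible by 3."""
    seq = seq.strip().upper()
    if len(seq) % 3 != 0:
        raise ValueError(
            f"length {len(seq)} is not a multiple of 3; check the CDS boundaries"
        )
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


codons = split_codons("AUGGGGUGCUAA")
print(codons)
```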

2. RNA Secondary Structure Prediction

Predict an RNA secondary structure (base-pairing) from sequence using RNA-FM:

python launch/predict.py \
    --config="pretrained/ss_prediction.yml" \
    --data_path="./data/examples/example.fasta" \
    --save_dir="./results" \
    --save_frequency 1

RNA-FM will output base-pair probability matrices (.npy) and secondary structures (.ct) to ./results/r-ss.
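For a quick look at a predicted structure, a .ct file can be converted to dot-bracket notation. The sketch below assumes the files follow the widely used connectivity-table layout: a header line starting with the sequence length, then one line per nucleotide with the pairing partner in the fifth column and 0 meaning unpaired. Whether the RNA-FM output matches this layout exactly has not been verified here, and the toy record is made up.

```python
def ct_to_dot_bracket(ct_text):
    """Convert the text of a CT file into (sequence, dot-bracket string).

    Only one bracket type is used, so pseudoknotted (crossing) pairs
    would be mis-nested when the string is read back.
    """
    lines = [ln.split() for ln in ct_text.strip().splitlines() if ln.strip()]
    length = int(lines[0][0])
    sequence, structure = [], []
    for fields in lines[1:length + 1]:
        index, base, partner = int(fields[0]), fields[1], int(fields[4])
        sequence.append(base)
        if partner == 0:
            structure.append(".")
        elif partner > index:
            structure.append("(")
        else:
            structure.append(")")
    return "".join(sequence), "".join(structure)


example_ct = """\
6 toy_hairpin
1 G 0 2 6 1
2 C 1 3 5 2
3 A 2 4 0 3
4 A 3 5 0 4
5 G 4 6 2 5
6 C 5 0 1 6
"""
seq, db = ct_to_dot_bracket(example_ct)
print(seq)
print(db)
```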

Online Server

If you prefer not to install anything locally, you can use our RNA-FM server. The server provides a simple web interface where you can:

Submit an RNA sequence to get its predicted secondary structure and/or embeddings.
Obtain results without needing local compute resources or setup.

(A separate RhoFold server is also available for tertiary structure prediction of single RNA sequences.)

Further Development & Python API

Tutorials

If you only want to use the pretrained model (rather than run all pipeline scripts), you can install RNA-FM directly:

pip install rna-fm

Alternatively, for the latest version from GitHub:

cd ./RNA-FM
pip install .

RNA-FM

Then, load RNA-FM within your own Python project:

import torch
import fm

# 1. Load RNA-FM model
model, alphabet = fm.pretrained.rna_fm_t12()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# 2. Prepare data
data = [
    ("RNA1", "GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCU"),
    ("RNA2", "GGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
    ("RNA3", "CGAUUCNCGUUCCC--CCGCCUCCA"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# 3. Extract embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[12])
token_embeddings = results["representations"][12]
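In practice the `data` list of `(name, sequence)` tuples is usually built from a FASTA file rather than typed by hand. The reader below is a minimal standard-library sketch. `read_fasta` is a hypothetical helper, not part of the `fm` package, and it uses the first word of each header line as the record name.

```python
import io


def read_fasta(handle):
    """Parse FASTA text from a file-like object into (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first word of the header as the record name.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


fasta_text = """\
>RNA1 example
GGGUGCGAUCAUACC
AGCACUAAUGCC
>RNA2
CGAUUCNCGUUCCC
"""
fasta_records = read_fasta(io.StringIO(fasta_text))
print(fasta_records)
```

For a file on disk, the same function can be called as `read_fasta(open(path))`, and the returned list has the shape that the batch converter in the snippet above takes as input.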

mRNA-FM

For mRNA-FM, load with fm.pretrained.mrna_fm_t12() and ensure input sequences are codon-aligned (as shown in the Quick Start above).

import torch
import fm

# 1. Load mRNA-FM model
model, alphabet = fm.pretrained.mrna_fm_t12()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# 2. Prepare data
data = [
    ("CDS1", "AUGGGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCUA"),
    ("CDS2", "AUGGGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
    ("CDS3", "AUGCGAUUCNCGUUCCC--CCGCCUCC"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# 3. Extract embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[12])
token_embeddings = results["representations"][12]

More tutorials can be found on GitHub. The related notebooks are stored in the tutorials folder.

Notebooks

Get started with RNA-FM through our comprehensive tutorials:

Tutorial	Description	Format
RNA Family Clustering & Type Classification	How to extract RNA-FM embeddings for clustering RNA families and classifying RNA types. This tutorial covers visualization of embeddings and training simple classifiers on top of them.	Jupyter Notebook
RNA Secondary Structure Prediction	How to use RNA-FM to predict RNA secondary structures, output base-pairing probability matrices, and visualize the predicted base-pairing (secondary structure).	Python Script
UTR Function Prediction	How to leverage RNA-FM embeddings to predict functional properties of untranslated regions (5′ and 3′ UTRs) in mRNAs. This includes training a model to predict gene expression or protein translation metrics from UTR sequences.	Jupyter Notebook
mRNA Expression Prediction	How to use mRNA-FM variant to predict gene expression levels from mRNA sequences. This tutorial demonstrates loading the specialized mRNA model, extracting embeddings, and building a classifier to differentiate between high and low expression genes.	Jupyter Notebook

Additional Resources:

Video Tutorial (Chinese) - Step-by-step guide to using RNA-FM for various RNA analysis tasks

These tutorials cover the core applications of RNA-FM from basic embedding extraction to advanced functional predictions. Each provides hands-on examples you can run immediately in your browser or local environment.

Usage Examples with the Ecosystem

We recommend exploring the advanced RhoFold, RiboDiffusion, and RhoDesign projects for tasks like 3D structure prediction or RNA design. Below are brief usage samples:

RhoFold (Sequence → Structure)

# Example: Predict 3D structure for an RNA sequence in FASTA.
cd RhoFold
python inference.py \
    --input_fas ./example/input/5t5a.fasta \
    --output_dir ./example/output/5t5a/ \
    --ckpt ./pretrained/RhoFold_pretrained.pt

Outputs:

unrelaxed_model.pdb / relaxed_1000_model.pdb (3D coordinates)
ss.ct (secondary structure)
results.npz (distance/angle predictions + confidence scores)
log.txt (run logs, pLDDT, etc.)

RiboDiffusion (Structure → Sequence)

cd RiboDiffusion
CUDA_VISIBLE_DEVICES=0 python main.py \
    --PDB_file examples/R1107.pdb \
    --config.eval.n_samples 5

This will generate 5 candidate RNA sequences that fold into the structure provided in R1107.pdb. The output FASTA files will be saved under the exp_inf/fasta/ directory.

RhoDesign (Structure → Sequence)

cd RhoDesign
python src/inference.py \
    --pdb ../example/2zh6_B.pdb \
    --ss ../example/2zh6_B.npy \
    --save ../example/

This produces a designed RNA sequence predicted to fold into the target 3D shape (PDB file 2zh6_B.pdb, with an optional secondary structure constraint from 2zh6_B.npy). The output sequence will be saved in the specified folder. You can adjust parameters like the sampling temperature to explore more diverse or high-fidelity designs.
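The effect of a sampling temperature can be illustrated with a temperature-scaled softmax over per-position nucleotide scores. This is a generic sketch of the idea, not RhoDesign's implementation: the logits are made up, and `temperature_softmax` and `sample_base` are hypothetical names.

```python
import math
import random

BASES = "ACGU"


def temperature_softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_base(logits, temperature=1.0, rng=random):
    """Draw one nucleotide according to the temperature-scaled distribution."""
    probs = temperature_softmax(logits, temperature)
    return rng.choices(BASES, weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0, 0.0]  # made-up scores for A, C, G, U at one position
for t in (0.1, 1.0, 5.0):
    print(t, [round(p, 3) for p in temperature_softmax(logits, t)])
print(sample_base(logits, temperature=1.0, rng=random.Random(0)))
```

Low temperatures concentrate probability on the top-scoring base (higher fidelity), while high temperatures flatten the distribution and give more diverse samples.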

API Reference

Each project in the RNA-FM ecosystem comes with both command-line interfaces and Python modules:

RNA-FM: Core module fm for embedding extraction and secondary structure prediction.
- fm.pretrained.rna_fm_t12() – load the 12-layer ncRNA model
- fm.pretrained.mrna_fm_t12() – load the 12-layer mRNA (codon) model
RhoFold: Use the RhoFoldModel class or the inference.py script.
- inference.py takes a FASTA sequence (and optionally an MSA) and outputs a 3D structure.
- Add --single_seq_pred True to run without an MSA (single-sequence mode).
RiboDiffusion: Use the main.py script or import the diffusion model classes.
- main.py takes a PDB structure as input and outputs designed sequences.
- Modify settings in configs/ (e.g., cond_scale, n_samples) to tune the generation.
RhoDesign: Use the inference.py script or import the design model module.
- inference.py takes a PDB (and optional secondary structure/contact map) and outputs a designed sequence.
- The GVP+Transformer architecture can incorporate partial structure constraints and supports advanced sampling strategies.

For further details, see each repo’s documentation or the notebooks in the tutorials folder.

Related RNA Language Models

Name	Dataset	Modality	Tokenization	Architecture	Backbone	Pre‑training Task	Layers	Model Params	Data Size	Code	Weights	Data	License
RNA‑FM	ncRNA	Sequence	Base	Enc‑only	Transformer	MLM	12	100 M	23 M	GitHub	HuggingFace	RNAcentral	MIT
RNABERT	ncRNA	Sequence	Base	Enc‑only	Transformer	MLM / SAL	6	0.5 M	0.76 M	GitHub	Drive	Rfam 14.3	MIT
RNA‑MSM	ncRNA	Seq + MSA	Base	Enc‑only	MSA‑Transformer	MLM	12	95 M	3932 families	GitHub	Drive	Rfam 14.7	MIT
AIDO.RNA	ncRNA	Sequence	Base	Enc‑only	Transformer	MLM	32	1.6 B	42 M	GitHub	HuggingFace	Public ncRNA mix	Apache‑2.0
ERNIE‑RNA	ncRNA	Sequence	Base	Enc‑only	Transformer	MLM	12	86 M	20.4 M	GitHub	GitHub	Rfam + RNAcentral	MIT
GenerRNA	ncRNA	Sequence	BPE	Dec‑only	Transformer	CLM	24	350 M	16.09 M	GitHub	HuggingFace	Public ncRNA mix	Apache‑2.0
RFamLlama	ncRNA	Sequence	Base	Dec‑only	Llama	CLM	6‑10	13‑88 M	0.6 M	HuggingFace	HuggingFace	Rfam 14.10	CC BY‑NC‑4.0
RNA‑km	ncRNA	Sequence	Base	Enc‑only	Transformer	MLM	12	152 M	23 M	GitHub	Drive	Rfam + RNAcentral	MIT
RNAErnie	ncRNA	Sequence	Base	Enc‑only	Transformer	MLM	12	105 M	23 M	GitHub	GitHub	Public ncRNA mix	Apache‑2.0
OPED	pegRNA	Sequence	k‑mer	Enc‑Dec	Transformer	Regression	n/a	n/a	40 k	GitHub	—	Public pegRNA eff.	MIT
GARNET	rRNA	Sequence	k‑mer	Dec‑only	Transformer	CLM	18	19 M	89 M tokens	GitHub	Release	Public rRNA	MIT
IsoCLR	pre‑mRNA	Sequence	One‑hot	Enc‑only	CNN	Contrast Learning	8	1‑10 M	1 M	GitHub	—	Ensembl / RefSeq	—
SpliceBERT	pre‑mRNA	Sequence	Base	Enc‑only	Transformer	MLM	6	20 M	2 M	GitHub	Zenodo	UCSC/GENCODE	MIT
Orthrus	pre‑mRNA	Sequence	Base	Enc‑only	Mamba	Contrast Learning	3‑6	1‑10 M	49 M	GitHub	HuggingFace	Ortholog set	Apache‑2.0
LoRNA	pre‑mRNA	Sequence	Base	Dec‑only	StripedHyena	Contrast Learning	16	6.5 M	100 M	GitHub	(announced)	SRA (long‑read)	MIT
CodonBERT	mRNA CDS	Sequence	Codon	Enc‑only	Transformer	MLM / HSP	12	87 M	10 M	GitHub	HuggingFace	NCBI mRNA	Apache‑2.0
UTR‑LM	5′UTR	Sequence	Base	Enc‑only	Transformer	MLM / SSP / MFE	6	1 M	0.7 M	GitHub	GitHub	Public 5′UTR set	MIT
3UTRBERT	3′UTR	Sequence	k‑mer	Enc‑only	Transformer	MLM	12	86 M	20 k	GitHub	HuggingFace	Public 3′UTR	MIT
G4mer	mRNA	Sequence	k‑mer	Enc‑only	Transformer	MLM	6	—	—	—	—	—	—
HELM	mRNA	Sequence	Codon	Multi	Multi	MLM + CLM	—	50 M	15.3 M	—	—	—	—
RiNALMo	RNA	Sequence	Base	Enc‑only	Transformer	MLM	33	135‑650 M	36 M	GitHub	(request)	Public ncRNA	MIT
UNI‑RNA	RNA	Sequence	Base	Enc‑only	Transformer	MLM	24	400 M	500 M	—	—	—	—
ATOM‑1	RNA	Sequence	Base	Enc‑Dec	Transformer	—	—	—	—	—	—	—	—
BiRNA‑BERT	RNA	Sequence	Base + BPE	Enc‑only	Transformer	MLM	12	117 M	36 M	GitHub	HuggingFace	Public ncRNA	MIT
ChaRNABERT	RNA	Sequence	GBST	Enc‑only	Transformer	MLM	6‑33	8‑650 M	62 M	—	(8 M demo)	Public ncRNA	—
DGRNA	RNA	Sequence	Base	Enc‑only	Mamba	MLM	12	100 M	100 M	—	—	—	—
LAMAR	RNA	Sequence	Base	Enc‑only	Transformer	MLM	12	150 M	15 M	GitHub	(announced)	Public ncRNA	MIT
OmniGenome	RNA	Sequence, Structure	Base	Enc‑only	Transformer	MLM / Seq2Str / Str2Seq	16‑32	52‑186 M	25 M	GitHub	HuggingFace	Public multi‑omics	Apache‑2.0
PlantRNA‑FM	RNA	Sequence, Structure	Base	Enc‑only	Transformer	MLM / SSP / CLS	12	35 M	25 M	HuggingFace	HuggingFace	Plant RNA set	CC BY‑NC‑4.0
MP‑RNA	RNA	Sequence, Structure	Base	Enc‑only	Transformer	SSP / SNMR / MRLM	12	52‑186 M	25 M	GitHub	(planned)	Public ncRNA mix	Apache‑2.0

Citations

If you use RNA-FM or any components of this ecosystem in your research, please cite the relevant papers. Below is a collection of key publications (in BibTeX format) covering the foundation model and associated tools:

BibTeX Citations

RNA-FM & RNA Structure Predictions

@article{chen2022interpretable,
  title={Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions},
  author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and Shen, Tao and others},
  journal={arXiv preprint arXiv:2204.00300},
  year={2022}
}

@article{shen2024accurate,
  title={Accurate RNA 3D structure prediction using a language model-based deep learning approach},
  author={Shen, Tao and Hu, Zhihang and Sun, Siqi and Liu, Di and Wong, Felix and Wang, Jiuming and Chen, Jiayang and Wang, Yixuan and Hong, Liang and Xiao, Jin and others},
  journal={Nature Methods},
  pages={1--12},
  year={2024},
  publisher={Nature Publishing Group US New York}
}

@article{chen2020rna,
  title={RNA secondary structure prediction by learning unrolled algorithms},
  author={Chen, Xinshi and Li, Yu and Umarov, Ramzan and Gao, Xin and Song, Le},
  journal={arXiv preprint arXiv:2002.05810},
  year={2020}
}

@article{WANG2025102991,
  title={Deep learning for RNA structure prediction},
  author={Jiuming Wang and Yimin Fan and Liang Hong and Zhihang Hu and Yu Li},
  journal={Current Opinion in Structural Biology},
  year={2025},
  doi={https://doi.org/10.1016/j.sbi.2025.102991},
  url={https://www.sciencedirect.com/science/article/pii/S0959440X25000090}
}

RNA Design & Inverse Folding

@article{wong2024deep,
  title={Deep generative design of RNA aptamers using structural predictions},
  author={Wong, Felix and He, Dongchen and Krishnan, Aarti and Hong, Liang and Wang, Alexander Z and Wang, Jiuming and Hu, Zhihang and Omori, Satotaka and Li, Alicia and Rao, Jiahua and others},
  journal={Nature Computational Science},
  pages={1--11},
  year={2024},
  publisher={Nature Publishing Group US New York}
}

@article{huang2024ribodiffusion,
  title={RiboDiffusion: tertiary structure-based RNA inverse folding with generative diffusion models},
  author={Huang, Han and Lin, Ziqian and He, Dongchen and Hong, Liang and Li, Yu},
  journal={Bioinformatics},
  volume={40},
  number={Supplement\_1},
  pages={i347--i356},
  year={2024},
  publisher={Oxford University Press}
}

RNA-Protein Interaction (RPI)

@article{wei2022protein,
  title={Protein--RNA interaction prediction with deep learning: structure matters},
  author={Wei, Junkang and Chen, Siyuan and Zong, Licheng and Gao, Xin and Li, Yu},
  journal={Briefings in bioinformatics},
  volume={23},
  number={1},
  pages={bbab540},
  year={2022},
  publisher={Oxford University Press}
}

@article{lam2019deep,
  title={A deep learning framework to predict binding preference of RNA constituents on protein surface},
  author={Lam, Jordy Homing and Li, Yu and Zhu, Lizhe and Umarov, Ramzan and Jiang, Hanlun and H{\'e}liou, Am{\'e}lie and Sheong, Fu Kit and Liu, Tianyun and Long, Yongkang and Li, Yunfei and others},
  journal={Nature communications},
  volume={10},
  number={1},
  pages={4941},
  year={2019},
  publisher={Nature Publishing Group UK London}
}

Databases & Resources

@article{wei2024pronet,
  title={ProNet DB: a proteome-wise database for protein surface property representations and RNA-binding profiles},
  author={Wei, Junkang and Xiao, Jin and Chen, Siyuan and Zong, Licheng and Gao, Xin and Li, Yu},
  journal={Database},
  volume={2024},
  pages={baae012},
  year={2024},
  publisher={Oxford University Press UK}
}

Single-Cell RNA Analysis

@article{han2022self,
  title={Self-supervised contrastive learning for integrative single cell RNA-seq data analysis},
  author={Han, Wenkai and Cheng, Yuqi and Chen, Jiayang and Zhong, Huawen and Hu, Zhihang and Chen, Siyuan and Zong, Licheng and Hong, Liang and Chan, Ting-Fung and King, Irwin and others},
  journal={Briefings in Bioinformatics},
  volume={23},
  number={5},
  pages={bbac377},
  year={2022},
  publisher={Oxford University Press}
}

Drug Discovery

@article{fan2022highly,
  title={The highly conserved RNA-binding specificity of nucleocapsid protein facilitates the identification of drugs with broad anti-coronavirus activity},
  author={Fan, Shaorong and Sun, Wenju and Fan, Ligang and Wu, Nan and Sun, Wei and Ma, Haiqian and Chen, Siyuan and Li, Zitong and Li, Yu and Zhang, Jilin and others},
  journal={Computational and Structural Biotechnology Journal},
  volume={20},
  pages={5040--5044},
  year={2022},
  publisher={Elsevier}
}

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Our framework and model training were inspired by:

esm (Facebook’s protein language modeling framework)
fairseq (PyTorch sequence modeling framework)

We thank the authors of these works for providing excellent foundations for RNA-FM.

Thank you for using RNA-FM!
For issues or questions, open a GitHub Issue or consult the documentation. We welcome contributions and collaboration from the community.

About

ml4bio.github.io/RNA-FM/
