Nature Methods: RNA foundation model (together with RhoFold)

ml4bio/RNA-FM

arXiv | Nature Methods | Nature Computational Science | Bioinformatics | RNA-FM Server | RhoFold Server

Introduction

RNA-FM (RNA Foundation Model) is a state-of-the-art pretrained language model for RNA sequences, serving as the cornerstone of an integrated RNA research ecosystem. Trained on 23+ million non-coding RNA (ncRNA) sequences via self-supervised learning, RNA-FM extracts comprehensive structural and functional information from RNA sequences without relying on experimental labels. Consequently, RNA-FM generates general-purpose RNA embeddings suitable for a broad range of downstream tasks, including but not limited to secondary and tertiary structure prediction, RNA family clustering, and functional RNA analysis. mRNA-FM is a direct extension of RNA-FM, trained exclusively on 45 million mRNA coding sequences (CDS). It is specifically designed to capture information unique to mRNA and has demonstrated excellent performance in related tasks.

Originally introduced in Nature Methods as a foundational model for RNA biology, RNA-FM outperforms all evaluated single-sequence RNA language models across a wide range of structure and function benchmarks, enabling unprecedented accuracy in RNA analysis. Building upon this foundation, our team developed an integrated RNA pipeline that includes:

  • RhoFold – High-accuracy RNA tertiary structure prediction (sequence → structure).
  • RiboDiffusion – Diffusion-based inverse folding for RNA 3D design (structure → sequence).
  • RhoDesign – Geometric deep learning approach to RNA design (structure → sequence).

These tools work alongside RNA-FM to predict RNA structures from sequence, design new RNA sequences that fold into desired 3D structures, and analyze functional properties. Our integrated ecosystem is built to advance the development of RNA therapeutics, drive innovation in synthetic biology, and deepen our understanding of RNA structure-function relationships.

References
    @article{chen2022interpretable,
      title={Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions},
      author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
      journal={arXiv preprint arXiv:2204.00300},
      year={2022}
    }

Foundation Models and Extended Ecosystem

RNA-FM Ecosystem Components: Our platform comprises four integrated tools, each addressing a critical step in the RNA analysis and design pipeline:

| Model | Task | Description | Code | Paper |
|---|---|---|---|---|
| RNA-FM | Foundation Model (Representation) | Pretrained transformer (BERT-style) for ncRNA sequences (for RNA-FM) and messenger RNA sequences (for mRNA-FM); extracts embeddings and predicts base-pairing probabilities | GitHub | Nature Methods |
| RhoFold | 3D Structure Prediction | RNA-FM-powered model for sequence-to-structure prediction (3D coordinates + secondary structure) | GitHub | Nature Methods |
| RiboDiffusion | Inverse Folding | Generative diffusion model for structure-to-sequence RNA design | GitHub | ISMB'2024 |
| RhoDesign | Inverse Folding | Geometric deep learning model (GVP+Transformer) for structure-to-sequence design | GitHub | Nature Computational Science |

Foundation Models

| Model | Training Corpus | # Sequences | Layers / Hidden | Params | Typical Use-cases |
|---|---|---|---|---|---|
| RNA-FM | non-coding RNAs | 23.7 M | 12 / 640 | 99 M | ncRNA structure & function, aptamer design |
| mRNA-FM | messenger RNAs | 45 M | 12 / 1280 | 239 M | mRNA expression modelling, CDS analysis |

RNA-FM

  • RNA-FM is a 12-layer BERT encoder pre-trained with masked-token prediction on 23.7 M non-coding RNA sequences (RNAcentral100). It yields 640-d embeddings that already encode secondary structure, 3-D proximity and even evolutionary signals, making it the representation backbone for every downstream tool in the ecosystem.

    • RNA-FM for Secondary Structure Prediction:
      • Outperforms classic physics-based and machine learning methods (e.g., LinearFold, SPOT-RNA, UFold) by up to 20–30% in F1-score on challenging datasets.
      • Performance gains are especially notable for long RNAs (>150 nucleotides) and low-homology families.
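
To make the masked-token pre-training objective mentioned above concrete, here is a minimal sketch of how one training pair might be built from a single RNA sequence. The 15% masking rate, the `<mask>` token string and the helper name `mask_sequence` are assumptions borrowed from common BERT practice; this README does not state the values RNA-FM actually used.

```python
import random

MASK_TOKEN = "<mask>"  # assumed token name, for illustration only


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked-token prediction.

    labels[i] holds the original nucleotide at masked positions and None
    elsewhere, so a loss would only be computed on the masked tokens.
    """
    rng = random.Random(seed)
    tokens = list(seq)
    # Mask at least one token, but never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))

    corrupted, labels = [], []
    for i, nt in enumerate(tokens):
        if i in masked_positions:
            corrupted.append(MASK_TOKEN)
            labels.append(nt)
        else:
            corrupted.append(nt)
            labels.append(None)
    return corrupted, labels


corrupted, labels = mask_sequence("GGGUGCGAUCAUACCAGCACUAAUGCCC")
# Show masked positions as underscores for readability.
print("".join("_" if tok == MASK_TOKEN else tok for tok in corrupted))
```

The sketch always substitutes the mask token; full BERT-style recipes often also leave some selected tokens unchanged or swap in random ones, which is omitted here for brevity.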

mRNA-FM

  • mRNA-FM, an extension of RNA-FM, is exclusively trained on 45 million mRNA coding sequences (CDS). Purpose-built to model mRNA-specific features, it achieves state-of-the-art performance in mRNA-related tasks.

Downstream Tools

RhoFold (Tertiary Structure Prediction)

  • RhoFold (Tertiary Structure Prediction) – An RNA-FM–powered predictor for RNA 3D structures. Given an RNA sequence, RhoFold rapidly predicts its tertiary structure (3D coordinates in PDB format) along with the secondary structure (CT file) and per-residue confidence scores. It achieves high accuracy on RNA 3D benchmarks by combining RNA-FM embeddings with a structure prediction network, significantly outperforming prior methods in the RNA-Puzzles challenge.

    RhoFold leverages the powerful embeddings from RNA-FM to revolutionize RNA tertiary structure prediction. By combining deep learning with structural biology principles, RhoFold translates RNA sequences directly into accurate 3D coordinates. The model employs a multi-stage architecture that first converts RNA-FM's contextual representations into distance maps and torsion angles, then assembles these into complete three-dimensional structures. Unlike previous approaches that often struggle with RNA's complex folding landscapes, RhoFold's foundation model approach captures subtle sequence-structure relationships, enabling state-of-the-art performance on challenging benchmarks like RNA-Puzzles. The system works in both single-sequence mode for rapid predictions and can incorporate multiple sequence alignments (MSA) when higher accuracy is needed, making it versatile for various research applications from small RNAs to complex ribozymes and riboswitches.

    • RhoFold for Tertiary Structure:
      • Delivers top accuracy on RNA-Puzzles / CASP-type tasks.
      • Predicts 3D structures within seconds (single-sequence mode) and integrates MSA for further accuracy gains.
      • Benchmarked in the Nature Methods publication, with generalization to novel RNA families.

RiboDiffusion (Inverse Folding – Diffusion)

  • RiboDiffusion (Inverse Folding – Diffusion) – A diffusion-based inverse folding model for RNA design. Starting from a target 3D backbone structure, RiboDiffusion iteratively generates RNA sequences that fold into that shape. This generative approach yields higher sequence recovery (≈11–16% improvement) than previous inverse folding algorithms, while offering tunable diversity in the designed sequences.

    RiboDiffusion represents a breakthrough in RNA inverse folding through diffusion-based generative modeling. While traditional RNA design methods often struggle with the vast sequence space, RiboDiffusion employs a novel approach inspired by recent advances in generative AI. Starting with random noise, the model iteratively refines RNA sequences to conform to target 3D backbones through a carefully controlled diffusion process. This approach allows RiboDiffusion to explore diverse sequence solutions while maintaining structural fidelity, a critical balance in biomolecular design. The diffusion framework inherently provides sequence diversity, enabling researchers to generate and test multiple candidate designs that all satisfy structural constraints. Published benchmarks demonstrate that RiboDiffusion achieves superior sequence recovery rates compared to previous methods, making it particularly valuable for designing functional RNAs like riboswitches, aptamers, and other structured elements where sequence-structure relationships are crucial.

    • RiboDiffusion for Inverse Folding:
      • A diffusion-based generative approach that surpasses prior methods by ~11–16% in sequence recovery rate.
      • Provides tunable diversity in design, exploring multiple valid sequences for a single target shape.
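
Sequence recovery, the metric quoted above, is commonly defined as the fraction of positions where a designed sequence matches the native one. The sketch below follows that common definition; the exact evaluation protocol of the RiboDiffusion paper is not described in this README, and the function name and toy sequences are illustrative only.

```python
def sequence_recovery(native, designed):
    """Fraction of positions where the designed base equals the native base."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native.upper(), designed.upper()))
    return matches / len(native)


native = "GGGAAACCC"  # toy native sequence
designs = ["GGGAAACCC", "GGGUAACCC", "GCGAUACGC"]  # toy candidate designs
for d in designs:
    print(d, f"{sequence_recovery(native, d):.2f}")
```

Averaging this value over several sampled designs for the same backbone gives one number per target, which is how recovery rates are usually compared across methods.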

RhoDesign (Inverse Folding – Deterministic)

  • RhoDesign (Inverse Folding – Deterministic) – A deterministic geometric deep learning model for RNA design. RhoDesign uses graph neural networks (GVP) and Transformers to directly decode sequences for a given 3D structure (optionally incorporating secondary structure constraints). It achieves state-of-the-art accuracy in matching target structures, with sequence recovery rates exceeding 50% on standard benchmarks (nearly double traditional methods) and the highest structural fidelity (TM-scores) among current solutions.

    RhoDesign introduces a deterministic approach to RNA inverse folding using geometric deep learning. Unlike diffusion-based methods, RhoDesign directly translates 3D structural information into RNA sequences through a specialized architecture combining Graph Vector Perceptrons (GVP) and Transformer networks. This architecture effectively captures both local geometric constraints and global structural patterns in RNA backbones. RhoDesign can incorporate optional secondary structure constraints, allowing researchers to specify certain base-pairing patterns while letting the model optimize the remaining sequence. Benchmark tests demonstrate that RhoDesign achieves remarkable sequence recovery rates exceeding 50% on standard datasets—nearly double the performance of traditional methods. Moreover, the designed sequences exhibit the highest structural fidelity (as measured by TM-score) among current approaches. This combination of accuracy and efficiency makes RhoDesign particularly suitable for precision RNA engineering applications where structural integrity is paramount.

    • RhoDesign for Inverse Folding:
      • A deterministic GVP + Transformer model with >50% sequence recovery on standard 3D design benchmarks, nearly double that of older algorithms.
      • Achieves highest structural fidelity (TM-score) among tested methods, validated in Nature Computational Science.

Unified Workflow: These tools operate in concert to enable end-to-end RNA engineering. For any RNA sequence of interest, one can predict its structure (secondary and tertiary) using RNA-FM and RhoFold. Conversely, given a desired RNA structure, one can design candidate sequences using RiboDiffusion or RhoDesign (or both for cross-validation). Designed sequences can then be validated by folding them back with RhoFold, closing the loop. This forward-and-inverse design cycle, all powered by RNA-FM embeddings, creates a powerful closed-loop workflow for exploring RNA structure-function space. By seamlessly integrating prediction and design, the RNA-FM ecosystem accelerates the design-build-test paradigm in RNA science, laying the groundwork for breakthroughs in RNA therapeutics, synthetic biology constructs, and our understanding of RNA biology.
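
The design-then-validate loop described above can be sketched as a small driver that is independent of any particular model. Everything here is hypothetical scaffolding: `design_fn`, `fold_fn` and `score_fn` stand in for wrappers you would write around an inverse-folding tool, a structure predictor and a structural similarity measure, and the toy stand-ins use dot-bracket strings only so the example runs without any model installed.

```python
def closed_loop_design(target, design_fn, fold_fn, score_fn,
                       n_candidates=5, min_score=0.5):
    """Design candidates for a target, fold each back, keep the good ones.

    design_fn(target, n)   -> list of candidate sequences
    fold_fn(sequence)      -> predicted structure
    score_fn(pred, target) -> float, higher means closer to the target
    """
    accepted = []
    for seq in design_fn(target, n_candidates):
        predicted = fold_fn(seq)
        score = score_fn(predicted, target)
        if score >= min_score:
            accepted.append((score, seq))
    # Best-scoring designs first.
    return sorted(accepted, reverse=True)


# Toy stand-ins (not real models): a lookup table of "predicted" folds.
toy_folds = {
    "GGGAAACCC": "(((...)))",
    "GGAAAAACC": "((.....))",
    "AAAAAAAAA": ".........",
}


def toy_design(target, n):
    return list(toy_folds)[:n]


def toy_fold(seq):
    return toy_folds[seq]


def toy_score(pred, target):
    # Fraction of positions with the same dot-bracket symbol.
    return sum(p == t for p, t in zip(pred, target)) / len(target)


ranked = closed_loop_design("(((...)))", toy_design, toy_fold, toy_score,
                            min_score=0.7)
for score, seq in ranked:
    print(f"{score:.2f} {seq}")
```

Passing the models in as callables keeps the loop itself testable and lets the same driver compare designs from more than one inverse-folding tool.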


Applications

RNA 3D Structure Prediction

  • Accurate RNA 3D structure prediction using a language-model–based deep learning approach – introduces RhoFold+, which couples RNA-FM embeddings with a geometry module to reach SOTA accuracy on CASP/RNA-Puzzles benchmarks (PAPER, CODE)
  • NuFold: end-to-end RNA tertiary-structure prediction – integrates RNA-FM features into a U-former backbone, achieving accuracy competitive with state-of-the-art fold predictors (PAPER, CODE)
  • TorRNA – improved backbone-torsion prediction by leveraging large language models – uses RNA-FM as sequence encoder and cuts median torsion-angle error by 2–16% versus previous methods (PAPER)

RNA Design & Inverse Folding

  • Deep generative design of RNA aptamers using structural predictions – employs RhoDesign to create Mango aptamer variants with >3-fold fluorescence gain (wet-lab verified) (PAPER, CODE)
  • RiboDiffusion: tertiary-structure-based RNA inverse folding with generative diffusion models – diffusion sampler trained on RhoFold-generated data; boosts native-sequence recovery by 11–16% over secondary-structure baselines (PAPER, CODE)
  • gRNAde: geometric deep learning for 3-D RNA inverse design – validates every design by forward-folding with RhoFold, achieving 56% native-sequence recovery vs 45% for Rosetta (PAPER, CODE)
  • RILLIE framework – integrates a 1.6 B-parameter RNA LM with RhoDesign for in-silico directed evolution of Broccoli/Pepper aptamers (CODE)

Functional Annotation & Subcellular Localisation

  • RNALoc-LM: RNA subcellular localisation prediction with a pre-trained RNA language model – replaces one-hot inputs with RNA-FM embeddings, raising MCC by 4–8% for lncRNA, circRNA and miRNA localisation (PAPER, CODE)
  • PlantRNA-FM: an interpretable RNA foundation model for plant transcripts – adapts the RNA-FM architecture to >25 M plant RNAs; discovers translation-related structural motifs and attains F1 = 0.97 on genic-region annotation (PAPER, CODE)

RNA–Protein Interaction

  • ZHMolGraph: network-guided deep learning for RNA–protein interaction prediction – combines RNA-FM (for RNAs) and ProtTrans (for proteins) embeddings within a GNN, boosting AUROC by up to 28% on unseen RNA–protein pairs (PAPER, CODE)

Take-away: Across structure prediction, de novo sequence design, functional annotation and interaction modelling, the community is steadily adopting RNA-FM and its RhoFold/RiboDiffusion/RhoDesign toolkit as reliable building blocks, demonstrating the ecosystem's versatility and real-world impact.


Setup and Usage

Setup Environment with Conda

Below, we outline the local environment setup for RNA-FM and its extended pipeline (e.g., RhoFold).
(If you prefer not to install locally, refer to the Online Server section.)

  1. Clone the repository and create the Conda environment:

         git clone https://github.com/ml4bio/RNA-FM.git
         cd RNA-FM
         conda env create -f environment.yml

  2. Activate the environment and enter the workspace:

         conda activate RNA-FM
         cd ./redevelop

  3. Download the pre-trained models from our Hugging Face repo and place the .pth files into the pretrained folder.

     For mRNA-FM, ensure that your input RNA sequences have lengths that are a multiple of 3 (codons), and place the mRNA-FM weights in the same pretrained folder.

Quick Start Usage

Once the environment is ready and weights are downloaded, you can perform common tasks as follows:

1. Embedding Generation

Use RNA-FM to extract nucleotide-level embeddings for input sequences:

    python launch/predict.py \
        --config="pretrained/extract_embedding.yml" \
        --data_path="./data/examples/example.fasta" \
        --save_dir="./results" \
        --save_frequency 1 \
        --save_embeddings

This command processes sequences in example.fasta and saves 640-dimensional embeddings per nucleotide to ./results/representations/.
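
To run the same command on your own sequences, the file passed to `--data_path` needs to be FASTA. A minimal sketch for writing and re-reading such a file with the standard library follows; the helper names and the file name `my_sequences.fasta` are hypothetical, and the sketch assumes plain FASTA (a `>name` header line followed by sequence lines).

```python
import tempfile
from pathlib import Path


def write_fasta(records, path):
    """Write (name, sequence) pairs as FASTA, one header and one sequence line each."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.append(seq)
    Path(path).write_text("\n".join(lines) + "\n")


def read_fasta(path):
    """Parse a FASTA file into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


with tempfile.TemporaryDirectory() as tmp:
    fasta_path = Path(tmp) / "my_sequences.fasta"  # hypothetical file name
    write_fasta([("RNA1", "GGGUGCGAUCAUACC"), ("RNA2", "CGAUUCGCGUUCCC")], fasta_path)
    print(read_fasta(fasta_path))
```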

  • Using mRNA-FM: To use the mRNA-FM variant instead of the default ncRNA model, add the model name argument and ensure input sequences are codon-aligned:

        python launch/predict.py \
            --config="pretrained/extract_embedding.yml" \
            --data_path="./data/examples/example.fasta" \
            --save_dir="./results" \
            --save_frequency 1 \
            --save_embeddings \
            --save_embeddings_format raw \
            MODEL.BACKBONE_NAME mrna-fm

    The extra argument MODEL.BACKBONE_NAME selects the mRNA-FM backbone. Remember that mRNA-FM uses codon tokenization, so each sequence must have a length divisible by 3.
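
Because mRNA-FM expects codon-aligned input, it can help to check sequences before launching a run. The helper below is a hypothetical pre-flight check, not part of the `fm` package, and the tokenizer inside mRNA-FM may handle codons differently; it only illustrates the divisible-by-3 requirement.

```python
def split_codons(seq):
    """Split a coding sequence into codons, rejecting non-codon-aligned input."""
    if len(seq) % 3 != 0:
        raise ValueError(f"length {len(seq)} is not a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


print(split_codons("AUGGGGUGCGAUCAU"))
```

Calling `split_codons` on each record of a FASTA file and catching `ValueError` is one simple way to list the sequences that need trimming or padding first.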

2. RNA Secondary Structure Prediction

Predict an RNA secondary structure (base-pairing) from sequence using RNA-FM:

    python launch/predict.py \
        --config="pretrained/ss_prediction.yml" \
        --data_path="./data/examples/example.fasta" \
        --save_dir="./results" \
        --save_frequency 1

RNA-FM will output base-pair probability matrices (.npy) and secondary structures (.ct) to ./results/r-ss.
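
If you want to post-process a base-pair probability matrix yourself, one simple option is greedy thresholding. The sketch below is an assumption-laden illustration, not the procedure RNA-FM uses to produce its .ct files: the 0.5 threshold, the greedy non-crossing selection and both function names are choices made for this example. It accepts any square, indexable matrix, so an array loaded with `numpy.load` could be passed directly.

```python
def pairs_from_probabilities(prob, threshold=0.5):
    """Greedily pick non-crossing base pairs (0-based) from a probability matrix."""
    n = len(prob)
    candidates = [
        (prob[i][j], i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if prob[i][j] >= threshold
    ]
    candidates.sort(reverse=True)  # most confident pairs first

    pairs, used = [], set()
    for _, i, j in candidates:
        if i in used or j in used:
            continue  # each base pairs at most once
        if any(a < i < b < j or i < a < j < b for a, b in pairs):
            continue  # skip pairs that would cross an accepted pair
        pairs.append((i, j))
        used.update((i, j))
    return sorted(pairs)


def to_dot_bracket(pairs, length):
    """Render 0-based, non-crossing pairs as a dot-bracket string."""
    symbols = ["."] * length
    for i, j in pairs:
        symbols[i], symbols[j] = "(", ")"
    return "".join(symbols)


# Toy symmetric matrix for a 6-nt hairpin (values are made up).
n = 6
prob = [[0.0] * n for _ in range(n)]
for (i, j), p in {(0, 5): 0.9, (1, 4): 0.8, (0, 4): 0.6}.items():
    prob[i][j] = prob[j][i] = p

pairs = pairs_from_probabilities(prob)
print(pairs, to_dot_bracket(pairs, n))
```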

Online Server

If you prefer not to install anything locally, you can use our RNA-FM server. The server provides a simple web interface where you can:

  • Submit an RNA sequence to get its predicted secondary structure and/or embeddings.
  • Obtain results without needing local compute resources or setup.

(A separate RhoFold server is also available for tertiary structure prediction of single RNA sequences.)

Further Development & Python API

Tutorials

If you only want to use the pretrained model (rather than run all pipeline scripts), you can install RNA-FM directly:

    pip install rna-fm

Alternatively, for the latest version from GitHub:

    cd ./RNA-FM
    pip install .

RNA-FM

Then, load RNA-FM within your own Python project:

    import torch
    import fm

    # 1. Load RNA-FM model
    model, alphabet = fm.pretrained.rna_fm_t12()
    batch_converter = alphabet.get_batch_converter()
    model.eval()  # disables dropout for deterministic results

    # 2. Prepare data
    data = [
        ("RNA1", "GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCU"),
        ("RNA2", "GGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
        ("RNA3", "CGAUUCNCGUUCCC--CCGCCUCCA"),
    ]
    batch_labels, batch_strs, batch_tokens = batch_converter(data)

    # 3. Extract embeddings (on CPU)
    with torch.no_grad():
        results = model(batch_tokens, repr_layers=[12])
    token_embeddings = results["representations"][12]

mRNA-FM

For mRNA-FM, load with fm.pretrained.mrna_fm_t12() and ensure input sequences are codon-aligned (as shown in the Quick Start above).

    import torch
    import fm

    # 1. Load mRNA-FM model
    model, alphabet = fm.pretrained.mrna_fm_t12()
    batch_converter = alphabet.get_batch_converter()
    model.eval()  # disables dropout for deterministic results

    # 2. Prepare data
    data = [
        ("CDS1", "AUGGGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCUA"),
        ("CDS2", "AUGGGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
        ("CDS3", "AUGCGAUUCNCGUUCCC--CCGCCUCC"),
    ]
    batch_labels, batch_strs, batch_tokens = batch_converter(data)

    # 3. Extract embeddings (on CPU)
    with torch.no_grad():
        results = model(batch_tokens, repr_layers=[12])
    token_embeddings = results["representations"][12]

More tutorials can be found on GitHub. The related notebooks are stored in the tutorials folder.

Notebooks

Get started with RNA-FM through our comprehensive tutorials:

| Tutorial | Description | Format |
|---|---|---|
| RNA Family Clustering & Type Classification (Open in Colab) | How to extract RNA-FM embeddings for clustering RNA families and classifying RNA types. This tutorial covers visualization of embeddings and training simple classifiers on top of them. | Jupyter Notebook |
| RNA Secondary Structure Prediction | How to use RNA-FM to predict RNA secondary structures, output base-pairing probability matrices, and visualize the predicted base-pairing (secondary structure). | Python Script |
| UTR Function Prediction (Open in Colab) | How to leverage RNA-FM embeddings to predict functional properties of untranslated regions (5′ and 3′ UTRs) in mRNAs. This includes training a model to predict gene expression or protein translation metrics from UTR sequences. | Jupyter Notebook |
| mRNA Expression Prediction (Open in Colab) | How to use the mRNA-FM variant to predict gene expression levels from mRNA sequences. This tutorial demonstrates loading the specialized mRNA model, extracting embeddings, and building a classifier to differentiate between high and low expression genes. | Jupyter Notebook |

Additional Resources:

These tutorials cover the core applications of RNA-FM from basic embedding extraction to advanced functional predictions. Each provides hands-on examples you can run immediately in your browser or local environment.

Usage Examples with the Ecosystem

We recommend exploring the advanced RhoFold, RiboDiffusion, and RhoDesign projects for tasks like 3D structure prediction or RNA design. Below are brief usage samples:

RhoFold (Sequence → Structure)

    # Example: Predict 3D structure for an RNA sequence in FASTA.
    cd RhoFold
    python inference.py \
        --input_fas ./example/input/5t5a.fasta \
        --output_dir ./example/output/5t5a/ \
        --ckpt ./pretrained/RhoFold_pretrained.pt

Outputs:

  • unrelaxed_model.pdb / relaxed_1000_model.pdb (3D coordinates)
  • ss.ct (secondary structure)
  • results.npz (distance/angle predictions + confidence scores)
  • log.txt (run logs, pLDDT, etc.)
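
The ss.ct output can be read with a few lines of Python. This sketch assumes the conventional connect-table layout (a header line starting with the sequence length, then one line per nucleotide with the 1-based index, the base, the previous and next indices, the pairing partner or 0, and the original index); whether RhoFold's file matches this layout exactly is not stated here, and `parse_ct` and the toy hairpin are illustrative.

```python
def parse_ct(text):
    """Parse CT-format text into (sequence, sorted list of 1-based base pairs)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])  # header: "<length> <title>"
    bases, pairs = [], set()
    for ln in lines[1:n + 1]:
        fields = ln.split()
        idx, base, partner = int(fields[0]), fields[1], int(fields[4])
        bases.append(base)
        if partner > 0:  # 0 means unpaired
            pairs.add((min(idx, partner), max(idx, partner)))
    return "".join(bases), sorted(pairs)


toy_ct = """\
6 toy_hairpin
1 G 0 2 6 1
2 C 1 3 5 2
3 A 2 4 0 3
4 A 3 5 0 4
5 G 4 6 2 5
6 C 5 0 1 6
"""

sequence, pairs = parse_ct(toy_ct)
print(sequence, pairs)
```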

RiboDiffusion (Structure → Sequence)

    cd RiboDiffusion
    CUDA_VISIBLE_DEVICES=0 python main.py \
        --PDB_file examples/R1107.pdb \
        --config.eval.n_samples 5

This will generate 5 candidate RNA sequences that fold into the structure provided in R1107.pdb. The output FASTA files will be saved under the exp_inf/fasta/ directory.

RhoDesign (Structure → Sequence)

    cd RhoDesign
    python src/inference.py \
        --pdb ../example/2zh6_B.pdb \
        --ss ../example/2zh6_B.npy \
        --save ../example/

This produces a designed RNA sequence predicted to fold into the target 3D shape (PDB file 2zh6_B.pdb, with an optional secondary structure constraint from 2zh6_B.npy). The output sequence will be saved in the specified folder. You can adjust parameters like the sampling temperature to explore more diverse or high-fidelity designs.

API Reference

Each project in the RNA-FM ecosystem comes with both command-line interfaces and Python modules:

  • RNA-FM: Core module fm for embedding extraction and secondary structure prediction.
    • fm.pretrained.rna_fm_t12() – load the 12-layer ncRNA model
    • fm.pretrained.mrna_fm_t12() – load the 12-layer mRNA (codon) model
  • RhoFold: Use the RhoFoldModel class or the inference.py script.
    • inference.py takes a FASTA sequence (and optionally an MSA) and outputs a 3D structure.
    • Add --single_seq_pred True to run without an MSA (single-sequence mode).
  • RiboDiffusion: Use the main.py script or import the diffusion model classes.
    • main.py takes a PDB structure as input and outputs designed sequences.
    • Modify settings in configs/ (e.g., cond_scale, n_samples) to tune the generation.
  • RhoDesign: Use the inference.py script or import the design model module.
    • inference.py takes a PDB (and optional secondary structure/contact map) and outputs a designed sequence.
    • The GVP+Transformer architecture can incorporate partial structure constraints and supports advanced sampling strategies.

For further details, see each repo's documentation or the notebooks in the tutorials folder.


Related RNA Language Models

| Name | Dataset | Modality | Tokenization | Architecture | Backbone | Pre-training Task | Layers | Model Params | Data Size | Code | Weights | Data | License |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RNA-FM | ncRNA | Sequence | Base | Enc-only | Transformer | MLM | 12 | 100 M | 23 M | GitHub | HuggingFace | RNAcentral | MIT |
| RNABERT | ncRNA | Sequence | Base | Enc-only | Transformer | MLM / SAL | 6 | 0.5 M | 0.76 M | GitHub | Drive | Rfam 14.3 | MIT |
| RNA-MSM | ncRNA | Seq + MSA | Base | Enc-only | MSA-Transformer | MLM | 12 | 95 M | 3932 families | GitHub | Drive | Rfam 14.7 | MIT |
| AIDO.RNA | ncRNA | Sequence | Base | Enc-only | Transformer | MLM | 32 | 1.6 B | 42 M | GitHub | HuggingFace | Public ncRNA mix | Apache-2.0 |
| ERNIE-RNA | ncRNA | Sequence | Base | Enc-only | Transformer | MLM | 12 | 86 M | 20.4 M | GitHub | GitHub | Rfam + RNAcentral | MIT |
| GenerRNA | ncRNA | Sequence | BPE | Dec-only | Transformer | CLM | 24 | 350 M | 16.09 M | GitHub | HuggingFace | Public ncRNA mix | Apache-2.0 |
| RFamLlama | ncRNA | Sequence | Base | Dec-only | Llama | CLM | 6-10 | 13-88 M | 0.6 M | HuggingFace | HuggingFace | Rfam 14.10 | CC BY-NC-4.0 |
| RNA-km | ncRNA | Sequence | Base | Enc-only | Transformer | MLM | 12 | 152 M | 23 M | GitHub | Drive | Rfam + RNAcentral | MIT |
| RNAErnie | ncRNA | Sequence | Base | Enc-only | Transformer | MLM | 12 | 105 M | 23 M | GitHub | GitHub | Public ncRNA mix | Apache-2.0 |
| OPED | pegRNA | Sequence | k-mer | Enc-Dec | Transformer | Regression | n/a | n/a | 40 k | GitHub | | Public pegRNA eff. | MIT |
| GARNET | rRNA | Sequence | k-mer | Dec-only | Transformer | CLM | 18 | 19 M | 89 M tokens | GitHub | Release | Public rRNA | MIT |
| IsoCLR | pre-mRNA | Sequence | One-hot | Enc-only | CNN | Contrast Learning | 8 | 1-10 M | 1 M | GitHub | | Ensembl / RefSeq | |
| SpliceBERT | pre-mRNA | Sequence | Base | Enc-only | Transformer | MLM | 6 | 20 M | 2 M | GitHub | Zenodo | UCSC/GENCODE | MIT |
| Orthrus | pre-mRNA | Sequence | Base | Enc-only | Mamba | Contrast Learning | 3-6 | 1-10 M | 49 M | GitHub | HuggingFace | Ortholog set | Apache-2.0 |
| LoRNA | pre-mRNA | Sequence | Base | Dec-only | StripedHyena | Contrast Learning | 16 | 6.5 M | 100 M | GitHub | (announced) | SRA (long-read) | MIT |
| CodonBERT | mRNA CDS | Sequence | Codon | Enc-only | Transformer | MLM / HSP | 12 | 87 M | 10 M | GitHub | HuggingFace | NCBI mRNA | Apache-2.0 |
| UTR-LM | 5′UTR | Sequence | Base | Enc-only | Transformer | MLM / SSP / MFE | 6 | 1 M | 0.7 M | GitHub | GitHub | Public 5′UTR set | MIT |
| 3UTRBERT | 3′UTR | Sequence | k-mer | Enc-only | Transformer | MLM | 12 | 86 M | 20 k | GitHub | HuggingFace | Public 3′UTR | MIT |
| G4mer | mRNA | Sequence | k-mer | Enc-only | Transformer | MLM | 6 | | | | | | |
| HELM | mRNA | Sequence | Codon | Multi | Multi | MLM + CLM | | 50 M | 15.3 M | | | | |
| RiNALMo | RNA | Sequence | Base | Enc-only | Transformer | MLM | 33 | 135-650 M | 36 M | GitHub | (request) | Public ncRNA | MIT |
| UNI-RNA | RNA | Sequence | Base | Enc-only | Transformer | MLM | 24 | 400 M | 500 M | | | | |
| ATOM-1 | RNA | Sequence | Base | Enc-Dec | Transformer | | | | | | | | |
| BiRNA-BERT | RNA | Sequence | Base + BPE | Enc-only | Transformer | MLM | 12 | 117 M | 36 M | GitHub | HuggingFace | Public ncRNA | MIT |
| ChaRNABERT | RNA | Sequence | GBST | Enc-only | Transformer | MLM | 6-33 | 8-650 M | 62 M | | (8 M demo) | Public ncRNA | |
| DGRNA | RNA | Sequence | Base | Enc-only | Mamba | MLM | 12 | 100 M | 100 M | | | | |
| LAMAR | RNA | Sequence | Base | Enc-only | Transformer | MLM | 12 | 150 M | 15 M | GitHub | (announced) | Public ncRNA | MIT |
| OmniGenome | RNA | Sequence, Structure | Base | Enc-only | Transformer | MLM / Seq2Str / Str2Seq | 16-32 | 52-186 M | 25 M | GitHub | HuggingFace | Public multi-omics | Apache-2.0 |
| PlantRNA-FM | RNA | Sequence, Structure | Base | Enc-only | Transformer | MLM / SSP / CLS | 12 | 35 M | 25 M | HuggingFace | HuggingFace | Plant RNA set | CC BY-NC-4.0 |
| MP-RNA | RNA | Sequence, Structure | Base | Enc-only | Transformer | SSP / SNMR / MRLM | 12 | 52-186 M | 25 M | GitHub | (planned) | Public ncRNA mix | Apache-2.0 |

Citations

If you use RNA-FM or any components of this ecosystem in your research, please cite the relevant papers. Below is a collection of key publications (in BibTeX format) covering the foundation model and associated tools:

BibTeX Citations

RNA-FM & RNA Structure Predictions

    @article{chen2022interpretable,
      title={Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions},
      author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and Shen, Tao and others},
      journal={arXiv preprint arXiv:2204.00300},
      year={2022}
    }

    @article{shen2024accurate,
      title={Accurate RNA 3D structure prediction using a language model-based deep learning approach},
      author={Shen, Tao and Hu, Zhihang and Sun, Siqi and Liu, Di and Wong, Felix and Wang, Jiuming and Chen, Jiayang and Wang, Yixuan and Hong, Liang and Xiao, Jin and others},
      journal={Nature Methods},
      pages={1--12},
      year={2024},
      publisher={Nature Publishing Group US New York}
    }

    @article{chen2020rna,
      title={RNA secondary structure prediction by learning unrolled algorithms},
      author={Chen, Xinshi and Li, Yu and Umarov, Ramzan and Gao, Xin and Song, Le},
      journal={arXiv preprint arXiv:2002.05810},
      year={2020}
    }

    @article{WANG2025102991,
      title={Deep learning for RNA structure prediction},
      author={Jiuming Wang and Yimin Fan and Liang Hong and Zhihang Hu and Yu Li},
      journal={Current Opinion in Structural Biology},
      year={2025},
      doi={https://doi.org/10.1016/j.sbi.2025.102991},
      url={https://www.sciencedirect.com/science/article/pii/S0959440X25000090},
    }

RNA Design & Inverse Folding

    @article{wong2024deep,
      title={Deep generative design of RNA aptamers using structural predictions},
      author={Wong, Felix and He, Dongchen and Krishnan, Aarti and Hong, Liang and Wang, Alexander Z and Wang, Jiuming and Hu, Zhihang and Omori, Satotaka and Li, Alicia and Rao, Jiahua and others},
      journal={Nature Computational Science},
      pages={1--11},
      year={2024},
      publisher={Nature Publishing Group US New York}
    }

    @article{huang2024ribodiffusion,
      title={RiboDiffusion: tertiary structure-based RNA inverse folding with generative diffusion models},
      author={Huang, Han and Lin, Ziqian and He, Dongchen and Hong, Liang and Li, Yu},
      journal={Bioinformatics},
      volume={40},
      number={Supplement\_1},
      pages={i347--i356},
      year={2024},
      publisher={Oxford University Press}
    }

RNA-Protein Interaction (RPI)

    @article{wei2022protein,
      title={Protein--RNA interaction prediction with deep learning: structure matters},
      author={Wei, Junkang and Chen, Siyuan and Zong, Licheng and Gao, Xin and Li, Yu},
      journal={Briefings in bioinformatics},
      volume={23},
      number={1},
      pages={bbab540},
      year={2022},
      publisher={Oxford University Press}
    }

    @article{lam2019deep,
      title={A deep learning framework to predict binding preference of RNA constituents on protein surface},
      author={Lam, Jordy Homing and Li, Yu and Zhu, Lizhe and Umarov, Ramzan and Jiang, Hanlun and H{\'e}liou, Am{\'e}lie and Sheong, Fu Kit and Liu, Tianyun and Long, Yongkang and Li, Yunfei and others},
      journal={Nature communications},
      volume={10},
      number={1},
      pages={4941},
      year={2019},
      publisher={Nature Publishing Group UK London}
    }

Databases & Resources

    @article{wei2024pronet,
      title={ProNet DB: a proteome-wise database for protein surface property representations and RNA-binding profiles},
      author={Wei, Junkang and Xiao, Jin and Chen, Siyuan and Zong, Licheng and Gao, Xin and Li, Yu},
      journal={Database},
      volume={2024},
      pages={baae012},
      year={2024},
      publisher={Oxford University Press UK}
    }

Single-Cell RNA Analysis

    @article{han2022self,
      title={Self-supervised contrastive learning for integrative single cell RNA-seq data analysis},
      author={Han, Wenkai and Cheng, Yuqi and Chen, Jiayang and Zhong, Huawen and Hu, Zhihang and Chen, Siyuan and Zong, Licheng and Hong, Liang and Chan, Ting-Fung and King, Irwin and others},
      journal={Briefings in Bioinformatics},
      volume={23},
      number={5},
      pages={bbac377},
      year={2022},
      publisher={Oxford University Press}
    }

Drug Discovery

    @article{fan2022highly,
      title={The highly conserved RNA-binding specificity of nucleocapsid protein facilitates the identification of drugs with broad anti-coronavirus activity},
      author={Fan, Shaorong and Sun, Wenju and Fan, Ligang and Wu, Nan and Sun, Wei and Ma, Haiqian and Chen, Siyuan and Li, Zitong and Li, Yu and Zhang, Jilin and others},
      journal={Computational and Structural Biotechnology Journal},
      volume={20},
      pages={5040--5044},
      year={2022},
      publisher={Elsevier}
    }

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Our framework and model training were inspired by:

  • esm (Facebook’s protein language modeling framework)
  • fairseq (PyTorch sequence modeling framework)

We thank the authors of these works for providing excellent foundations for RNA-FM.


Thank you for using RNA-FM!
For issues or questions, open a GitHub Issue or consult the documentation. We welcome contributions and collaboration from the community.
