A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery (EMNLP'24)

yuzhimanhua/Awesome-Scientific-Language-Models

License: MIT | PRs Welcome

A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, materials science, biology, medicine, geoscience), covering different model sizes (from 100M to 100B parameters) and modalities (e.g., language, graph, vision, table, molecule, protein, genome, climate time series).

The repository is part of our survey paper, A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery, and will be continuously updated.

NOTE 1: To avoid ambiguity, when we talk about the number of parameters in a model, "Base" refers to 110M (i.e., BERT-Base), and "Large" refers to 340M (i.e., BERT-Large). Other numbers will be written explicitly.
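As a reading aid, the size labels used throughout this list can be mapped to approximate parameter counts. The sketch below is not part of the repository: `param_count` and the two lookup tables are hypothetical names, and the code only assumes the convention stated in NOTE 1 plus labels written like `113M`, `6.7B`, or `7B, LLaMA-2`.

```python
import re

# Named sizes as defined in NOTE 1; all other labels spell out the number.
NAMED_SIZES = {"Base": 110_000_000, "Large": 340_000_000}
UNITS = {"M": 1_000_000, "B": 1_000_000_000}


def param_count(label: str) -> int:
    """Convert a size label ('Base', '113M', '6.7B, Galactica') to a parameter count."""
    # Drop any backbone suffix after a comma, e.g. '7B, LLaMA-2' -> '7B'.
    size = label.split(",")[0].strip()
    if size in NAMED_SIZES:
        return NAMED_SIZES[size]
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([MB])", size)
    if match is None:
        raise ValueError(f"unrecognized size label: {label!r}")
    number, unit = match.groups()
    return round(float(number) * UNITS[unit])
```

For instance, `param_count("Base")` gives 110000000 and `param_count("7B, LLaMA-2")` gives 7000000000. Links labeled only `[Model]` carry no size and are rejected with a `ValueError`.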

NOTE 2: In each subsection, papers are sorted chronologically. If a paper has a preprint (e.g., arXiv or bioRxiv) version, its publication date is that of the preprint. Otherwise, its publication date is that of the conference proceedings or journal.

NOTE 3: We appreciate contributions. If you have any suggested papers, feel free to reach out to yuzhang@tamu.edu or submit a pull request. For format consistency, we will include a paper after (1) it has a version with author names AND (2) its GitHub and/or Hugging Face links are available.
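The ordering rule in NOTE 2 explains why an entry with a later venue year can appear before one with an earlier venue year. A minimal sketch of that rule follows. The `Entry` class, `sort_entries`, and the two example entries with their dates are hypothetical illustrations, not data from this list.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Entry:
    name: str
    venue_date: date                      # conference proceedings or journal date
    preprint_date: Optional[date] = None  # arXiv/bioRxiv date, if a preprint exists

    @property
    def sort_date(self) -> date:
        # The preprint date takes precedence whenever a preprint exists.
        if self.preprint_date is not None:
            return self.preprint_date
        return self.venue_date


def sort_entries(entries: List[Entry]) -> List[Entry]:
    """Sort chronologically by preprint date, falling back to the venue date."""
    return sorted(entries, key=lambda e: e.sort_date)


# Hypothetical entries: "A" is published later but was posted as a preprint earlier.
example = [
    Entry("B", venue_date=date(2021, 11, 1)),
    Entry("A", venue_date=date(2022, 6, 1), preprint_date=date(2020, 11, 1)),
]
ordered_names = [e.name for e in sort_entries(example)]  # ["A", "B"]
```

Because `sorted` is stable, entries that share a date keep their original relative order.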

Contents

General

Language

Language + Graph

  • (SPECTER) SPECTER: Document-level Representation Learning using Citation-informed Transformers. ACL 2020
    [Paper] [GitHub] [Model (Base)]

  • (OAG-BERT) OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services. KDD 2022
    [Paper] [GitHub]

  • (ASPIRE) Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity. NAACL 2022
    [Paper] [GitHub] [Model (Base)]

  • (SciNCL) Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. EMNLP 2022
    [Paper] [GitHub] [Model (Base)]

  • (SPECTER 2.0) SciRepEval: A Multi-Format Benchmark for Scientific Document Representations. EMNLP 2023
    [Paper] [GitHub] [Model (113M)]

  • (SciPatton) Patton: Language Model Pretraining on Text-Rich Networks. ACL 2023
    [Paper] [GitHub]

  • (SciMult) Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding. EMNLP 2023 Findings
    [Paper] [GitHub] [Model (138M)]

Mathematics

Language

Language + Vision

  • (Inter-GPS) Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning. ACL 2021
    [Paper] [GitHub]

  • (Geoformer) UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression. EMNLP 2022
    [Paper] [GitHub]

  • (SCA-GPS) A Symbolic Character-Aware Model for Solving Geometry Problems. ACM MM 2023
    [Paper] [GitHub]

  • (UniMath-Flan-T5) UniMath: A Foundational and Multimodal Mathematical Reasoner. EMNLP 2023
    [Paper] [GitHub]

  • (G-LLaVA) G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model. arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

Other Modalities (Table)

  • (TAPAS) TAPAS: Weakly Supervised Table Parsing via Pre-training. ACL 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (TaBERT) TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables. ACL 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (GraPPa) GraPPa: Grammar-Augmented Pre-training for Table Semantic Parsing. ICLR 2021
    [Paper] [GitHub] [Model (355M)]

  • (TUTA) TUTA: Tree-Based Transformers for Generally Structured Table Pre-training. KDD 2021
    [Paper] [GitHub]

  • (RCI) Capturing Row and Column Semantics in Transformer Based Question Answering over Tables. NAACL 2021
    [Paper] [GitHub] [Model (12M)]

  • (TABBIE) TABBIE: Pretrained Representations of Tabular Data. NAACL 2021
    [Paper] [GitHub]

  • (TAPEX) TAPEX: Table Pre-training via Learning a Neural SQL Executor. ICLR 2022
    [Paper] [GitHub] [Model (140M)] [Model (406M)]

  • (FORTAP) FORTAP: Using Formulas for Numerical-Reasoning-Aware Table Pretraining. ACL 2022
    [Paper] [GitHub]

  • (OmniTab) OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-Based Question Answering. NAACL 2022
    [Paper] [GitHub] [Model (406M)]

  • (ReasTAP) ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples. EMNLP 2022
    [Paper] [GitHub] [Model (406M)]

  • (Table-GPT) Table-GPT: Table-tuned GPT for Diverse Table Tasks. SIGMOD 2024
    [Paper]

  • (TableLlama) TableLlama: Towards Open Large Generalist Models for Tables. NAACL 2024
    [Paper] [GitHub] [Model (7B)]

  • (TableLLM) TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios. arXiv 2024
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

Physics

Language

  • (astroBERT) Building astroBERT, a Language Model for Astronomy & Astrophysics. arXiv 2021
    [Paper] [Model (Base)]

  • (AstroLLaMA) AstroLLaMA: Towards Specialized Foundation Models in Astronomy. AACL 2023 Workshop
    [Paper] [Model (7B)]

  • (AstroLLaMA-Chat) AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets. Research Notes of the AAS 2024
    [Paper] [Model (7B)]

  • (PhysBERT) PhysBERT: A Text Embedding Model for Physics Scientific Literature. APL Machine Learning 2024
    [Paper] [Model (Base)]

  • (Astro-HEP-BERT) Astro-HEP-BERT: A Bidirectional Language Model for Studying the Meanings of Concepts in Astrophysics and High Energy Physics. arXiv 2024
    [Paper] [Model (Base)]

Chemistry and Materials Science

Language

  • (ChemBERT) Automated Chemical Reaction Extraction from Scientific Literature. Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub] [Model (Base)]

  • (MatSciBERT) MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction. npj Computational Materials 2022
    [Paper] [GitHub] [Model (Base)]

  • (MatBERT) Quantifying the Advantage of Domain-Specific Pre-training on Named Entity Recognition Tasks in Materials Science. Patterns 2022
    [Paper] [GitHub]

  • (BatteryBERT) BatteryBERT: A Pretrained Language Model for Battery Database Enhancement. Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub] [Model (Base)]

  • (MaterialsBERT) A General-Purpose Material Property Data Extraction Pipeline from Large Polymer Corpora using Natural Language Processing. npj Computational Materials 2023
    [Paper] [Model (Base)]

  • (Recycle-BERT) Recycle-BERT: Extracting Knowledge about Plastic Waste Recycling by Natural Language Processing. ACS Sustainable Chemistry & Engineering 2023
    [Paper] [GitHub]

  • (CatBERTa) Catalyst Property Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models. ACS Catalysis 2023
    [Paper] [GitHub]

  • (LLM-Prop) LLM-Prop: Predicting Physical and Electronic Properties of Crystalline Solids from Their Text Descriptions. arXiv 2023
    [Paper] [GitHub]

  • (ChemDFM) ChemDFM: Dialogue Foundation Model for Chemistry. arXiv 2024
    [Paper] [GitHub] [Model (13B)]

  • (CrystalLLM) Fine-Tuned Language Models Generate Stable Inorganic Materials as Text. ICLR 2024
    [Paper] [GitHub]

  • (ChemLLM) ChemLLM: A Chemical Large Language Model. arXiv 2024
    [Paper] [Model (7B)]

  • (LlaSMol) LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset. COLM 2024
    [Paper] [GitHub] [Model (6.7B, Galactica)] [Model (7B, LLaMA-2)] [Model (7B, Mistral)]

  • (KALE-LM) KALE-LM: Unleash The Power Of AI For Science Via Knowledge And Logic Enhanced Large Model. arXiv 2024
    [Paper] [Model (8B)]

Language + Graph

  • (Text2Mol) Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries. EMNLP 2021
    [Paper] [GitHub]

  • (KV-PLM) A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals. Nature Communications 2022
    [Paper] [GitHub] [Model (Base)]

  • (MolT5) Translation between Molecules and Natural Language. EMNLP 2022
    [Paper] [GitHub] [Model (60M)] [Model (220M)] [Model (770M)]

  • (MoMu) A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language. arXiv 2022
    [Paper] [GitHub]

  • (MoleculeSTM) Multi-modal Molecule Structure-text Model for Text-Based Retrieval and Editing. Nature Machine Intelligence 2023
    [Paper] [GitHub]

  • (Text+Chem T5) Unifying Molecular and Textual Representations via Multi-task Language Modelling. ICML 2023
    [Paper] [GitHub] [Model (60M)] [Model (220M)]

  • (GIMLET) GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning. NeurIPS 2023
    [Paper] [GitHub] [Model (60M)]

  • (MolFM) MolFM: A Multimodal Molecular Foundation Model. arXiv 2023
    [Paper] [GitHub]

  • (MolCA) MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter. EMNLP 2023
    [Paper] [GitHub]

  • (MolLM) MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations. Bioinformatics 2024
    [Paper] [GitHub]

  • (InstructMol) InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery. COLING 2025
    [Paper] [GitHub]

  • (3D-MoLM) Towards 3D Molecule-Text Interpretation in Language Models. ICLR 2024
    [Paper] [GitHub]

Language + Vision

  • (GIT-Mol) GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text. Computers in Biology and Medicine 2024
    [Paper] [GitHub]

Other Modalities (Molecule)

  • (SMILES-BERT) SMILES-BERT: Large Scale Unsupervised Pre-training for Molecular Property Prediction. ACM BCB 2019
    [Paper] [GitHub]

  • (MAT) Molecule Attention Transformer. arXiv 2020
    [Paper] [GitHub]

  • (ChemBERTa) ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv 2020
    [Paper] [GitHub] [Model (125M)]

  • (MolBERT) Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks. arXiv 2020
    [Paper] [GitHub] [Model (Base)]

  • (rxnfp) Mapping the Space of Chemical Reactions using Attention-Based Neural Networks. Nature Machine Intelligence 2021
    [Paper] [GitHub] [Model (Base)]

  • (RXNMapper) Extraction of Organic Chemistry Grammar from Unsupervised Learning of Chemical Reactions. Science Advances 2021
    [Paper] [GitHub]

  • (MoLFormer) Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. Nature Machine Intelligence 2022
    [Paper] [GitHub] [Model (47M)]

  • (Chemformer) Chemformer: A Pre-trained Transformer for Computational Chemistry. Machine Learning: Science and Technology 2022
    [Paper] [GitHub] [Model (45M)] [Model (230M)]

  • (R-MAT) Relative Molecule Self-Attention Transformer. Journal of Cheminformatics 2024
    [Paper] [GitHub]

  • (MolGPT) MolGPT: Molecular Generation using a Transformer-Decoder Model. Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (T5Chem) Unified Deep Learning Model for Multitask Reaction Predictions with Explanation. Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (ChemGPT) Neural Scaling of Deep Chemical Models. Nature Machine Intelligence 2023
    [Paper] [Model (4.7M)] [Model (19M)] [Model (1.2B)]

  • (Uni-Mol) Uni-Mol: A Universal 3D Molecular Representation Learning Framework. ICLR 2023
    [Paper] [GitHub]

  • (TransPolymer) TransPolymer: A Transformer-Based Language Model for Polymer Property Predictions. npj Computational Materials 2023
    [Paper] [GitHub]

  • (polyBERT) polyBERT: A Chemical Language Model to Enable Fully Machine-Driven Ultrafast Polymer Informatics. Nature Communications 2023
    [Paper] [GitHub] [Model (86M)]

  • (MFBERT) Large-Scale Distributed Training of Transformers for Chemical Fingerprinting. Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (SPMM) Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model. Nature Communications 2024
    [Paper] [GitHub]

  • (BARTSmiles) BARTSmiles: Generative Masked Language Models for Molecular Representations. Journal of Chemical Information and Modeling 2024
    [Paper] [GitHub] [Model (406M)]

  • (MolGen) Domain-Agnostic Molecular Generation with Self-feedback. ICLR 2024
    [Paper] [GitHub] [Model (406M, BART)] [Model (7B, LLaMA)]

  • (SELFormer) SELFormer: Molecular Representation Learning via SELFIES Language Models. Machine Learning: Science and Technology 2023
    [Paper] [GitHub] [Model (58M)] [Model (87M)]

  • (PolyNC) PolyNC: A Natural and Chemical Language Model for the Prediction of Unified Polymer Properties. Chemical Science 2024
    [Paper] [GitHub] [Model (220M)]

Biology and Medicine

Acknowledgment: We referred to Wang et al.'s survey paper, Pre-trained Language Models in Biomedical Domain: A Systematic Survey, and He et al.'s survey paper, Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions, when writing some parts of this section.

Language

Language + Graph

  • (G-BERT) Pre-training of Graph Augmented Transformers for Medication Recommendation. IJCAI 2019
    [Paper] [GitHub]

  • (CODER) CODER: Knowledge Infused Cross-Lingual Medical Term Embedding for Term Normalization. JBI 2022
    [Paper] [GitHub] [Model (Base)]

  • (MoP) Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT. EMNLP 2021
    [Paper] [GitHub]

  • (BioLinkBERT) LinkBERT: Pretraining Language Models with Document Links. ACL 2022
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (DRAGON) Deep Bidirectional Language-Knowledge Graph Pretraining. NeurIPS 2022
    [Paper] [GitHub] [Model (360M)]

Language + Vision

  • (ConVIRT) Contrastive Learning of Medical Visual Representations from Paired Images and Text. MLHC 2022
    [Paper] [GitHub]

  • (MMBERT) MMBERT: Multimodal BERT Pretraining for Improved Medical VQA. ISBI 2021
    [Paper] [GitHub]

  • (MedViLL) Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-training. JBHI 2022
    [Paper] [GitHub]

  • (GLoRIA) GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition. ICCV 2021
    [Paper] [GitHub]

  • (LoVT) Joint Learning of Localized Representations from Medical Images and Reports. ECCV 2022
    [Paper] [GitHub]

  • (BioViL) Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing. ECCV 2022
    [Paper] [GitHub]

  • (M3AE) Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-training. MICCAI 2022
    [Paper] [GitHub] [Model]

  • (ARL) Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge. ACM MM 2022
    [Paper] [GitHub]

  • (CheXzero) Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning. Nature Biomedical Engineering 2022
    [Paper] [GitHub] [Model]

  • (MGCA) Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning. NeurIPS 2022
    [Paper] [GitHub] [Model]

  • (MedCLIP) MedCLIP: Contrastive Learning from Unpaired Medical Images and Text. EMNLP 2022
    [Paper] [GitHub]

  • (BioViL-T) Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing. CVPR 2023
    [Paper] [GitHub] [Model]

  • (BiomedCLIP) BiomedCLIP: A Multimodal Biomedical Foundation Model Pretrained from Fifteen Million Scientific Image-Text Pairs. NEJM AI 2024
    [Paper] [Model]

  • (PMC-CLIP) PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents. MICCAI 2023
    [Paper] [GitHub] [Model]

  • (Xplainer) Xplainer: From X-Ray Observations to Explainable Zero-Shot Diagnosis. MICCAI 2023
    [Paper] [GitHub]

  • (RGRG) Interactive and Explainable Region-Guided Radiology Report Generation. CVPR 2023
    [Paper] [GitHub] [Model]

  • (BiomedGPT) A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks. Nature Medicine 2024
    [Paper] [GitHub] [Model (33M)] [Model (93M)] [Model (182M)]

  • (Med-UniC) Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias. NeurIPS 2023
    [Paper] [GitHub]

  • (LLaVA-Med) LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. NeurIPS 2023
    [Paper] [GitHub] [Model (7B)]

  • (MI-Zero) Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images. CVPR 2023
    [Paper] [GitHub] [Model]

  • (XrayGPT) XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models. ACL 2024 Workshop
    [Paper] [GitHub]

  • (MONET) Transparent Medical Image AI via an Image–Text Foundation Model Grounded in Medical Literature. Nature Medicine 2024
    [Paper] [GitHub]

  • (QuiltNet) Quilt-1M: One Million Image-Text Pairs for Histopathology. NeurIPS 2023
    [Paper] [GitHub] [Model]

  • (MUMC) Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering. MICCAI 2023
    [Paper] [GitHub]

  • (M-FLAG) M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization. MICCAI 2023
    [Paper] [GitHub]

  • (PRIOR) PRIOR: Prototype Representation Joint Learning from Medical Images and Reports. ICCV 2023
    [Paper] [GitHub]

  • (Med-PaLM M) Towards Generalist Biomedical AI. NEJM AI 2024
    [Paper] [GitHub]

  • (CITE) Text-Guided Foundation Model Adaptation for Pathological Image Classification. MICCAI 2023
    [Paper] [GitHub]

  • (Med-Flamingo) Med-Flamingo: A Multimodal Medical Few-shot Learner. ML4H 2023
    [Paper] [GitHub]

  • (RadFM) Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data. arXiv 2023
    [Paper] [GitHub] [Model]

  • (PLIP) A Visual–Language Foundation Model for Pathology Image Analysis using Medical Twitter. Nature Medicine 2023
    [Paper] [GitHub] [Model]

  • (MaCo) Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning. Nature Communications 2024
    [Paper] [GitHub]

  • (CXR-CLIP) CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training. MICCAI 2023
    [Paper] [GitHub]

  • (Qilin-Med-VL) Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare. arXiv 2023
    [Paper] [GitHub] [Model]

  • (BioCLIP) BioCLIP: A Vision Foundation Model for the Tree of Life. CVPR 2024
    [Paper] [GitHub] [Model]

  • (M3D) M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models. arXiv 2024
    [Paper] [GitHub] [Model]

  • (Med-Gemini) Capabilities of Gemini Models in Medicine. arXiv 2024
    [Paper]

  • (Med-Gemini-2D/3D/Polygenic) Advancing Multimodal Medical Capabilities of Gemini. arXiv 2024
    [Paper]

  • (Mammo-CLIP) Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography. MICCAI 2024
    [Paper] [GitHub] [Model]

  • (BiomedParse) A Foundation Model for Joint Segmentation, Detection and Recognition of Biomedical Objects across Nine Modalities. Nature Methods 2025
    [Paper] [GitHub] [Model]

  • (HuatuoGPT-Vision) Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale. EMNLP 2024
    [Paper] [GitHub] [Model (7B)] [Model (34B)]

Other Modalities (Protein)

Other Modalities (DNA)

  • (DNABERT) DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-Language in Genome. Bioinformatics 2021
    [Paper] [GitHub] [Model (Base)]

  • (GenSLMs) GenSLMs: Genome-Scale Language Models Reveal SARS-CoV-2 Evolutionary Dynamics. The International Journal of High Performance Computing Applications 2023
    [Paper] [GitHub]

  • (Nucleotide Transformer) Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. Nature Methods 2024
    [Paper] [GitHub] [Model (50M)] [Model (100M)] [Model (250M)] [Model (500M)]

  • (SpeciesLM) Species-Aware DNA Language Models Capture Regulatory Elements and Their Evolution. Genome Biology 2024
    [Paper] [GitHub] [Model (89M)]

  • (GENA-LM) GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences. Nucleic Acids Research 2025
    [Paper] [GitHub] [Model (Base, BERT)] [Model (Large, BERT)] [Model (Base, BigBird)]

  • (DNABERT-2) DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome. ICLR 2024
    [Paper] [GitHub] [Model (Base)]

  • (HyenaDNA) HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. NeurIPS 2023
    [Paper] [GitHub] [Model (0.4M)] [Model (3.3M)] [Model (6.6M)]

  • (DNAGPT) DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks. arXiv 2023
    [Paper] [GitHub] [Model (0.1B)] [Model (3B)]

  • (GPN-MSA) GPN-MSA: An Alignment-Based DNA Language Model for Genome-Wide Variant Effect Prediction. Nature Biotechnology 2025
    [Paper] [GitHub] [Model (86M)]

  • (ENBED) Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision. Bioinformatics Advances 2024
    [Paper] [GitHub]

  • (LucaOne) LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language. bioRxiv 2024
    [Paper] [GitHub]

  • (AIDO.DNA) Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale. bioRxiv 2024
    [Paper] [GitHub] [Model (7B)]

Other Modalities (RNA)

  • (RNABERT) Informative RNA-base Embedding for Functional RNA Structural Alignment and Clustering by Deep Representation Learning. NAR Genomics and Bioinformatics 2022
    [Paper] [GitHub]

  • (RNA-FM) Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions. arXiv 2022
    [Paper] [GitHub]

  • (SpliceBERT) Self-Supervised Learning on Millions of Primary RNA Sequences from 72 Vertebrates Improves Sequence-Based RNA Splicing Prediction. Briefings in Bioinformatics 2024
    [Paper] [GitHub] [Model (19.4M)]

  • (RNA-MSM) Multiple Sequence-Alignment-Based RNA Language Model and its Application to Structural Inference. Nucleic Acids Research 2024
    [Paper] [GitHub]

  • (CodonBERT) CodonBERT Large Language Model for mRNA Vaccines. Genome Research 2024
    [Paper] [GitHub]

  • (UTR-LM) A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions. Nature Machine Intelligence 2024
    [Paper] [GitHub]

  • (GenerRNA) GenerRNA: A Generative Pre-trained Language Model for de novo RNA Design. PLoS One 2024
    [Paper] [Model (350M)]

  • (RNAErnie) Multi-Purpose RNA Language Modelling with Motif-Aware Pretraining and Type-Guided Fine-Tuning. Nature Machine Intelligence 2024
    [Paper] [GitHub] [Model (105M)]

  • (RNA-TorsionBERT) RNA-TorsionBERT: Leveraging Language Models for RNA 3D Torsion Angles Prediction. Bioinformatics 2025
    [Paper] [GitHub]

  • (PlantRNA-FM) An Interpretable RNA Foundation Model for Exploring Functional RNA Motifs in Plants. Nature Machine Intelligence 2024
    [Paper] [GitHub] [Model (35M)]

  • (AIDO.RNA) A Large-Scale Foundation Model for RNA Function and Structure Prediction. bioRxiv 2024
    [Paper] [GitHub] [Model (1.6B)]

Other Modalities (Multiomics)

  • (scBERT) scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data. Nature Machine Intelligence 2022
    [Paper] [GitHub]

  • (scGPT) scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics using Generative AI. Nature Methods 2024
    [Paper] [GitHub]

  • (scFoundation) Large Scale Foundation Model on Single-cell Transcriptomics. Nature Methods 2024
    [Paper] [GitHub] [Model (100M)]

  • (Geneformer) Transfer Learning Enables Predictions in Network Biology. Nature 2023
    [Paper] [Model (10M)] [Model (40M)]

  • (CellLM) Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning. arXiv 2023
    [Paper] [GitHub]

  • (CellPLM) CellPLM: Pre-training of Cell Language Model Beyond Single Cells. ICLR 2024
    [Paper] [GitHub] [Model (82M)]

  • (scMulan) scMulan: A Multitask Generative Pre-trained Language Model for Single-Cell Analysis. RECOMB 2024
    [Paper] [GitHub]

Geography, Geology, and Environmental Science

Language

  • (ClimateBERT) ClimateBERT: A Pretrained Language Model for Climate-Related Text. arXiv 2021
    [Paper] [GitHub] [Model (82M)]

  • (SpaBERT) SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation. EMNLP 2022 Findings
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (MGeo) MGeo: Multi-Modal Geographic Pre-training Method. SIGIR 2023
    [Paper] [GitHub]

  • (K2) K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. WSDM 2024
    [Paper] [GitHub] [Model (7B)]

  • (OceanGPT) OceanGPT: A Large Language Model for Ocean Science Tasks. ACL 2024
    [Paper] [GitHub] [Model (7B)]

  • (ClimateBERT-NetZero) ClimateBERT-NetZero: Detecting and Assessing Net Zero and Reduction Targets. EMNLP 2023
    [Paper] [Model (82M)]

  • (GeoLM) GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding. EMNLP 2023
    [Paper] [GitHub]

  • (GeoGalactica) GeoGalactica: A Scientific Large Language Model in Geoscience. arXiv 2024
    [Paper] [GitHub] [Model (30B)]

  • (UrbanKGent) UrbanKGent: A Unified Large Language Model Agent Framework for Urban Knowledge Graph Construction. NeurIPS 2024
    [Paper] [GitHub] [Model (7B, LLaMA-2)] [Model (13B, LLaMA-2)] [Model (8B, LLaMA-3)]

  • (UrbanGPT) UrbanGPT: Spatio-Temporal Large Language Models. KDD 2024
    [Paper] [GitHub] [Model (7B)]

  • (JiuZhou) JiuZhou: Open Foundation Language Models and Effective Pre-training Framework for Geoscience. International Journal of Digital Earth 2025
    [Paper] [GitHub] [Model (7B)]

Language + Graph

  • (ERNIE-GeoL) ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps. KDD 2022
    [Paper]

  • (PK-Chat) PK-Chat: Pointer Network Guided Knowledge Driven Generative Dialogue Model. arXiv 2023
    [Paper] [GitHub]

Language + Vision

  • (GeoCLIP) GeoCLIP: CLIP-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization. NeurIPS 2023
    [Paper] [GitHub]

  • (UrbanCLIP) UrbanCLIP: Learning Text-Enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web. WWW 2024
    [Paper] [GitHub]

Other Modalities (Climate Time Series)

  • (FourCastNet) FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators. arXiv 2022
    [Paper] [GitHub]

  • (Pangu-Weather) Accurate Medium-Range Global Weather Forecasting with 3D Neural Networks. Nature 2023
    [Paper] [GitHub]

  • (GraphCast) Learning Skillful Medium-Range Global Weather Forecasting. Science 2023
    [Paper] [GitHub]

  • (ClimaX) ClimaX: A Foundation Model for Weather and Climate. ICML 2023
    [Paper] [GitHub]

  • (FengWu) FengWu: Pushing the Skillful Global Medium-Range Weather Forecast beyond 10 Days Lead. arXiv 2023
    [Paper] [GitHub]

  • (W-MAE) W-MAE: Pre-trained Weather Model with Masked Autoencoder for Multi-Variable Weather Forecasting. arXiv 2023
    [Paper] [GitHub]

  • (FuXi) FuXi: A Cascade Machine Learning Forecasting System for 15-day Global Weather Forecast. npj Climate and Atmospheric Science 2023
    [Paper] [GitHub]

  • (Stormer) Scaling Transformer Neural Networks for Skillful and Reliable Medium-Range Weather Forecasting. NeurIPS 2024
    [Paper] [GitHub]

  • (Aurora) A Foundation Model for the Earth System. arXiv 2024
    [Paper] [GitHub]

  • (Prithvi WxC) Prithvi WxC: Foundation Model for Weather and Climate. arXiv 2024
    [Paper] [GitHub] [Model (2.3B)]

Citation

If you find this repository useful, please cite the following paper:

@inproceedings{zhang2024comprehensive,
  title={A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery},
  author={Zhang, Yu and Chen, Xiusi and Jin, Bowen and Wang, Sheng and Ji, Shuiwang and Wang, Wei and Han, Jiawei},
  booktitle={EMNLP'24},
  pages={8783--8817},
  year={2024}
}
