yuzhimanhua/Awesome-Scientific-Language-Models

A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery (EMNLP'24)

License: MIT license

Awesome Scientific Language Models

A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, materials science, biology, medicine, geoscience), covering different model sizes (from 100M to 100B parameters) and modalities (e.g., language, graph, vision, table, molecule, protein, genome, climate time series).

The repository is part of our survey paper, A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery, and will be continuously updated.

NOTE 1: To avoid ambiguity, when we talk about the number of parameters in a model, "Base" refers to 110M (i.e., BERT-Base), and "Large" refers to 340M (i.e., BERT-Large). Other numbers will be written explicitly.
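The size convention in NOTE 1 can be sketched in a few lines of Python. This is only an illustration, not part of the repository: the function name `parse_size` and the two lookup tables are hypothetical, and the parser assumes labels look like the ones used in this list ("Base", "Large", "125M", "1.3B", or "7B, LLaMA-2").

```python
import re

# Sizes fixed by NOTE 1; every other label carries an explicit number.
NAMED_SIZES = {"Base": 110_000_000, "Large": 340_000_000}
UNIT = {"M": 1_000_000, "B": 1_000_000_000}


def parse_size(label):
    """Return an approximate parameter count for a size label, or None.

    Anything after a comma (e.g. the backbone name in '7B, LLaMA-2')
    is ignored.
    """
    head = label.split(",")[0].strip()
    if head in NAMED_SIZES:
        return NAMED_SIZES[head]
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([MB])", head)
    if match is None:
        # Labels such as '8x7B' (mixture-of-experts) are left unparsed
        # rather than guessed at.
        return None
    return round(float(match.group(1)) * UNIT[match.group(2)])
```

Under these assumptions, `parse_size("Base")` and `parse_size("Base, BERT")` both resolve to the BERT-Base count, while an explicit label such as "125M" is read literally.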

NOTE 2: In each subsection, papers are sorted chronologically. If a paper has a preprint version (e.g., on arXiv or bioRxiv), its publication date is taken from the preprint service. Otherwise, its publication date is taken from the conference proceedings or journal.
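The ordering rule in NOTE 2 amounts to a sort key that prefers the preprint date when one exists. The sketch below is a hypothetical illustration: the `Paper` class, the `listing_date` helper, and the three placeholder entries with made-up dates are assumptions, not data from this list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Paper:
    name: str
    published: date                  # conference or journal date
    preprint: Optional[date] = None  # arXiv/bioRxiv date, if any


def listing_date(paper):
    """Date used for ordering: the preprint date if present, else the venue date."""
    return paper.preprint if paper.preprint is not None else paper.published


# Placeholder entries with invented dates, for illustration only.
papers = [
    Paper("A", published=date(2021, 8, 1)),
    Paper("B", published=date(2022, 5, 1), preprint=date(2020, 10, 1)),
    Paper("C", published=date(2020, 11, 1), preprint=date(2020, 4, 1)),
]
ordered = [p.name for p in sorted(papers, key=listing_date)]
```

Here "B" sorts ahead of "A" despite its later venue date, because its preprint is earlier. The same rule would explain entries in the list whose venue year looks out of order.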

NOTE 3: We appreciate contributions. If you have any suggested papers, feel free to reach out to yuzhang@tamu.edu or submit a pull request. For format consistency, we will include a paper after (1) it has a version with author names AND (2) its GitHub and/or Hugging Face links are available.

General

Language

(SciBERT) SciBERT: A Pretrained Language Model for Scientific Text EMNLP 2019
[Paper] [GitHub] [Model (Base)]
(SciGPT2) Explaining Relationships between Scientific Documents ACL 2021
[Paper] [GitHub] [Model (117M)]
(CATTS) TLDR: Extreme Summarization of Scientific Documents EMNLP 2020 Findings
[Paper] [GitHub] [Model (406M)]
(SciNewsBERT) SciClops: Detecting and Contextualizing Scientific Claims for Assisting Manual Fact-Checking CIKM 2021
[Paper] [Model (Base)]
(ScholarBERT) The Diminishing Returns of Masked Language Models to Science ACL 2023 Findings
[Paper] [Model (Large)] [Model (770M)]
(AcademicRoBERTa) A Japanese Masked Language Model for Academic Domain COLING 2022 Workshop
[Paper] [GitHub] [Model (125M)]
(Galactica) Galactica: A Large Language Model for Science arXiv 2022
[Paper] [Model (125M)] [Model (1.3B)] [Model (6.7B)] [Model (30B)] [Model (120B)]
(DARWIN) DARWIN Series: Domain Specific Large Language Models for Natural Science arXiv 2023
[Paper] [GitHub] [Model (7B)]
(FORGE) FORGE: Pre-training Open Foundation Models for Science SC 2023
[Paper] [GitHub] [Model (1.4B, General)] [Model (1.4B, Biology/Medicine)] [Model (1.4B, Chemistry)] [Model (1.4B, Engineering)] [Model (1.4B, Materials Science)] [Model (1.4B, Physics)] [Model (1.4B, Social Science/Art)] [Model (13B, General)] [Model (22B, General)]
(SciGLM) SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models NeurIPS 2024
[Paper] [GitHub] [Model (6B)]
(INDUS) INDUS: Effective and Efficient Language Models for Scientific Applications EMNLP 2024
[Paper] [Model (38M)] [Model (125M)]
(SciDFM) SciDFM: A Large Language Model with Mixture-of-Experts for Science arXiv 2024
[Paper] [Model (18.2B)]

Language + Graph

(SPECTER) SPECTER: Document-level Representation Learning using Citation-informed Transformers ACL 2020
[Paper] [GitHub] [Model (Base)]
(OAG-BERT) OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services KDD 2022
[Paper] [GitHub]
(ASPIRE) Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity NAACL 2022
[Paper] [GitHub] [Model (Base)]
(SciNCL) Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings EMNLP 2022
[Paper] [GitHub] [Model (Base)]
(SPECTER 2.0) SciRepEval: A Multi-Format Benchmark for Scientific Document Representations EMNLP 2023
[Paper] [GitHub] [Model (113M)]
(SciPatton) Patton: Language Model Pretraining on Text-Rich Networks ACL 2023
[Paper] [GitHub]
(SciMult) Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding EMNLP 2023 Findings
[Paper] [GitHub] [Model (138M)]

Mathematics

Language

(GenBERT) Injecting Numerical Reasoning Skills into Language Models ACL 2020
[Paper] [GitHub]
(MathBERT) MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education arXiv 2021
[Paper] [GitHub] [Model (Base)]
(MWP-BERT) MWP-BERT: Numeracy-Augmented Pre-training for Math Word Problem Solving NAACL 2022 Findings
[Paper] [GitHub] [Model (Base)]
(BERT-TD) Seeking Patterns, Not just Memorizing Procedures: Contrastive Learning for Solving Math Word Problems ACL 2022 Findings
[Paper] [GitHub]
(GSM8K-GPT) Training Verifiers to Solve Math Word Problems arXiv 2021
[Paper] [GitHub]
(DeductReasoner) Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction ACL 2022
[Paper] [GitHub] [Model (125M)]
(NaturalProver) NaturalProver: Grounded Mathematical Proof Generation with Language Models NeurIPS 2022
[Paper] [GitHub]
(Minerva) Solving Quantitative Reasoning Problems with Language Models NeurIPS 2022
[Paper]
(Bhaskara) Lila: A Unified Benchmark for Mathematical Reasoning EMNLP 2022
[Paper] [GitHub] [Model (2.7B)]
(WizardMath) WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct arXiv 2023
[Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (70B)]
(MAmmoTH) MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning ICLR 2024
[Paper] [GitHub] [Model (7B, LLaMA-2)] [Model (7B, Mistral)] [Model (13B, LLaMA-2)] [Model (70B, LLaMA-2)]
(MetaMath) MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models ICLR 2024
[Paper] [GitHub] [Model (7B, LLaMA-2)] [Model (7B, Mistral)] [Model (13B, LLaMA-2)] [Model (70B, LLaMA-2)]
(ToRA) ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving ICLR 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (70B)]
(MathCoder) MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning ICLR 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)]
(Llemma) Llemma: An Open Language Model For Mathematics ICLR 2024
[Paper] [GitHub] [Model (7B)] [Model (34B)]
(OVM) OVM, Outcome-Supervised Value Models for Planning in Mathematical Reasoning NAACL 2024 Findings
[Paper] [GitHub] [Model (7B, LLaMA-2)] [Model (7B, Mistral)]
(DeepSeekMath) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models arXiv 2024
[Paper] [GitHub] [Model (7B)]
(InternLM-Math) InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning arXiv 2024
[Paper] [GitHub] [Model (7B)] [Model (20B)]
(OpenMath) OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset NeurIPS 2024
[Paper] [Model (7B, Mistral)] [Model (70B, LLaMA-2)]
(Rho-Math) Rho-1: Not All Tokens Are What You Need NeurIPS 2024
[Paper] [GitHub] [Model (1B)] [Model (7B)]
(MAmmoTH2) MAmmoTH2: Scaling Instructions from the Web NeurIPS 2024
[Paper] [GitHub] [Model (7B, Mistral)] [Model (8B, LLaMA-3)] [Model (8x7B, Mixtral)]
(TheoremLlama) TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts EMNLP 2024
[Paper] [GitHub] [Model (8B)]

Language + Vision

(Inter-GPS) Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning ACL 2021
[Paper] [GitHub]
(Geoformer) UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression EMNLP 2022
[Paper] [GitHub]
(SCA-GPS) A Symbolic Character-Aware Model for Solving Geometry Problems ACM MM 2023
[Paper] [GitHub]
(UniMath-Flan-T5) UniMath: A Foundational and Multimodal Mathematical Reasoner EMNLP 2023
[Paper] [GitHub]
(G-LLaVA) G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model arXiv 2023
[Paper] [GitHub] [Model (7B)] [Model (13B)]

Other Modalities (Table)

(TAPAS) TAPAS: Weakly Supervised Table Parsing via Pre-training ACL 2020
[Paper] [GitHub] [Model (Base)] [Model (Large)]
(TaBERT) TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables ACL 2020
[Paper] [GitHub] [Model (Base)] [Model (Large)]
(GraPPa) GraPPa: Grammar-Augmented Pre-training for Table Semantic Parsing ICLR 2021
[Paper] [GitHub] [Model (355M)]
(TUTA) TUTA: Tree-Based Transformers for Generally Structured Table Pre-training KDD 2021
[Paper] [GitHub]
(RCI) Capturing Row and Column Semantics in Transformer Based Question Answering over Tables NAACL 2021
[Paper] [GitHub] [Model (12M)]
(TABBIE) TABBIE: Pretrained Representations of Tabular Data NAACL 2021
[Paper] [GitHub]
(TAPEX) TAPEX: Table Pre-training via Learning a Neural SQL Executor ICLR 2022
[Paper] [GitHub] [Model (140M)] [Model (406M)]
(FORTAP) FORTAP: Using Formulas for Numerical-Reasoning-Aware Table Pretraining ACL 2022
[Paper] [GitHub]
(OmniTab) OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-Based Question Answering NAACL 2022
[Paper] [GitHub] [Model (406M)]
(ReasTAP) ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples EMNLP 2022
[Paper] [GitHub] [Model (406M)]
(Table-GPT) Table-GPT: Table-tuned GPT for Diverse Table Tasks SIGMOD 2024
[Paper]
(TableLlama) TableLlama: Towards Open Large Generalist Models for Tables NAACL 2024
[Paper] [GitHub] [Model (7B)]
(TableLLM) TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios arXiv 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)]

Physics

Language

(astroBERT) Building astroBERT, a Language Model for Astronomy & Astrophysics arXiv 2021
[Paper] [Model (Base)]
(AstroLLaMA) AstroLLaMA: Towards Specialized Foundation Models in Astronomy AACL 2023 Workshop
[Paper] [Model (7B)]
(AstroLLaMA-Chat) AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets Research Notes of the AAS 2024
[Paper] [Model (7B)]
(PhysBERT) PhysBERT: A Text Embedding Model for Physics Scientific Literature APL Machine Learning 2024
[Paper] [Model (Base)]
(Astro-HEP-BERT) Astro-HEP-BERT: A Bidirectional Language Model for Studying the Meanings of Concepts in Astrophysics and High Energy Physics arXiv 2024
[Paper] [Model (Base)]

Chemistry and Materials Science

Language

(ChemBERT) Automated Chemical Reaction Extraction from Scientific Literature Journal of Chemical Information and Modeling 2022
[Paper] [GitHub] [Model (Base)]
(MatSciBERT) MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction npj Computational Materials 2022
[Paper] [GitHub] [Model (Base)]
(MatBERT) Quantifying the Advantage of Domain-Specific Pre-training on Named Entity Recognition Tasks in Materials Science Patterns 2022
[Paper] [GitHub]
(BatteryBERT) BatteryBERT: A Pretrained Language Model for Battery Database Enhancement Journal of Chemical Information and Modeling 2022
[Paper] [GitHub] [Model (Base)]
(MaterialsBERT) A General-Purpose Material Property Data Extraction Pipeline from Large Polymer Corpora using Natural Language Processing npj Computational Materials 2023
[Paper] [Model (Base)]
(Recycle-BERT) Recycle-BERT: Extracting Knowledge about Plastic Waste Recycling by Natural Language Processing ACS Sustainable Chemistry & Engineering 2023
[Paper] [GitHub]
(CatBERTa) Catalyst Property Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models ACS Catalysis 2023
[Paper] [GitHub]
(LLM-Prop) LLM-Prop: Predicting Physical and Electronic Properties of Crystalline Solids from Their Text Descriptions arXiv 2023
[Paper] [GitHub]
(ChemDFM) ChemDFM: Dialogue Foundation Model for Chemistry arXiv 2024
[Paper] [GitHub] [Model (13B)]
(CrystalLLM) Fine-Tuned Language Models Generate Stable Inorganic Materials as Text ICLR 2024
[Paper] [GitHub]
(ChemLLM) ChemLLM: A Chemical Large Language Model arXiv 2024
[Paper] [Model (7B)]
(LlaSMol) LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset COLM 2024
[Paper] [GitHub] [Model (6.7B, Galactica)] [Model (7B, LLaMA-2)] [Model (7B, Mistral)]
(KALE-LM) KALE-LM: Unleash The Power Of AI For Science Via Knowledge And Logic Enhanced Large Model arXiv 2024
[Paper] [Model (8B)]

Language + Graph

(Text2Mol) Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries EMNLP 2021
[Paper] [GitHub]
(KV-PLM) A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals Nature Communications 2022
[Paper] [GitHub] [Model (Base)]
(MolT5) Translation between Molecules and Natural Language EMNLP 2022
[Paper] [GitHub] [Model (60M)] [Model (220M)] [Model (770M)]
(MoMu) A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language arXiv 2022
[Paper] [GitHub]
(MoleculeSTM) Multi-modal Molecule Structure-text Model for Text-Based Retrieval and Editing Nature Machine Intelligence 2023
[Paper] [GitHub]
(Text+Chem T5) Unifying Molecular and Textual Representations via Multi-task Language Modelling ICML 2023
[Paper] [GitHub] [Model (60M)] [Model (220M)]
(GIMLET) GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning NeurIPS 2023
[Paper] [GitHub] [Model (60M)]
(MolFM) MolFM: A Multimodal Molecular Foundation Model arXiv 2023
[Paper] [GitHub]
(MolCA) MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter EMNLP 2023
[Paper] [GitHub]
(MolLM) MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations Bioinformatics 2024
[Paper] [GitHub]
(InstructMol) InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery COLING 2025
[Paper] [GitHub]
(3D-MoLM) Towards 3D Molecule-Text Interpretation in Language Models ICLR 2024
[Paper] [GitHub]

Language + Vision

(GIT-Mol) GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text Computers in Biology and Medicine 2024
[Paper] [GitHub]

Other Modalities (Molecule)

(SMILES-BERT) SMILES-BERT: Large Scale Unsupervised Pre-training for Molecular Property Prediction ACM BCB 2019
[Paper] [GitHub]
(MAT) Molecule Attention Transformer arXiv 2020
[Paper] [GitHub]
(ChemBERTa) ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction arXiv 2020
[Paper] [GitHub] [Model (125M)]
(MolBERT) Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks arXiv 2020
[Paper] [GitHub] [Model (Base)]
(rxnfp) Mapping the Space of Chemical Reactions using Attention-Based Neural Networks Nature Machine Intelligence 2021
[Paper] [GitHub] [Model (Base)]
(RXNMapper) Extraction of Organic Chemistry Grammar from Unsupervised Learning of Chemical Reactions Science Advances 2021
[Paper] [GitHub]
(MoLFormer) Large-Scale Chemical Language Representations Capture Molecular Structure and Properties Nature Machine Intelligence 2022
[Paper] [GitHub] [Model (47M)]
(Chemformer) Chemformer: A Pre-trained Transformer for Computational Chemistry Machine Learning: Science and Technology 2022
[Paper] [GitHub] [Model (45M)] [Model (230M)]
(R-MAT) Relative Molecule Self-Attention Transformer Journal of Cheminformatics 2024
[Paper] [GitHub]
(MolGPT) MolGPT: Molecular Generation using a Transformer-Decoder Model Journal of Chemical Information and Modeling 2022
[Paper] [GitHub]
(T5Chem) Unified Deep Learning Model for Multitask Reaction Predictions with Explanation Journal of Chemical Information and Modeling 2022
[Paper] [GitHub]
(ChemGPT) Neural Scaling of Deep Chemical Models Nature Machine Intelligence 2023
[Paper] [Model (4.7M)] [Model (19M)] [Model (1.2B)]
(Uni-Mol) Uni-Mol: A Universal 3D Molecular Representation Learning Framework ICLR 2023
[Paper] [GitHub]
(TransPolymer) TransPolymer: A Transformer-Based Language Model for Polymer Property Predictions npj Computational Materials 2023
[Paper] [GitHub]
(polyBERT) polyBERT: A Chemical Language Model to Enable Fully Machine-Driven Ultrafast Polymer Informatics Nature Communications 2023
[Paper] [GitHub] [Model (86M)]
(MFBERT) Large-Scale Distributed Training of Transformers for Chemical Fingerprinting Journal of Chemical Information and Modeling 2022
[Paper] [GitHub]
(SPMM) Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model Nature Communications 2024
[Paper] [GitHub]
(BARTSmiles) BARTSmiles: Generative Masked Language Models for Molecular Representations Journal of Chemical Information and Modeling 2024
[Paper] [GitHub] [Model (406M)]
(MolGen) Domain-Agnostic Molecular Generation with Self-feedback ICLR 2024
[Paper] [GitHub] [Model (406M, BART)] [Model (7B, LLaMA)]
(SELFormer) SELFormer: Molecular Representation Learning via SELFIES Language Models Machine Learning: Science and Technology 2023
[Paper] [GitHub] [Model (58M)] [Model (87M)]
(PolyNC) PolyNC: A Natural and Chemical Language Model for the Prediction of Unified Polymer Properties Chemical Science 2024
[Paper] [GitHub] [Model (220M)]

Biology and Medicine

Acknowledgment: We referred to Wang et al.'s survey paper, Pre-trained Language Models in Biomedical Domain: A Systematic Survey, and He et al.'s survey paper, Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions, when writing some parts of this section.

Language

(BioBERT) BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining Bioinformatics 2020
[Paper] [GitHub] [Model (Base)] [Model (Large)]
(BioELMo) Probing Biomedical Embeddings from Language Models NAACL 2019 Workshop
[Paper] [GitHub] [Model (93M)]
(ClinicalBERT, Alsentzer et al.) Publicly Available Clinical BERT Embeddings NAACL 2019 Workshop
[Paper] [GitHub] [Model (Base)]
(ClinicalBERT, Huang et al.) ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission arXiv 2019
[Paper] [GitHub] [Model (Base)]
(BlueBERT, f.k.a. NCBI-BERT) Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets ACL 2019 Workshop
[Paper] [GitHub] [Model (Base)] [Model (Large)]
(BEHRT) BEHRT: Transformer for Electronic Health Records Scientific Reports 2020
[Paper] [GitHub]
(EhrBERT) Fine-Tuning Bidirectional Encoder Representations from Transformers (BERT)–Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study JMIR Medical Informatics 2019
[Paper] [GitHub]
(Clinical XLNet) Clinical XLNet: Modeling Sequential Clinical Notes and Predicting Prolonged Mechanical Ventilation EMNLP 2020 Workshop
[Paper] [GitHub]
(ouBioBERT) Pre-training Technique to Localize Medical BERT and Enhance Biomedical BERT arXiv 2020
[Paper] [GitHub] [Model (Base)]
(COVID-Twitter-BERT) COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter Frontiers in Artificial Intelligence 2023
[Paper] [GitHub] [Model (Large)]
(Med-BERT) Med-BERT: Pretrained Contextualized Embeddings on Large-Scale Structured Electronic Health Records for Disease Prediction npj Digital Medicine 2021
[Paper] [GitHub]
(Bio-ELECTRA) On the Effectiveness of Small, Discriminatively Pre-trained Language Representation Models for Biomedical Text Mining EMNLP 2020 Workshop
[Paper] [GitHub] [Model (Base)]
(BiomedBERT, f.k.a. PubMedBERT) Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing ACM Transactions on Computing for Healthcare 2021
[Paper] [Model (Base)] [Model (Large)]
(MCBERT) Conceptualized Representation Learning for Chinese Biomedical Text Mining arXiv 2020
[Paper] [GitHub] [Model (Base)]
(BRLTM) Bidirectional Representation Learning from Transformers using Multimodal Electronic Health Record Data to Predict Depression JBHI 2021
[Paper] [GitHub]
(BioRedditBERT) COMETA: A Corpus for Medical Entity Linking in the Social Media EMNLP 2020
[Paper] [GitHub] [Model (Base)]
(BioMegatron) BioMegatron: Larger Biomedical Domain Language Model EMNLP 2020
[Paper] [GitHub] [Model (345M)]
(SapBERT) Self-Alignment Pretraining for Biomedical Entity Representations NAACL 2021
[Paper] [GitHub] [Model (Base)]
(ClinicalTransformer) Clinical Concept Extraction using Transformers JAMIA 2020
[Paper] [GitHub] [Model (Base, BERT)] [Model (125M, RoBERTa)] [Model (12M, ALBERT)] [Model (Base, ELECTRA)] [Model (Base, XLNet)] [Model (149M, Longformer)] [Model (86M, DeBERTa)]
(BioRoBERTa) Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art EMNLP 2020 Workshop
[Paper] [GitHub] [Model (125M)] [Model (355M)]
(RAD-BERT) Highly Accurate Classification of Chest Radiographic Reports using a Deep Learning Natural Language Model Pre-trained on 3.8 Million Text Reports Bioinformatics 2020
[Paper] [GitHub]
(BioMedBERT) BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR COLING 2020
[Paper] [GitHub]
(LBERT) LBERT: Lexically Aware Transformer-Based Bidirectional Encoder Representation Model for Learning Universal Bio-Entity Relations Bioinformatics 2021
[Paper] [GitHub]
(ELECTRAMed) ELECTRAMed: A New Pre-trained Language Representation Model for Biomedical NLP arXiv 2021
[Paper] [GitHub] [Model (Base)]
(KeBioLM) Improving Biomedical Pretrained Language Models with Knowledge NAACL 2021 Workshop
[Paper] [GitHub]
(SciFive) SciFive: A Text-to-Text Transformer Model for Biomedical Literature arXiv 2021
[Paper] [GitHub] [Model (220M)] [Model (770M)]
(BioALBERT) Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT BMC Bioinformatics 2022
[Paper] [GitHub] [Model (12M)] [Model (18M)]
(Clinical-Longformer) Clinical-Longformer and Clinical-BigBird: Transformers for Long Clinical Sequences arXiv 2022
[Paper] [GitHub] [Model (149M, Longformer)] [Model (Base, BigBird)]
(BioBART) BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model ACL 2022 Workshop
[Paper] [GitHub] [Model (140M)] [Model (406M)]
(BioGPT) BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining Briefings in Bioinformatics 2022
[Paper] [GitHub] [Model (355M)] [Model (1.5B)]
(Med-PaLM) Large Language Models Encode Clinical Knowledge Nature 2023
[Paper]
(GatorTron) A Large Language Model for Electronic Health Records npj Digital Medicine 2022
[Paper] [GitHub] [Model (345M)] [Model (3.9B)] [Model (8.9B)]
(ChatDoctor) ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) using Medical Domain Knowledge Cureus 2023
[Paper] [GitHub]
(DoctorGLM) DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task arXiv 2023
[Paper] [GitHub]
(BenTsao, f.k.a. HuaTuo) HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge arXiv 2023
[Paper] [GitHub]
(MedAlpaca) MedAlpaca - An Open-Source Collection of Medical Conversational AI Models and Training Data arXiv 2023
[Paper] [GitHub] [Model (7B)] [Model (13B)]
(PMC-LLaMA) PMC-LLaMA: Towards Building Open-source Language Models for Medicine JAMIA 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)]
(Med-PaLM 2) Toward Expert-Level Medical Question Answering with Large Language Models Nature Medicine 2025
[Paper]
(HuatuoGPT) HuatuoGPT, towards Taming Language Model to Be a Doctor EMNLP 2023 Findings
[Paper] [GitHub] [Model (7B)] [Model (13B)]
(MedCPT) MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval Bioinformatics 2023
[Paper] [GitHub] [Model (Base)]
(Zhongjing) Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue AAAI 2024
[Paper] [GitHub] [Model (13B)]
(DISC-MedLLM) DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation arXiv 2023
[Paper] [GitHub] [Model (13B)]
(DRG-LLaMA) DRG-LLaMA: Tuning LLaMA Model to Predict Diagnosis-related Group for Hospitalized Patients npj Digital Medicine 2024
[Paper] [GitHub]
(Qilin-Med) Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model arXiv 2023
[Paper] [GitHub]
(AlpaCare) AlpaCare: Instruction-tuned Large Language Models for Medical Application arXiv 2023
[Paper] [GitHub] [Model (7B, LLaMA)] [Model (7B, LLaMA-2)] [Model (13B, LLaMA)] [Model (13B, LLaMA-2)]
(BianQue) BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT arXiv 2023
[Paper] [GitHub] [Model (6B)]
(HuatuoGPT-II) HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs COLM 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (34B)]
(Taiyi) Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks JAMIA 2024
[Paper] [GitHub] [Model (7B)]
(MEDITRON) MEDITRON-70B: Scaling Medical Pretraining for Large Language Models arXiv 2023
[Paper] [GitHub] [Model (7B)] [Model (70B)]
(PLLaMa) PLLaMa: An Open-source Large Language Model for Plant Science arXiv 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)]
(BioMistral) BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains ACL 2024 Findings
[Paper] [Model (7B)]
(Me-LLaMA) Me-LLaMA: Foundation Large Language Models for Medical Applications arXiv 2024
[Paper] [GitHub]
(BiMediX) BiMediX: Bilingual Medical Mixture of Experts LLM EMNLP 2024 Findings
[Paper] [GitHub] [Model (8x7B)]
(MMedLM) Towards Building Multilingual Language Model for Medicine Nature Communications 2024
[Paper] [GitHub] [Model (7B, InternLM)] [Model (1.8B, InternLM2)] [Model (7B, InternLM2)] [Model (8B, LLaMA-3)]
(BioMedLM, f.k.a. PubMedGPT) BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text arXiv 2024
[Paper] [GitHub] [Model (2.7B)]
(Hippocrates) Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare arXiv 2024
[Paper] [Model (7B, LLaMA-2)] [Model (7B, Mistral)]
(BMRetriever) BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers EMNLP 2024
[Paper] [GitHub] [Model (410M, Pythia)] [Model (1B, Pythia)] [Model (2B, Gemma)] [Model (7B, Mistral)]
(UltraMedical) UltraMedical: Building Specialized Generalists in Biomedicine NeurIPS 2024
[Paper] [GitHub] [Model (8B, LLaMA-3)] [Model (70B, LLaMA-3)] [Model (8B, LLaMA-3.1)]
(Panacea) Panacea: A Foundation Model for Clinical Trial Search, Summarization, Design, and Recruitment arXiv 2024
[Paper] [GitHub] [Model (7B)]
(HuatuoGPT-o1) HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs arXiv 2024
[Paper] [GitHub] [Model (8B, LLaMA-3.1)] [Model (70B, LLaMA-3.1)] [Model (7B, Qwen2.5)] [Model (72B, Qwen2.5)]

Language + Graph

(G-BERT) Pre-training of Graph Augmented Transformers for Medication Recommendation IJCAI 2019
[Paper] [GitHub]
(CODER) CODER: Knowledge Infused Cross-Lingual Medical Term Embedding for Term Normalization JBI 2022
[Paper] [GitHub] [Model (Base)]
(MoP) Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT EMNLP 2021
[Paper] [GitHub]
(BioLinkBERT) LinkBERT: Pretraining Language Models with Document Links ACL 2022
[Paper] [GitHub] [Model (Base)] [Model (Large)]
(DRAGON) Deep Bidirectional Language-Knowledge Graph Pretraining NeurIPS 2022
[Paper] [GitHub] [Model (360M)]

Language + Vision

(ConVIRT) Contrastive Learning of Medical Visual Representations from Paired Images and Text MLHC 2022
[Paper] [GitHub]
(MMBERT) MMBERT: Multimodal BERT Pretraining for Improved Medical VQA ISBI 2021
[Paper] [GitHub]
(MedViLL) Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-training JBHI 2022
[Paper] [GitHub]
(GLoRIA) GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition ICCV 2021
[Paper] [GitHub]
(LoVT) Joint Learning of Localized Representations from Medical Images and Reports ECCV 2022
[Paper] [GitHub]
(BioViL) Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing ECCV 2022
[Paper] [GitHub]
(M3AE) Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-training MICCAI 2022
[Paper] [GitHub] [Model]
(ARL) Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge ACM MM 2022
[Paper] [GitHub]
(CheXzero) Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning Nature Biomedical Engineering 2022
[Paper] [GitHub] [Model]
(MGCA) Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning NeurIPS 2022
[Paper] [GitHub] [Model]
(MedCLIP) MedCLIP: Contrastive Learning from Unpaired Medical Images and Text EMNLP 2022
[Paper] [GitHub]
(BioViL-T) Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing CVPR 2023
[Paper] [GitHub] [Model]
(BiomedCLIP) BiomedCLIP: A Multimodal Biomedical Foundation Model Pretrained from Fifteen Million Scientific Image-Text Pairs NEJM AI 2024
[Paper] [Model]
(PMC-CLIP) PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents MICCAI 2023
[Paper] [GitHub] [Model]
(Xplainer) Xplainer: From X-Ray Observations to Explainable Zero-Shot Diagnosis MICCAI 2023
[Paper] [GitHub]
(RGRG) Interactive and Explainable Region-Guided Radiology Report Generation CVPR 2023
[Paper] [GitHub] [Model]
(BiomedGPT) A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks Nature Medicine 2024
[Paper] [GitHub] [Model (33M)] [Model (93M)] [Model (182M)]
(Med-UniC) Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias NeurIPS 2023
[Paper] [GitHub]
(LLaVA-Med) LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day NeurIPS 2023
[Paper] [GitHub] [Model (7B)]
(MI-Zero) Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images CVPR 2023
[Paper] [GitHub] [Model]
(XrayGPT) XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models ACL 2024 Workshop
[Paper] [GitHub]
(MONET) Transparent Medical Image AI via an Image–Text Foundation Model Grounded in Medical Literature Nature Medicine 2024
[Paper] [GitHub]
(QuiltNet) Quilt-1M: One Million Image-Text Pairs for Histopathology NeurIPS 2023
[Paper] [GitHub] [Model]
(MUMC) Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering MICCAI 2023
[Paper] [GitHub]
(M-FLAG) M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization MICCAI 2023
[Paper] [GitHub]
(PRIOR) PRIOR: Prototype Representation Joint Learning from Medical Images and Reports ICCV 2023
[Paper] [GitHub]
(Med-PaLM M) Towards Generalist Biomedical AI NEJM AI 2024
[Paper] [GitHub]
(CITE) Text-Guided Foundation Model Adaptation for Pathological Image Classification MICCAI 2023
[Paper] [GitHub]
(Med-Flamingo) Med-Flamingo: A Multimodal Medical Few-shot Learner ML4H 2023
[Paper] [GitHub]
(RadFM) Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data arXiv 2023
[Paper] [GitHub] [Model]
(PLIP) A Visual–Language Foundation Model for Pathology Image Analysis using Medical Twitter Nature Medicine 2023
[Paper] [GitHub] [Model]
(MaCo) Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning Nature Communications 2024
[Paper] [GitHub]
(CXR-CLIP) CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training MICCAI 2023
[Paper] [GitHub]
(Qilin-Med-VL) Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare arXiv 2023
[Paper] [GitHub] [Model]
(BioCLIP) BioCLIP: A Vision Foundation Model for the Tree of Life CVPR 2024
[Paper] [GitHub] [Model]
(M3D) M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models arXiv 2024
[Paper] [GitHub] [Model]
(Med-Gemini) Capabilities of Gemini Models in Medicine arXiv 2024
[Paper]
(Med-Gemini-2D/3D/Polygenic) Advancing Multimodal Medical Capabilities of Gemini arXiv 2024
[Paper]
(Mammo-CLIP) Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography MICCAI 2024
[Paper] [GitHub] [Model]
(BiomedParse) A Foundation Model for Joint Segmentation, Detection and Recognition of Biomedical Objects across Nine Modalities Nature Methods 2025
[Paper] [GitHub] [Model]
(HuatuoGPT-Vision) Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale EMNLP 2024
[Paper] [GitHub] [Model (7B)] [Model (34B)]

Other Modalities (Protein)

(ProtTrans) ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. TPAMI 2021
[Paper] [GitHub] [Model (420M, BERT)] [Model (224M, ALBERT)] [Model (409M, XLNet)] [Model (420M, ELECTRA)] [Model (3B, T5)] [Model (11B, T5)]
(ESM-1b) Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. PNAS 2021
[Paper] [GitHub] [Model (650M)]
(MSA Transformer) MSA Transformer. ICML 2021
[Paper] [GitHub]
(ESM-1v) Language Models Enable Zero-Shot Prediction of the Effects of Mutations on Protein Function. NeurIPS 2021
[Paper] [GitHub] [Model (650M)]
(AminoBERT) Single-Sequence Protein Structure Prediction using a Language Model and Deep Learning. Nature Biotechnology 2022
[Paper] [GitHub]
(ProteinBERT) ProteinBERT: A Universal Deep-Learning Model of Protein Sequence and Function. Bioinformatics 2022
[Paper] [GitHub] [Model (16M)]
(ProtGPT2) ProtGPT2 is a Deep Unsupervised Language Model for Protein Design. Nature Communications 2022
[Paper] [Model (738M)]
(ESM-IF1) Learning Inverse Folding from Millions of Predicted Structures. ICML 2022
[Paper] [GitHub] [Model (142M)]
(ProGen) Large Language Models Generate Functional Protein Sequences across Diverse Families. Nature Biotechnology 2023
[Paper] [GitHub] [Model (1.6B)]
(ProGen2) ProGen2: Exploring the Boundaries of Protein Language Models. Cell Systems 2023
[Paper] [GitHub] [Model (151M)] [Model (764M)] [Model (2.7B)] [Model (6.4B)]
(ESM-2) Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science 2023
[Paper] [GitHub] [Model (8M)] [Model (35M)] [Model (150M)] [Model (650M)] [Model (3B)] [Model (15B)]
(Ankh) Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling. arXiv 2023
[Paper] [GitHub] [Model (450M)] [Model (1.1B)]
(ProtST) ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts. ICML 2023
[Paper] [GitHub]
(LM-Design) Structure-informed Language Models Are Protein Designers. ICML 2023
[Paper] [GitHub] [Model (659M)]
(ProteinDT) A Text-Guided Protein Design Framework. arXiv 2023
[Paper] [GitHub]
(gLM) Genomic Language Model Predicts Protein Co-Regulation and Function. Nature Communications 2024
[Paper] [GitHub] [Model (1B)]
(Prot2Text) Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers. AAAI 2024
[Paper] [GitHub] [Model (256M)] [Model (283M)] [Model (398M)] [Model (898M)]
(BioMedGPT) BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine. arXiv 2023
[Paper] [GitHub] [Model (10B)]
(SaProt) SaProt: Protein Language Modeling with Structure-Aware Vocabulary. ICLR 2024
[Paper] [GitHub] [Model (35M)] [Model (650M)]
(BioT5) BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations. EMNLP 2023
[Paper] [GitHub] [Model (220M)]
(xTrimoPGLM) xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein. arXiv 2024
[Paper] [GitHub] [Model (1B)] [Model (3B)] [Model (10B)] [Model (100B)]
(ProLLaMA) ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing. arXiv 2024
[Paper] [GitHub] [Model (7B)]
(ProteinCLIP) ProteinCLIP: Enhancing Protein Language Models with Natural Language. bioRxiv 2024
[Paper] [GitHub]
(ESM-3) Simulating 500 Million Years of Evolution with a Language Model. Science 2025
[Paper] [GitHub] [Model (98B)]

Other Modalities (DNA)

(DNABERT) DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-Language in Genome. Bioinformatics 2021
[Paper] [GitHub] [Model (Base)]
(GenSLMs) GenSLMs: Genome-Scale Language Models Reveal SARS-CoV-2 Evolutionary Dynamics. The International Journal of High Performance Computing Applications 2023
[Paper] [GitHub]
(Nucleotide Transformer) Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. Nature Methods 2024
[Paper] [GitHub] [Model (50M)] [Model (100M)] [Model (250M)] [Model (500M)]
(SpeciesLM) Species-Aware DNA Language Models Capture Regulatory Elements and Their Evolution. Genome Biology 2024
[Paper] [GitHub] [Model (89M)]
(GENA-LM) GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences. Nucleic Acids Research 2025
[Paper] [GitHub] [Model (Base, BERT)] [Model (Large, BERT)] [Model (Base, BigBird)]
(DNABERT-2) DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome. ICLR 2024
[Paper] [GitHub] [Model (Base)]
(HyenaDNA) HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. NeurIPS 2023
[Paper] [GitHub] [Model (0.4M)] [Model (3.3M)] [Model (6.6M)]
(DNAGPT) DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks. arXiv 2023
[Paper] [GitHub] [Model (0.1B)] [Model (3B)]
(GPN-MSA) GPN-MSA: An Alignment-Based DNA Language Model for Genome-Wide Variant Effect Prediction. Nature Biotechnology 2025
[Paper] [GitHub] [Model (86M)]
(ENBED) Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision. Bioinformatics Advances 2024
[Paper] [GitHub]
(LucaOne) LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language. bioRxiv 2024
[Paper] [GitHub]
(AIDO.DNA) Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale. bioRxiv 2024
[Paper] [GitHub] [Model (7B)]

Other Modalities (RNA)

(RNABERT) Informative RNA-base Embedding for Functional RNA Structural Alignment and Clustering by Deep Representation Learning. NAR Genomics and Bioinformatics 2022
[Paper] [GitHub]
(RNA-FM) Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions. arXiv 2022
[Paper] [GitHub]
(SpliceBERT) Self-Supervised Learning on Millions of Primary RNA Sequences from 72 Vertebrates Improves Sequence-Based RNA Splicing Prediction. Briefings in Bioinformatics 2024
[Paper] [GitHub] [Model (19.4M)]
(RNA-MSM) Multiple Sequence-Alignment-Based RNA Language Model and its Application to Structural Inference. Nucleic Acids Research 2024
[Paper] [GitHub]
(CodonBERT) CodonBERT Large Language Model for mRNA Vaccines. Genome Research 2024
[Paper] [GitHub]
(UTR-LM) A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions. Nature Machine Intelligence 2024
[Paper] [GitHub]
(GenerRNA) GenerRNA: A Generative Pre-trained Language Model for de novo RNA Design. PLoS One 2024
[Paper] [Model (350M)]
(RNAErnie) Multi-Purpose RNA Language Modelling with Motif-Aware Pretraining and Type-Guided Fine-Tuning. Nature Machine Intelligence 2024
[Paper] [GitHub] [Model (105M)]
(RNA-TorsionBERT) RNA-TorsionBERT: Leveraging Language Models for RNA 3D Torsion Angles Prediction. Bioinformatics 2025
[Paper] [GitHub]
(PlantRNA-FM) An Interpretable RNA Foundation Model for Exploring Functional RNA Motifs in Plants. Nature Machine Intelligence 2024
[Paper] [GitHub] [Model (35M)]
(AIDO.RNA) A Large-Scale Foundation Model for RNA Function and Structure Prediction. bioRxiv 2024
[Paper] [GitHub] [Model (1.6B)]

Other Modalities (Multiomics)

(scBERT) scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data. Nature Machine Intelligence 2022
[Paper] [GitHub]
(scGPT) scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics using Generative AI. Nature Methods 2024
[Paper] [GitHub]
(scFoundation) Large Scale Foundation Model on Single-cell Transcriptomics. Nature Methods 2024
[Paper] [GitHub] [Model (100M)]
(Geneformer) Transfer Learning Enables Predictions in Network Biology. Nature 2023
[Paper] [Model (10M)] [Model (40M)]
(CellLM) Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning. arXiv 2023
[Paper] [GitHub]
(CellPLM) CellPLM: Pre-training of Cell Language Model Beyond Single Cells. ICLR 2024
[Paper] [GitHub] [Model (82M)]
(scMulan) scMulan: A Multitask Generative Pre-trained Language Model for Single-Cell Analysis. RECOMB 2024
[Paper] [GitHub]

Geography, Geology, and Environmental Science

Language

(ClimateBERT) ClimateBERT: A Pretrained Language Model for Climate-Related Text. arXiv 2021
[Paper] [GitHub] [Model (82M)]
(SpaBERT) SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation. EMNLP 2022 Findings
[Paper] [GitHub] [Model (Base)] [Model (Large)]
(MGeo) MGeo: Multi-Modal Geographic Pre-training Method. SIGIR 2023
[Paper] [GitHub]
(K2) K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. WSDM 2024
[Paper] [GitHub] [Model (7B)]
(OceanGPT) OceanGPT: A Large Language Model for Ocean Science Tasks. ACL 2024
[Paper] [GitHub] [Model (7B)]
(ClimateBERT-NetZero) ClimateBERT-NetZero: Detecting and Assessing Net Zero and Reduction Targets. EMNLP 2023
[Paper] [Model (82M)]
(GeoLM) GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding. EMNLP 2023
[Paper] [GitHub]
(GeoGalactica) GeoGalactica: A Scientific Large Language Model in Geoscience. arXiv 2024
[Paper] [GitHub] [Model (30B)]
(UrbanKGent) UrbanKGent: A Unified Large Language Model Agent Framework for Urban Knowledge Graph Construction. NeurIPS 2024
[Paper] [GitHub] [Model (7B, LLaMA-2)] [Model (13B, LLaMA-2)] [Model (8B, LLaMA-3)]
(UrbanGPT) UrbanGPT: Spatio-Temporal Large Language Models. KDD 2024
[Paper] [GitHub] [Model (7B)]
(JiuZhou) JiuZhou: Open Foundation Language Models and Effective Pre-training Framework for Geoscience. International Journal of Digital Earth 2025
[Paper] [GitHub] [Model (7B)]

Language + Graph

(ERNIE-GeoL) ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps. KDD 2022
[Paper]
(PK-Chat) PK-Chat: Pointer Network Guided Knowledge Driven Generative Dialogue Model. arXiv 2023
[Paper] [GitHub]

Language + Vision

(GeoCLIP) GeoCLIP: CLIP-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization. NeurIPS 2023
[Paper] [GitHub]
(UrbanCLIP) UrbanCLIP: Learning Text-Enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web. WWW 2024
[Paper] [GitHub]

Other Modalities (Climate Time Series)

(FourCastNet) FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators. arXiv 2022
[Paper] [GitHub]
(Pangu-Weather) Accurate Medium-Range Global Weather Forecasting with 3D Neural Networks. Nature 2023
[Paper] [GitHub]
(GraphCast) Learning Skillful Medium-Range Global Weather Forecasting. Science 2023
[Paper] [GitHub]
(ClimaX) ClimaX: A Foundation Model for Weather and Climate. ICML 2023
[Paper] [GitHub]
(FengWu) FengWu: Pushing the Skillful Global Medium-Range Weather Forecast beyond 10 Days Lead. arXiv 2023
[Paper] [GitHub]
(W-MAE) W-MAE: Pre-trained Weather Model with Masked Autoencoder for Multi-Variable Weather Forecasting. arXiv 2023
[Paper] [GitHub]
(FuXi) FuXi: A Cascade Machine Learning Forecasting System for 15-day Global Weather Forecast. npj Climate and Atmospheric Science 2023
[Paper] [GitHub]
(Stormer) Scaling Transformer Neural Networks for Skillful and Reliable Medium-Range Weather Forecasting. NeurIPS 2024
[Paper] [GitHub]
(Aurora) A Foundation Model for the Earth System. arXiv 2024
[Paper] [GitHub]
(Prithvi WxC) Prithvi WxC: Foundation Model for Weather and Climate. arXiv 2024
[Paper] [GitHub] [Model (2.3B)]

Citation

If you find this repository useful, please cite the following paper:

@inproceedings{zhang2024comprehensive,
  title={A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery},
  author={Zhang, Yu and Chen, Xiusi and Jin, Bowen and Wang, Sheng and Ji, Shuiwang and Wang, Wei and Han, Jiawei},
  booktitle={EMNLP'24},
  pages={8783--8817},
  year={2024}
}

Paper: arxiv.org/abs/2406.10833
