Welcome!

This is the repository where you can find ModernBERT, our experiments to bring BERT into modernity via both architecture changes and scaling.

This repository notably introduces FlexBERT, our modular approach to encoder building blocks, and relies heavily on .yaml configuration files to build models. The codebase builds upon MosaicBERT, and specifically the unmerged fork bringing Flash Attention 2 to it, under the terms of its Apache 2.0 license. We extend our thanks to MosaicML for starting the work on modernising encoders!

This README is very barebones and is still under construction. It will improve with more reproducibility and documentation in the new year, as we gear up for more encoder niceties after the pre-holidays release of ModernBERT. For now, we're mostly looking forward to seeing what people build with the 🤗 model checkpoints.

For more details on what this repository brings, we recommend reading our release blog post for a high-level overview, and our arXiv preprint for more technical details.

All code in this repository is the code used in our experiments for both pre-training and GLUE evaluations; there's no uncommitted secret training sauce.

This is the research repository for ModernBERT, focused on pre-training and evaluations. If you're seeking the HuggingFace version, designed to integrate with any common pipeline, please head to the ModernBERT Collection on HuggingFace.

ModernBERT is a collaboration between Answer.AI, LightOn, and friends.

Setup

We have fully documented the environment used to train ModernBERT, which can be installed on a GPU-equipped machine with the following commands:

conda env create -f environment.yaml
# if the conda environment errors out set channel priority to flexible:
# conda config --set channel_priority flexible
conda activate bert24
# if using H100s clone and build flash attention 3
# git clone https://github.com/Dao-AILab/flash-attention.git
# cd flash-attention/hopper
# python setup.py install
# install flash attention 2 (model uses FA3+FA2 or just FA2 if FA3 isn't supported)
pip install "flash_attn==2.6.3" --no-build-isolation
# or download a precompiled wheel from https://github.com/Dao-AILab/flash-attention/releases/tag/v2.6.3
# or limit the number of parallel compilation jobs
# MAX_JOBS=8 pip install "flash_attn==2.6.3" --no-build-isolation
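Since the model uses FA3+FA2 or falls back to FA2 alone, it can help to check which Flash Attention packages are importable after setup. The snippet below is a minimal sketch and not part of this repository: the helper name `detect_flash_attention` is hypothetical, and `flash_attn_interface` is assumed to be the module name produced by the Flash Attention 3 (hopper) build.

```python
import importlib.util


def detect_flash_attention() -> dict[str, bool]:
    """Report which Flash Attention packages are importable in this environment."""
    # "flash_attn" is the Flash Attention 2 package installed via pip.
    # "flash_attn_interface" is an ASSUMED module name for the Flash Attention 3
    # (hopper) build; adjust it if your build exposes a different name.
    candidates = {"fa2": "flash_attn", "fa3": "flash_attn_interface"}
    return {
        key: importlib.util.find_spec(module_name) is not None
        for key, module_name in candidates.items()
    }


if __name__ == "__main__":
    for key, found in detect_flash_attention().items():
        print(f"{key}: {'available' if found else 'not found'}")
```

The check only looks up whether the modules can be found; it does not verify that the compiled kernels match your GPU or CUDA version.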

Training

Training heavily leverages the composer framework. All training runs are configured via YAML files, of which you can find examples in the yamls folder. We highly encourage you to check out one of the example yamls, such as yamls/main/flex-bert-rope-base.yaml, to explore the configuration options.

Launch command example

To run a training job using yamls/main/modernbert-base.yaml on all available GPUs, use the following command.

composer main.py yamls/main/modernbert-base.yaml
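The same launch command can be assembled from Python, for instance when scripting several runs over different configs. This is a minimal sketch rather than a script shipped with the repository: `build_launch_command` and `launch` are hypothetical helpers, and they assume the `composer` executable is on your PATH and that you run from the repository root.

```python
import shlex
import subprocess


def build_launch_command(config_path: str) -> list[str]:
    """Return the argv list equivalent to `composer main.py <config_path>`."""
    return ["composer", "main.py", config_path]


def launch(config_path: str, dry_run: bool = True) -> list[str]:
    """Print the launch command and, unless dry_run is set, execute it."""
    command = build_launch_command(config_path)
    print(shlex.join(command))
    if not dry_run:
        # check=True raises CalledProcessError if the training job exits with an error.
        subprocess.run(command, check=True)
    return command
```

Calling `launch("yamls/main/modernbert-base.yaml")` only prints the command; pass `dry_run=False` to actually start training.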

Data

There are two dataset classes to choose from:

StreamingTextDataset

Inherits from StreamingDataset
Uses MDS, CSV/TSV or JSONL format
Supports both text and tokenized data
Can be used with local data as well
WARNING: we found distribution of memory over accelerators to be uneven

NoStreamingDataset

Requires decompressed MDS-format data; compressed MDS data can be decompressed using src/data/mds_conversion.py with the --decompress flag.
Supports both text and tokenized data

When data is accessed locally, we recommend using NoStreamingDataset, as it enabled higher training throughput in our setting. Both classes are located in src/text_data.py, and the class to be used can be set for each data_loader and dataset by setting streaming: true (StreamingTextDataset) or false (NoStreamingDataset).

train_loader:
  name: text
  dataset:
    streaming: false
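To make the mapping between the `streaming` flag and the dataset class concrete, here is a minimal sketch working on an already-parsed config. It is illustrative only and does not reproduce the repository's actual loader code: the helper `dataset_class_name` is hypothetical, and the `eval_loader` entry is an assumed second loader used to show that each loader can be set independently.

```python
def dataset_class_name(loader_cfg: dict) -> str:
    """Map a loader config's `streaming` flag to the dataset class described above."""
    # The flag is read explicitly; no default is assumed when the key is missing,
    # so a config without it raises KeyError.
    streaming = loader_cfg["dataset"]["streaming"]
    return "StreamingTextDataset" if streaming else "NoStreamingDataset"


# A parsed config equivalent to the YAML fragment above, plus an ASSUMED
# `eval_loader` entry for illustration.
config = {
    "train_loader": {"name": "text", "dataset": {"streaming": False}},
    "eval_loader": {"name": "text", "dataset": {"streaming": True}},
}

for loader_name, loader_cfg in config.items():
    print(f"{loader_name}: {dataset_class_name(loader_cfg)}")
```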

To get started, you can experiment with c4 data using the following instructions.

Evaluations

GLUE

GLUE evaluations for a ModernBERT model trained with this repository can be run via run_evals.py, by providing it with a checkpoint and a training config. To evaluate non-ModernBERT models, you should use glue.py in conjunction with a slightly different training YAML, of which you can find examples in the yamls/finetuning folder.

Retrieval

The examples subfolder contains scripts for training retrieval models, both dense models based on Sentence Transformers and ColBERT models via the PyLate library:

examples/train_pylate.py: The boilerplate code to train a ModernBERT-based ColBERT model with PyLate.
examples/train_st.py: The boilerplate code to train a ModernBERT-based dense retrieval model with Sentence Transformers.
examples/evaluate_pylate.py: The boilerplate code to evaluate a ModernBERT-based ColBERT model with PyLate.
examples/evaluate_st.py: The boilerplate code to evaluate a ModernBERT-based dense retrieval model with Sentence Transformers.

Reference

If you use ModernBERT in your work, be it the released models, the intermediate checkpoints (release pending) or this training repository, please cite:

@misc{modernbert,
      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
      year={2024},
      eprint={2412.13663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13663},
}