Bringing BERT into modernity via both architecture changes and scaling
AnswerDotAI/ModernBERT
This is the repository where you can find ModernBERT, our experiments to bring BERT into modernity via both architecture changes and scaling.
This repository notably introduces FlexBERT, our modular approach to encoder building blocks, and relies heavily on .yaml configuration files to build models. The codebase builds upon MosaicBERT, and specifically the unmerged fork bringing Flash Attention 2 to it, under the terms of its Apache 2.0 license. We extend our thanks to MosaicML for starting the work on modernising encoders!
This README is very barebones and is still under construction. It will improve with more reproducibility and documentation in the new year, as we gear up for more encoder niceties after the pre-holidays release of ModernBERT. For now, we're mostly looking forward to seeing what people build with the 🤗 model checkpoints.
For more details on what this repository brings, we recommend reading our release blog post for a high-level overview, and our arXiv preprint for more technical details.
All code in this repository is the code used in our experiments for both pre-training and GLUE evaluations; there's no uncommitted secret training sauce.
This is the research repository for ModernBERT, focused on pre-training and evaluations. If you're seeking the HuggingFace version, designed to integrate with any common pipeline, please head to the ModernBERT Collection on HuggingFace.
ModernBERT is a collaboration between Answer.AI, LightOn, and friends.
We have fully documented the environment used to train ModernBERT, which can be installed on a GPU-equipped machine with the following commands:
    conda env create -f environment.yaml
    # if the conda environment errors out set channel priority to flexible:
    # conda config --set channel_priority flexible
    conda activate bert24
    # if using H100s clone and build flash attention 3
    # git clone https://github.com/Dao-AILab/flash-attention.git
    # cd flash-attention/hopper
    # python setup.py install
    # install flash attention 2 (model uses FA3+FA2 or just FA2 if FA3 isn't supported)
    pip install "flash_attn==2.6.3" --no-build-isolation
    # or download a precompiled wheel from https://github.com/Dao-AILab/flash-attention/releases/tag/v2.6.3
    # or limit the number of parallel compilation jobs
    # MAX_JOBS=8 pip install "flash_attn==2.6.3" --no-build-isolation
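After installation, it can be useful to confirm which Flash Attention builds are actually importable before launching a job. The following is a minimal sketch, not part of this repository: the function name is hypothetical, and it assumes that Flash Attention 2 installs the `flash_attn` module and that the Flash Attention 3 hopper build installs `flash_attn_interface`.

```python
import importlib.util


def available_flash_attention() -> dict:
    """Report which Flash Attention packages can be found in this environment."""
    # Assumption: FA2 installs the `flash_attn` module and the FA3 hopper build
    # installs `flash_attn_interface`; adjust the names if your build differs.
    candidates = {"FA2": "flash_attn", "FA3": "flash_attn_interface"}
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in candidates.items()
    }


if __name__ == "__main__":
    for label, found in available_flash_attention().items():
        print(f"{label}: {'found' if found else 'not found'}")
```

The check only looks up the modules without importing them, so it does not verify that the compiled kernels match your GPU or CUDA version.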
Training heavily leverages the composer framework. All training runs are configured via YAML files, of which you can find examples in the yamls folder. We highly encourage you to check out one of the example yamls, such as yamls/main/flex-bert-rope-base.yaml, to explore the configuration options.
To run a training job using yamls/main/modernbert-base.yaml on all available GPUs, use the following command:
    composer main.py yamls/main/modernbert-base.yaml

There are two dataset classes to choose between:
StreamingTextDataset
- inherits from StreamingDataset
- uses MDS, CSV/TSV or JSONL format
- Supports both text and tokenized data
- can be used with local data as well
- WARNING: we found distribution of memory over accelerators to be uneven
NoStreamingDataset
- requires decompressed MDS-format data; compressed MDS data can be decompressed using src/data/mds_conversion.py with the --decompress flag
- Supports both text and tokenized data
When data is accessed locally, we recommend using NoStreamingDataset, as it enabled higher training throughput in our setting. Both classes are located in src/text_data.py, and the class to use can be chosen for each data_loader and dataset by setting streaming: true (StreamingTextDataset) or streaming: false (NoStreamingDataset).
    train_loader:
      name: text
      dataset:
        streaming: false

To get started, you can experiment with c4 data using the following instructions.
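The mapping from the `streaming` flag to a dataset class can be illustrated with a small sketch. This is not the dispatch code from src/text_data.py: the function `dataset_class_name` is hypothetical, it works on a plain dict that mirrors the `train_loader` config, and it returns class names as strings only. Because the repository's default for a missing flag is not stated here, the sketch requires the flag to be set explicitly.

```python
def dataset_class_name(loader_cfg: dict) -> str:
    """Return the name of the dataset class selected by a data loader config."""
    dataset_cfg = loader_cfg["dataset"]
    # Assumption: no default is applied here; the flag must be set explicitly.
    if "streaming" not in dataset_cfg:
        raise KeyError("set dataset.streaming explicitly to true or false")
    # streaming: true -> StreamingTextDataset, streaming: false -> NoStreamingDataset
    return "StreamingTextDataset" if dataset_cfg["streaming"] else "NoStreamingDataset"


# Mirrors the train_loader YAML fragment: local data, streaming disabled.
train_loader = {"name": "text", "dataset": {"streaming": False}}
print(dataset_class_name(train_loader))  # prints "NoStreamingDataset"
```

Since the flag is set per loader, a training loader reading local data and an evaluation loader reading streamed data can use different classes in the same run.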
GLUE evaluations for a ModernBERT model trained with this repository can be run via run_evals.py, by providing it with a checkpoint and a training config. To evaluate non-ModernBERT models, you should use glue.py in conjunction with a slightly different training YAML, of which you can find examples in the yamls/finetuning folder.
The examples subfolder contains scripts for training retrieval models, both dense models based on Sentence Transformers and ColBERT models via the PyLate library:
- examples/train_pylate.py: The boilerplate code to train a ModernBERT-based ColBERT model with PyLate.
- examples/train_st.py: The boilerplate code to train a ModernBERT-based dense retrieval model with Sentence Transformers.
- examples/evaluate_pylate.py: The boilerplate code to evaluate a ModernBERT-based ColBERT model with PyLate.
- examples/evaluate_st.py: The boilerplate code to evaluate a ModernBERT-based dense retrieval model with Sentence Transformers.
If you use ModernBERT in your work, be it the released models, the intermediate checkpoints (release pending) or this training repository, please cite:
    @misc{modernbert,
          title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
          author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
          year={2024},
          eprint={2412.13663},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2412.13663},
    }