contrastors is a contrastive learning toolkit that enables researchers and engineers to train and evaluate contrastive models efficiently.
- Built on top of Flash Attention for fast and efficient training
- Support for training on multiple GPUs
- GradCache support for training with large batch sizes in constrained memory environments
- Huggingface Support for easy loading of common models (Pythia/GPTNeoX, BERT, etc.)
- Masked Language Modeling (MLM) Pretraining
- Matryoshka Representation Learning for flexible embedding sizes
- CLIP and LiT style contrastive learning
- Support for loading popular ViT (e.g. timm) models
contrastors was used to train the models described in the following papers:
- Nomic Embed: Training a Reproducible Long Context Text Embedder by Zach Nussbaum, Jack Morris, Andriy Mulyar, and Brandon Duderstadt
- Nomic Embed Vision: Expanding the Latent Space by Zach Nussbaum, Brandon Duderstadt, and Andriy Mulyar
- Training Sparse Mixture Of Experts Text Embedding Models by Zach Nussbaum and Brandon Duderstadt
The contrastors library relies on custom kernels from the Flash Attention repository. To set up your environment, follow the steps below.
Make sure that you have CUDA 11.8+. You can check this by running nvcc --version, or, if you already have torch installed, python -c "import torch; print(torch.version.cuda)"
Create a Python venv and activate it:
python3 -m venv env
source env/bin/activate
Install torch. See the torch docs for specific instructions for your system (e.g. the default CUDA version torch supports is 12.1 as of 12/12/2023).
pip3 install torch torchvision torchaudio
Install wheel, packaging, ninja, and setuptools for Flash Attention (so the builds don't take too long):
pip install wheel packaging ninja setuptools
Install Flash Attention and the custom kernels
pip install --no-cache-dir flash-attn --no-build-isolation \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/layer_norm \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/fused_dense_lib \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/xentropy
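Once the build finishes, a quick import check can catch a broken install before you launch training. This is a minimal sketch: it only confirms that torch sees CUDA and that the flash-attn package imports, not that every optional kernel above was built.

```python
# Sanity check for the environment set up above: verifies the CUDA-enabled
# torch build and the flash-attn wheel import cleanly. The optional csrc/*
# kernels installed above are not exercised here.
import torch
import flash_attn

print("torch CUDA version:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn version:", flash_attn.__version__)
```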
Install the rest of the requirements and the package
pip install -e .
We provide access to the nomic-embed-text-v1 dataset via the nomic package. To access the data, you will need to create an account and log in with the nomic package. First create an account at atlas.nomic.ai, install the nomic Python client, and run the following commands:
pip install nomic
nomic login  # follow prompts to login
python -c "from nomic import atlas; print(atlas._get_datastream_credentials(name='contrastors'))"
which will print out your access keys. You can then configure them by using aws configure or setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
If you do not have the AWS CLI installed, install it by following the AWS CLI installation instructions.
To verify your access, you can run the following command to list the contents of the bucket:
aws s3 ls --endpoint-url=https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/ s3://contrastive
aws s3 ls --endpoint-url=https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/ s3://contrastive-index-filtered
You should be able to see the contents of the bucket and download the data.
If you intend to train using our data and the contrastors repo, you will need to set up fsspec support for Cloudflare R2. To do so, create a file ~/.config/fsspec/s3.json with the following contents:
{"s3": {"client_kwargs": {"endpoint_url":"https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/","aws_access_key_id":<ACCESS_KEY_ID>,"aws_secret_access_key":<SECRET_KEY_ID> } }}Our text data is stored in gziped jsonl files with which we also store acounts.json file andoffsets.json.gzip.
Our text data is stored in gzipped JSONL shards, alongside which we also store a counts.json file and an offsets.json.gz file. The counts.json file is a dictionary mapping each file name to the number of examples in that file. The offsets.json.gz file is a dictionary mapping each file name to a dictionary where each key is the index of an example and the value is a tuple of the start and end byte offsets of that example in the file. We do this to allow streaming of data in from R2, especially when the data is larger than the buffer size.
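As an illustration of how this metadata can be used, the sketch below reads one example out of a shard via its byte range. The file names come from the description above; the assumption that the offsets index into the decompressed JSONL stream is ours, not documented behavior.

```python
import gzip
import json

# Load the per-shard example counts and byte offsets described above.
with open("counts.json") as f:
    counts = json.load(f)            # {shard_name: number_of_examples}

with gzip.open("offsets.json.gz", "rt") as f:
    offsets = json.load(f)           # {shard_name: {example_idx: [start_byte, end_byte]}}

shard = next(iter(counts))           # pick any shard
start, end = offsets[shard]["0"]     # byte range of the first example (assumed layout)

# GzipFile supports seek() (it decompresses forward to the offset), so only
# the requested byte range is parsed as JSON.
with gzip.open(shard, "rb") as f:
    f.seek(start)
    example = json.loads(f.read(end - start))

print(f"{shard}: {counts[shard]} examples; example 0 keys: {list(example)}")
```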
Here's a small example of what a dataset configuration might look like:
datasets:
  - name: "paq"
    bucket: "s3://contrastive-index-filtered/paq_full/shard-{00000..00538}.jsonl.gz"
    query_prefix: "search_query"
    document_prefix: "search_document"
    objective:
      type: "paired"
      columns: ["query", "document"]
The objective field defines whether the dataset uses a paired or triplet objective. In both cases, the columns field defines which columns to use for each example.
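For intuition, here is a generic sketch (not the contrastors implementation; the names and the InfoNCE-style formulation are illustrative) of how the two objective types differ once embeddings have been produced for the configured columns:

```python
import torch
import torch.nn.functional as F

def paired_loss(q, d, temperature=0.05):
    # "paired": every other document in the batch serves as an in-batch negative.
    q, d = F.normalize(q, dim=-1), F.normalize(d, dim=-1)
    logits = q @ d.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def triplet_loss(q, d, neg, temperature=0.05):
    # "triplet": a third column supplies an explicit hard negative per query,
    # concatenated onto the in-batch negatives.
    q, d, neg = (F.normalize(x, dim=-1) for x in (q, d, neg))
    logits = q @ torch.cat([d, neg]).T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

q, d, neg = (torch.randn(16, 768) for _ in range(3))
print(paired_loss(q, d).item(), triplet_loss(q, d, neg).item())
```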
To train your own BERT from scratch (with all the optimizations) run
cd src/contrastors
deepspeed --num_gpus=8 train.py --config=configs/train/mlm.yaml --deepspeed_config=configs/deepspeed/ds_config.json --dtype=bf16
To launch an experiment run
cd src/contrastors
torchrun --nproc-per-node=8 train.py --config=configs/train/contrastive_pretrain.yaml --dtype=bf16
This will train a BERT model on all ~200M examples. To change the dataset, you can modify data_args.input_shards.
To finetune nomic-bert-embed-v1-unsupervised, update the config to configs/train/contrastive_finetune.yaml.
To generate your own data for any step of the pipeline, you can use the provided scripts in scripts/text.
See the README in scripts/text for more information.
To align a vision model, you will need to curate a large image-text dataset. More details can be found here.
To align nomic-embed-vision-v1.5 with nomic-embed-text-v1.5, you can run the following command:
deepspeed train.py --deepspeed_config=configs/deepspeed/image_text.json --config=configs/train/nomic_embed_vision_v1.5.yaml --dtype=bf16
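Conceptually, this alignment follows the LiT recipe: the text tower stays frozen and only the vision tower is trained, so image embeddings land in the existing nomic-embed-text latent space. A toy sketch (stand-in linear encoders and random tensors, not the actual contrastors training loop):

```python
import torch
import torch.nn.functional as F

# Toy LiT-style alignment step: the text tower is frozen and only the vision
# tower is updated, pulling image embeddings into the text embedding space.
text_encoder = torch.nn.Linear(512, 768)      # stand-in for the frozen text model
vision_encoder = torch.nn.Linear(1024, 768)   # stand-in for a trainable ViT

for p in text_encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(vision_encoder.parameters(), lr=1e-4)

text_inputs = torch.randn(8, 512)      # stand-ins for encoded captions
image_inputs = torch.randn(8, 1024)    # stand-ins for image features

t = F.normalize(text_encoder(text_inputs), dim=-1)
v = F.normalize(vision_encoder(image_inputs), dim=-1)
logits = v @ t.T / 0.05                # image -> text similarity with in-batch negatives
labels = torch.arange(8)
loss = F.cross_entropy(logits, labels)
loss.backward()                        # gradients only flow into the vision tower
optimizer.step()
print(loss.item())
```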
We provide pretrained models for Nomic Embed at the following locations:
- nomic-embed-text-v2-moe
- nomic-embed-text-v2-moe-unsupervised
- nomic-embed-text-v1
- nomic-embed-vision-v1
- nomic-embed-text-v1.5
- nomic-embed-vision-v1.5
- nomic-embed-text-v1-ablated
- nomic-embed-text-v1-unsupervised
- nomic-bert-2048
- nomic-xlm-2048
- Nomic: https://nomic.ai
- Discord: https://discord.gg/myY5YDR8z8
- Twitter: https://twitter.com/nomic_ai
This code is licensed under the Apache 2.0 License. See the model cards for the individual license for each model.
We thank Tri Dao for his work on Flash Attention and the custom kernels that make this project possible, the OpenCLIP team for their great repository, on which much of this work is based, and the Huggingface team for their great work on the transformers library.
If you find the model, dataset, or training code useful, please cite our work
@misc{nussbaum2024nomic,
  title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
  author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
  year={2024},
  eprint={2402.01613},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{nussbaum2024nomicembedvisionexpanding,
  title={Nomic Embed Vision: Expanding the Latent Space},
  author={Zach Nussbaum and Brandon Duderstadt and Andriy Mulyar},
  year={2024},
  eprint={2406.18587},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2406.18587}
}

@misc{nussbaum2025trainingsparsemixtureexperts,
  title={Training Sparse Mixture Of Experts Text Embedding Models},
  author={Zach Nussbaum and Brandon Duderstadt},
  year={2025},
  eprint={2502.07972},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.07972}
}