contrastors

Train Models Contrastively in Pytorch

contrastors is a contrastive learning toolkit that enables researchers and engineers to train and evaluate contrastive models efficiently.

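As a rough illustration of the kind of objective such a toolkit optimizes, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss in PyTorch. The function name and temperature are illustrative and not part of the contrastors API.

# Minimal sketch of a symmetric InfoNCE-style contrastive loss
# (illustrative only, not the contrastors implementation).
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # Normalize embeddings so dot products are cosine similarities.
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # Similarity matrix: entry (i, j) compares query i with document j.
    logits = query_emb @ doc_emb.T / temperature
    # The matching document for each query sits at the same batch index.
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over rows (query -> doc) and columns (doc -> query).
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# Example with random embeddings: a batch of 8 pairs of 768-dimensional vectors.
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))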

Features

  • Built on top of Flash Attention for fast and efficient training
  • Support for training on multiple GPUs
  • GradCache support for training with large batch sizes in constrained memory environments
  • Huggingface support for easy loading of common models (Pythia/GPTNeoX, BERT, etc.)
  • Masked Language Modeling (MLM) Pretraining
  • Matryoshka Representation Learning for flexible embedding sizes (see the sketch after this list)
  • CLIP and LiT style contrastive learning
  • Support for loading popular ViT (e.g. timm) models
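
As referenced in the feature list above, Matryoshka Representation Learning trains embeddings whose leading dimensions remain useful on their own, so a vector can be truncated after training. A minimal sketch of that truncation step (the dimensions are illustrative, and this is not the contrastors API):

# Sketch of Matryoshka-style truncation: keep the first k dimensions of an
# embedding and re-normalize, trading some accuracy for a smaller vector.
import torch
import torch.nn.functional as F

full_emb = torch.randn(4, 768)   # batch of 768-dim embeddings (illustrative)
k = 256                          # target Matryoshka dimension
small_emb = F.normalize(full_emb[:, :k], dim=-1)
print(small_emb.shape)           # torch.Size([4, 256])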

Research

Getting Started and Requirements

The contrastors library relies on custom kernels from the Flash Attention repository. To set up your environment, you will need to follow the steps below.

Make sure that you have CUDA 11.8+. You can check this by running nvcc --version, or, if you already have torch installed, you can run python -c "import torch; print(torch.version.cuda)"

Create a python venv and activate it

python3 -m venv env
source env/bin/activate

Install torch. See the torch docs for specific instructions for your system (e.g. the default CUDA version torch supports is 12.1 as of 12/12/2023).

pip3 install torch torchvision torchaudio

Install wheel, packaging, ninja, and setuptools for Flash Attention (so the builds don't take too long)

pip install wheel packaging ninja setuptools

Install Flash Attention and the custom kernels

pip install --no-cache-dir flash-attn --no-build-isolation \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/layer_norm \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/fused_dense_lib \
    git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/xentropy

Install the rest of the requirements and the package

pip install -e .

Data Access

We provide access to the nomic-embed-text-v1 dataset via the nomic package. To access the data, create an account at atlas.nomic.ai, install the nomic Python client, log in, and run the following commands:

pip install nomic
nomic login  # follow prompts to log in
python -c "from nomic import atlas; print(atlas._get_datastream_credentials(name='contrastors'))"

This will print out your access keys. You can then configure them by using aws configure or by setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

If you do not have the AWS CLI installed, you can install it by following the AWS CLI installation instructions.

To verify your access, you can run the following command to list the contents of the bucket:

aws s3 ls --endpoint-url=https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/ s3://contrastive
aws s3 ls --endpoint-url=https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/ s3://contrastive-index-filtered

You should be able to see the contents of the bucket and download the data.
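
If you prefer to check access from Python instead of the AWS CLI, here is a small sketch using boto3, assuming boto3 is installed and the same credentials and R2 endpoint as above:

# Sketch: list a few objects in the R2 bucket with boto3 instead of the AWS CLI.
# Assumes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are set as described above.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/",
)
resp = s3.list_objects_v2(Bucket="contrastive-index-filtered", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])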

If you intend to train using our data and the contrastors repo, you will need to set up fsspec support for Cloudflare R2. To do so, create a file ~/.config/fsspec/s3.json with the following contents:

{"s3": {"client_kwargs": {"endpoint_url":"https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/","aws_access_key_id":<ACCESS_KEY_ID>,"aws_secret_access_key":<SECRET_KEY_ID>    }  }}

Nomic Data Format

Our text data is stored in gzipped JSONL files, alongside which we also store a counts.json file and an offsets.json.gz file.

The counts.json file is a dictionary mapping each file name to the number of examples in that file. The offsets.json.gz file is a dictionary mapping each file name to a dictionary whose keys are example indices and whose values are the (start, end) byte offsets of that example in the file. We do this to allow streaming of data from R2, especially when the data is larger than the buffer size.
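
To make the layout concrete, here is a sketch of how counts and offsets metadata could be built for a single local shard. This is illustrative only; the exact index produced upstream (for example, whether offsets refer to compressed or uncompressed bytes) may differ.

# Sketch: build counts / offsets metadata for one local, already-decompressed
# JSONL shard. Each offset is (start_byte, end_byte) of a line in the file.
# Illustrative only; the upstream index format may differ in detail.
import json

shard = "shard-00000.jsonl"  # hypothetical local file
offsets = {}
with open(shard, "rb") as f:
    start = 0
    for i, line in enumerate(f):
        end = start + len(line)
        offsets[i] = (start, end)
        start = end

counts = {shard: len(offsets)}
print(counts, offsets.get(0))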

Here's a small example of what a dataset configuration might look like:

datasets:
  - name: "paq"
    bucket: "s3://contrastive-index-filtered/paq_full/shard-{00000..00538}.jsonl.gz"
    query_prefix: "search_query"
    document_prefix: "search_document"
    objective:
      type: "paired"
      columns: ["query", "document"]

objective defines whether the dataset uses a paired or triplet objective. In both cases, the columns field defines the columns to use for each example.

Training nomic-embed-text-v1

Masked Language Modeling Pretraining

To train your own BERT from scratch (with all the optimizations) run

cd src/contrastors
deepspeed --num_gpus=8 train.py --config=configs/train/mlm.yaml --deepspeed_config=configs/deepspeed/ds_config.json --dtype=bf16

Contrastive Pretraining and Finetuning

To launch an experiment run

cd src/contrastors
torchrun --nproc-per-node=8 train.py --config=configs/train/contrastive_pretrain.yaml --dtype=bf16

This will train a BERT model on all ~200M examples. To change the dataset, you can modify data_args.input_shards.

To finetune nomic-bert-embed-v1-unsupervised, update the config to configs/train/contrastive_finetune.yaml.

Generating Your Own Data

To generate your own data for any step of the pipeline, you can use the provided scripts in scripts/text.

See the README in scripts/text for more information.

Training nomic-embed-vision-v1.5

To align a vision model, you will need to curate a large image-text dataset. More details can be found here.

To align nomic-embed-vision-v1.5 with nomic-embed-text-v1.5, you can run the following command:

deepspeed train.py --deepspeed_config=configs/deepspeed/image_text.json --config=configs/train/nomic_embed_vision_v1.5.yaml --dtype=bf16

Pretrained Models

We provide pretrained models for Nomic Embed on the Hugging Face Hub under the nomic-ai organization.
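
As a quick way to try a published checkpoint, here is a minimal sketch using sentence-transformers, assuming the nomic-ai/nomic-embed-text-v1 checkpoint and a sentence-transformers release that supports trust_remote_code. Note the search_query / search_document prefixes, which match the dataset configuration above.

# Sketch: embed a query and a document with a published Nomic Embed checkpoint.
# Assumes sentence-transformers is installed and supports trust_remote_code.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
embeddings = model.encode([
    "search_query: what is contrastive learning?",
    "search_document: Contrastive learning trains encoders to pull matching pairs together.",
])
print(embeddings.shape)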

Join the Nomic Community

License

This code is licensed under the Apache 2.0 License. See the model cards for the individual license of each model.

Acknowledgements

We thank Tri Dao for his work on Flash Attention and the custom kernels that make this project possible, the OpenCLIP team for their great repository, on which much of this work is based, and the Huggingface team for their great work on the transformers library.

Citation

If you find the model, dataset, or training code useful, please cite our work:

@misc{nussbaum2024nomic,
  title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
  author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
  year={2024},
  eprint={2402.01613},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{nussbaum2024nomicembedvisionexpanding,
  title={Nomic Embed Vision: Expanding the Latent Space},
  author={Zach Nussbaum and Brandon Duderstadt and Andriy Mulyar},
  year={2024},
  eprint={2406.18587},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2406.18587}
}

@misc{nussbaum2025trainingsparsemixtureexperts,
  title={Training Sparse Mixture Of Experts Text Embedding Models},
  author={Zach Nussbaum and Brandon Duderstadt},
  year={2025},
  eprint={2502.07972},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.07972}
}
