lmmtoolkit is a toolkit for Multi-Modal Learning
This is a simple implementation of Natural Language-based Image Search inspired by the CLIP approach, as proposed in the paper Learning Transferable Visual Models From Natural Language Supervision by OpenAI, implemented in PyTorch Lightning. We also use Weights & Biases for experiment tracking, visualizing results, comparing the performance of different backbone models, hyperparameter optimization, and ensuring reproducibility.
python examples/train_clip.py
This command will initialize a CLIP model with a ResNet50 image backbone and a distilbert-base-uncased text backbone.
CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in the training dataset. This behavior turns CLIP into a zero-shot classifier: all of a dataset's classes are converted into captions such as "a photo of a dog", and CLIP predicts the class whose caption it estimates best pairs with a given image.
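To make the zero-shot step concrete, here is a minimal sketch of caption-based classification with already-computed embeddings. It assumes the trained image and text encoders each produce `(batch, embed_dim)` tensors; the function name and shapes are illustrative, not the repository's actual API.

```python
import torch
import torch.nn.functional as F


def zero_shot_classify(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Return a probability distribution over candidate captions for each image.

    image_emb: (num_images, embed_dim) embeddings from the image encoder
    text_embs: (num_classes, embed_dim) embeddings of captions such as
               "a photo of a dog", one per class
    """
    # Cosine similarity = dot product of L2-normalized embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = image_emb @ text_embs.T          # (num_images, num_classes)
    return logits.softmax(dim=-1)             # per-image class probabilities
```

The predicted class for each image is simply the caption with the highest probability, e.g. `probs.argmax(dim=-1)`.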
You can read more about CLIP in OpenAI's blog post and the accompanying paper.
This implementation of CLIP supports training on two datasets: Flickr8k, which contains ~8K images with 5 captions each, and Flickr30k, which contains ~30K images with corresponding captions.
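Both datasets boil down to image-caption pairs, which is the only structure CLIP training needs. The sketch below shows one way such a dataset could be wrapped for PyTorch; the `pairs` list and the class name are assumptions, and parsing of the Flickr annotation files is intentionally left out.

```python
from pathlib import Path
from typing import List, Tuple

from PIL import Image
from torch.utils.data import Dataset


class ImageCaptionDataset(Dataset):
    """Minimal Flickr-style dataset: each item is one (image, caption) pair.

    `pairs` is assumed to be a list of (image_filename, caption) tuples built
    from the Flickr8k/Flickr30k annotation files.
    """

    def __init__(self, pairs: List[Tuple[str, str]], image_dir: str, transform=None):
        self.pairs = pairs
        self.image_dir = Path(image_dir)
        self.transform = transform

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, idx: int):
        image_file, caption = self.pairs[idx]
        image = Image.open(self.image_dir / image_file).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, caption
```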
A CLIP model uses a text encoder and an image encoder. This repository supports pulling image models from PyTorch Image Models (timm) and transformer models from Hugging Face Transformers, as sketched below.
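The following is a hedged sketch of how a timm image backbone and a Hugging Face text backbone could be paired and projected into a shared embedding space. The class name, projection layers, and pooling choice are assumptions for illustration, not the repository's actual module layout.

```python
import timm
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer


class CLIPEncoders(nn.Module):
    """Illustrative pairing of a timm image backbone with a Hugging Face text
    backbone, each followed by a linear projection into a shared space."""

    def __init__(self, image_model="resnet50",
                 text_model="distilbert-base-uncased", embed_dim=256):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.image_backbone = timm.create_model(image_model, pretrained=True, num_classes=0)
        self.text_backbone = AutoModel.from_pretrained(text_model)
        self.tokenizer = AutoTokenizer.from_pretrained(text_model)
        self.image_proj = nn.Linear(self.image_backbone.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text_backbone.config.hidden_size, embed_dim)

    def encode_image(self, pixels: torch.Tensor) -> torch.Tensor:
        return self.image_proj(self.image_backbone(pixels))

    def encode_text(self, captions):
        tokens = self.tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
        out = self.text_backbone(**tokens)
        # Use the first token's hidden state as a sentence-level representation.
        return self.text_proj(out.last_hidden_state[:, 0])
```

Because both backbones are loaded by name, swapping in a different timm or Transformers model only requires changing the `image_model` or `text_model` string.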