CLIP

This is a simple implementation of Natural Language-based Image Search inspired by the CLIP approach proposed in the paper Learning Transferable Visual Models From Natural Language Supervision by OpenAI, implemented in PyTorch Lightning. We also use Weights & Biases for experiment tracking, visualizing results, comparing the performance of different backbone models, hyperparameter optimization, and ensuring reproducibility.

python examples/train_clip.py

This command will initialize a CLIP model with a ResNet50 image backbone and a distilbert-base-uncased text backbone.
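
As a rough illustration of how the Lightning and W&B pieces fit together (the project name, run name, and Trainer settings below are illustrative, not taken from `examples/train_clip.py`):

```python
# Sketch of attaching a Weights & Biases logger to a PyTorch Lightning Trainer;
# the project/run names and Trainer settings here are illustrative.
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

logger = WandbLogger(project="clip", name="resnet50-distilbert")
trainer = pl.Trainer(max_epochs=20, accelerator="auto", logger=logger)
# trainer.fit(model, datamodule=data)  # model/datamodule as built in train_clip.py
```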

📚 CLIP: Connecting Text and Images

CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in the dataset. This behavior turns CLIP into a zero-shot classifier: all of a dataset's classes are converted into captions such as "a photo of a dog", and CLIP predicts the class whose caption it estimates best pairs with a given image.
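
As a sketch of what this zero-shot recipe looks like in code, the example below uses the pretrained openai/clip-vit-base-patch32 checkpoint from Hugging Face Transformers rather than a model trained with this toolkit; the class names and the blank stand-in image are placeholders:

```python
# Zero-shot classification sketch with a pretrained CLIP checkpoint
# (openai/clip-vit-base-patch32); classes and the dummy image are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat", "car"]
captions = [f"a photo of a {c}" for c in classes]  # classes -> captions
image = Image.new("RGB", (224, 224))               # stand-in for a real image

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores; the caption with the highest probability
# gives the predicted class.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```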

You can read more about CLIP here and here.

💿 Dataset

This implementation of CLIP supports training on two datasets: Flickr8k, which contains ~8K images with 5 captions per image, and Flickr30k, which contains ~30K images with corresponding captions.
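
A minimal loader for such image–caption data might look like the sketch below; it assumes the common Kaggle-style layout with an Images/ folder and a captions.csv file of image,caption rows, which may differ from the loader used in this repository:

```python
# Sketch of a Flickr-style image–caption dataset (assumed layout: an Images/
# folder plus a CSV with one "image,caption" row per caption; the toolkit's
# actual loader may differ).
import csv
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class FlickrCaptionDataset(Dataset):
    def __init__(self, root, captions_file="captions.csv", image_size=224):
        self.root = Path(root)
        with open(self.root / captions_file, newline="") as f:
            # Each row pairs an image filename with one of its captions,
            # so an image with 5 captions yields 5 samples.
            self.samples = [(row["image"], row["caption"]) for row in csv.DictReader(f)]
        self.transform = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_name, caption = self.samples[idx]
        image = Image.open(self.root / "Images" / image_name).convert("RGB")
        return self.transform(image), caption
```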

🤖 Model

A CLIP model uses a text encoder and an image encoder. This repository supports pulling image models from PyTorch Image Models (timm) and transformer models from Hugging Face Transformers.
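
A minimal sketch of such a dual encoder is shown below; the default backbones, the 256-dimensional projection, the 0.07 temperature, and the [CLS]-token pooling are illustrative choices, not necessarily those used in this repository:

```python
# Dual-encoder sketch in the spirit of this repo: image backbone from timm,
# text backbone from Hugging Face Transformers, linear projections into a
# shared space, and the symmetric contrastive (InfoNCE-style) loss used by CLIP.
# Backbone names, projection size, and temperature are illustrative choices.
import timm
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoModel, AutoTokenizer


class CLIPDualEncoder(nn.Module):
    def __init__(self, image_model="resnet50", text_model="distilbert-base-uncased",
                 embed_dim=256, temperature=0.07):
        super().__init__()
        self.image_encoder = timm.create_model(image_model, pretrained=True, num_classes=0)
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.image_proj = nn.Linear(self.image_encoder.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        self.temperature = temperature

    def forward(self, images, input_ids, attention_mask):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        # Use the first ([CLS]) token as the sentence representation.
        txt_hidden = self.text_encoder(input_ids=input_ids,
                                       attention_mask=attention_mask).last_hidden_state[:, 0]
        txt = F.normalize(self.text_proj(txt_hidden), dim=-1)
        return img, txt

    def contrastive_loss(self, img, txt):
        # Matching image/text pairs lie on the diagonal of the similarity matrix.
        logits = img @ txt.t() / self.temperature
        targets = torch.arange(len(img), device=img.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    model = CLIPDualEncoder()
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    tokens = tokenizer(["a photo of a dog", "a photo of a cat"],
                       padding=True, return_tensors="pt")
    images = torch.randn(2, 3, 224, 224)
    img_emb, txt_emb = model(images, tokens["input_ids"], tokens["attention_mask"])
    print(model.contrastive_loss(img_emb, txt_emb))
```

The loss treats the matching image–text pairs in a batch as positives and every other pairing as a negative, which is the contrastive objective CLIP is trained with.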

About

lmmtoolkit is a toolkit for Multi-Modal Learning
