CLIP

This is a simple implementation of Natural Language-based Image Search inspired by the CLIP approach proposed in the paper Learning Transferable Visual Models From Natural Language Supervision by OpenAI, implemented in PyTorch Lightning. We also use Weights & Biases for experiment tracking, visualizing results, comparing the performance of different backbone models, hyperparameter optimization, and ensuring reproducibility.

python examples/train_clip.py

This command will initialize a CLIP model with a ResNet50 image backbone and a distilbert-base-uncased text backbone.
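
As a rough illustration of how the Lightning and W&B pieces fit together (the project name, run name, and Trainer settings below are illustrative, not taken from `examples/train_clip.py`):

```python
# Sketch of attaching a Weights & Biases logger to a PyTorch Lightning Trainer;
# the project/run names and Trainer settings here are illustrative.
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

logger = WandbLogger(project="clip", name="resnet50-distilbert")
trainer = pl.Trainer(max_epochs=20, accelerator="auto", logger=logger)
# trainer.fit(model, datamodule=data)  # model/datamodule as built in train_clip.py
```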

📚 CLIP: Connecting Text and Images

CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in the dataset. This behavior turns CLIP into a zero-shot classifier: all of a dataset's classes are converted into captions such as "a photo of a dog", and CLIP predicts the class whose caption it estimates best pairs with a given image.
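
As a sketch of what this zero-shot recipe looks like in code, the example below uses the pretrained openai/clip-vit-base-patch32 checkpoint from Hugging Face Transformers rather than a model trained with this toolkit; the class names and the blank stand-in image are placeholders:

```python
# Zero-shot classification sketch with a pretrained CLIP checkpoint
# (openai/clip-vit-base-patch32); classes and the dummy image are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat", "car"]
captions = [f"a photo of a {c}" for c in classes]  # classes -> captions
image = Image.new("RGB", (224, 224))               # stand-in for a real image

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores; the caption with the highest probability
# gives the predicted class.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```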

You can read more about CLIP here and here.

💿 Dataset

This implementation of CLIP supports training on two datasets: Flickr8k, which contains ~8K images with 5 captions per image, and Flickr30k, which contains ~30K images with corresponding captions.
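
A minimal loader for such image–caption data might look like the sketch below; it assumes the common Kaggle-style layout with an Images/ folder and a captions.csv file of image,caption rows, which may differ from the loader used in this repository:

```python
# Sketch of a Flickr-style image–caption dataset (assumed layout: an Images/
# folder plus a CSV with one "image,caption" row per caption; the toolkit's
# actual loader may differ).
import csv
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class FlickrCaptionDataset(Dataset):
    def __init__(self, root, captions_file="captions.csv", image_size=224):
        self.root = Path(root)
        with open(self.root / captions_file, newline="") as f:
            # Each row pairs an image filename with one of its captions,
            # so an image with 5 captions yields 5 samples.
            self.samples = [(row["image"], row["caption"]) for row in csv.DictReader(f)]
        self.transform = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_name, caption = self.samples[idx]
        image = Image.open(self.root / "Images" / image_name).convert("RGB")
        return self.transform(image), caption
```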

🤖 Model

A CLIP model uses a text encoder and an image encoder. This repository supports pulling image models from PyTorch Image Models (timm) and transformer models from Hugging Face Transformers.
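
A minimal sketch of such a dual encoder is shown below; the default backbones, the 256-dimensional projection, the 0.07 temperature, and the [CLS]-token pooling are illustrative choices, not necessarily those used in this repository:

```python
# Dual-encoder sketch in the spirit of this repo: image backbone from timm,
# text backbone from Hugging Face Transformers, linear projections into a
# shared space, and the symmetric contrastive (InfoNCE-style) loss used by CLIP.
# Backbone names, projection size, and temperature are illustrative choices.
import timm
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoModel, AutoTokenizer


class CLIPDualEncoder(nn.Module):
    def __init__(self, image_model="resnet50", text_model="distilbert-base-uncased",
                 embed_dim=256, temperature=0.07):
        super().__init__()
        self.image_encoder = timm.create_model(image_model, pretrained=True, num_classes=0)
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.image_proj = nn.Linear(self.image_encoder.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        self.temperature = temperature

    def forward(self, images, input_ids, attention_mask):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        # Use the first ([CLS]) token as the sentence representation.
        txt_hidden = self.text_encoder(input_ids=input_ids,
                                       attention_mask=attention_mask).last_hidden_state[:, 0]
        txt = F.normalize(self.text_proj(txt_hidden), dim=-1)
        return img, txt

    def contrastive_loss(self, img, txt):
        # Matching image/text pairs lie on the diagonal of the similarity matrix.
        logits = img @ txt.t() / self.temperature
        targets = torch.arange(len(img), device=img.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    model = CLIPDualEncoder()
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    tokens = tokenizer(["a photo of a dog", "a photo of a cat"],
                       padding=True, return_tensors="pt")
    images = torch.randn(2, 3, 224, 224)
    img_emb, txt_emb = model(images, tokens["input_ids"], tokens["attention_mask"])
    print(model.contrastive_loss(img_emb, txt_emb))
```

The loss treats the matching image–text pairs in a batch as positives and every other pairing as a negative, which is the contrastive objective CLIP is trained with.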

About

lmmtoolkit is a toolkit for Multi-Modal Learning
