ZhaoPeiduo/BLIP2-Japanese
Modifying LAVIS' BLIP2 Q-former with models pretrained on Japanese datasets.
This project builds upon the LAVIS library's BLIP2 model.
The main idea is to replace the tokenizer and the underlying BERT model in BLIP2's Q-Former with counterparts pretrained on Japanese datasets, and to retrain the updated model on Japanese captioning datasets.
The model has been trained on the COCO dataset with STAIR Captions.
The weights of Blip2_Japanese_qformer trained on STAIR Captions can be obtained from Hugging Face.
Copy the whole folder into the lavis directory and make sure the directory is named pretrained.
In addition, download the bert-base-japanese-whole-word-masking weights and config from the Hugging Face link.
You should now be able to run the example.ipynb notebook.
For directory naming conventions, you can also refer to the .gitignore file.
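For reference, here is a minimal sketch of the kind of call example.ipynb makes through LAVIS. The registry name, model_type, and image path below are assumptions, not this repository's exact code, so check the notebook for the actual values:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder registry name/model_type -- use the values shown in example.ipynb
# for this repository's modified Japanese Q-Former.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("demo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Stage-1 Q-Former models in LAVIS expose generate() for captioning.
print(model.generate({"image": image}))
```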
Captions generated for the Flickr30k dataset can be found in flickr30k_caption.json; the generation script is in flickr30k_caption_generate.ipynb.
These captions are generated using top-k sampling instead of nucleus sampling.
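If you just want to inspect those captions, a small sketch like the following should do, assuming flickr30k_caption.json holds a list of entries in the {'image': ..., 'caption': [...]} format shown in the examples below:

```python
import json

# Assumption: a JSON list of {'image': <filename>, 'caption': [<caption strings>]}
# entries, matching the examples shown below.
with open("flickr30k_caption.json", encoding="utf-8") as f:
    entries = json.load(f)

for entry in entries[:5]:
    print(entry["image"], "->", " / ".join(entry["caption"]))
```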
Captions generated by the pretrained and finetuned models are shown below:
pretrained: {'image': '1001773457.jpg', 'caption': ['二 匹 の 犬 が 道路 で フリスビー を し て いる']} # "Two dogs are playing frisbee on the road" (no frisbee in the image)
finetuned: {'image': '1001773457.jpg', 'caption': ['二 匹 の 犬 が 道路 で 喧嘩 を し て いる']} # "Two dogs are fighting on the road"
pretrained: {'image': '1001573224.jpg', 'caption': ['6 人 の 女性 が 屋内 で 飛び跳ね て いる']} # "Six women are jumping indoors" (wrong head count)
finetuned: {'image': '1001573224.jpg', 'caption': ['黒い 服 を 着 た 女性 たち が 飛び跳ね て いる']} # "Women in black clothes are jumping"
In general, captions generated by the finetuned model are more accurate.
Refer to the example.ipynb notebook for more details. The idea is to compute the average cosine similarity of the query tokens between the image embeddings and the multimodal embeddings.
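As a rough illustration of that computation (a sketch only, with assumed tensor shapes rather than the notebook's exact code), the score is the mean over query tokens of the cosine similarity between the image-only and multimodal Q-Former outputs:

```python
import torch
import torch.nn.functional as F

def average_query_cosine_similarity(image_embeds: torch.Tensor,
                                    multimodal_embeds: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity across query tokens.

    Both inputs are assumed to be Q-Former outputs of shape
    (num_query_tokens, dim) for one image / image-text pair.
    """
    sims = F.cosine_similarity(image_embeds, multimodal_embeds, dim=-1)
    return sims.mean()

# Hypothetical example with 32 query tokens of dimension 768:
score = average_query_cosine_similarity(torch.randn(32, 768), torch.randn(32, 768))
print(score.item())
```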
The model was trained on a single RTX 4080 (laptop) GPU, so the training config was modified as follows:
In blip2_pretrain.yaml: vit_precision = 'fp16'
In pretrain_stage1.yaml: batch_size = 25
During evaluation you have to change vit_precision back to fp32.
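If you would rather apply these overrides programmatically than edit the YAML files by hand, here is a minimal OmegaConf sketch; the config paths and the batch-size key name are assumed from a stock LAVIS checkout, so adjust them to yours:

```python
from omegaconf import OmegaConf

# Assumed stock LAVIS config locations -- adjust to your checkout.
model_cfg_path = "lavis/configs/models/blip2/blip2_pretrain.yaml"
run_cfg_path = "lavis/projects/blip2/train/pretrain_stage1.yaml"

model_cfg = OmegaConf.load(model_cfg_path)
model_cfg.model.vit_precision = "fp16"  # switch back to "fp32" for evaluation
OmegaConf.save(model_cfg, model_cfg_path)

run_cfg = OmegaConf.load(run_cfg_path)
run_cfg.run.batch_size_train = 25  # key name assumed; the README simply calls it batch_size
OmegaConf.save(run_cfg, run_cfg_path)
```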
The pretrained and finetuned weights may be updated without prior notice, so if you cannot reproduce the results in the example notebook, please re-download the weights and try again.
A simple interface for demo purposes can be found in generator-ui.py. To run the UI:
python generator-ui.py