# LLMVA-GEBC
Winner solution to the Generic Event Boundary Captioning task in the LOVEU Challenge (CVPR 2023 workshop).
Code for the LOVEU@CVPR2023 Workshop Generic Event Boundary Captioning (GEBC) Challenge. Our proposed method achieved a 76.14 score on the test set and won the challenge.
We propose an effective model, LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) we utilize a pretrained LLM to generate high-quality, human-like captions; (2) to adapt the model to the GEBC task, we use a video Q-former as an adapter and train it while keeping the visual feature extractors and the LLM frozen.
## Environment Preparation

First, create a conda environment:

```
conda env create -f environment.yml
conda activate llmvagebc
```
## Checkpoints

Before using the repository, make sure you have obtained the required pretrained checkpoints. Remember to change the checkpoint path `ckpt` in the config file.
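As a quick sanity check, a minimal sketch like the one below can confirm the checkpoint path resolves before training. The `ckpt` key is mentioned above, but the config file name and the exact YAML layout here are assumptions:

```python
# Sketch: verify the checkpoint path referenced by a training config.
# The "ckpt" key is mentioned in this README; whether it sits at the
# top level of the YAML file is an assumption.
import os
import yaml

cfg_path = "train_configs/my_config.yaml"  # hypothetical config name
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

ckpt_path = cfg.get("ckpt")  # adjust if "ckpt" is nested deeper
assert ckpt_path and os.path.exists(ckpt_path), f"Checkpoint not found: {ckpt_path}"
```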
## Data Preparation

Download the Kinetic-GEBC dataset from https://sites.google.com/view/loveucvpr23/track2.
**For primary visual features:** Use BLIP-2 to extract the primary visual features. We use `feature_extraction.py` to do so. Remember to change `video_dir` and `save_dir` in `train_configs/blip2_feature_extract.yaml` before you run:

```
python feature_extraction.py --cfg-path train_configs/blip2_feature_extract.yaml
```
**For other visual features:** Use CLIP to extract frame-level features and Omnivore to extract clip-level features. We use this pipeline to extract the features. Then, put the extracted features under these three folders:

- `data/features/eva_vit_g_q_former_tokens_12`
- `data/features/clip_fps_15_stride_1_rename`
- `data/features/omnivore_fps_15_len_16_stride_1_rename`

You can also directly download the officially provided features here. But remember to change `q_former_feature_folder`, `other_feat_total_size`, `other_feature_names`, and `other_feature_folders` in the config file.
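As a quick sanity check (a sketch only: the folder name comes from this README, but the assumption that it directly contains `.npy` files is ours), you can inspect one extracted feature file with NumPy:

```python
# Sketch: load one frame-level CLIP feature and print its shape.
import glob

import numpy as np

feat_dir = "data/features/clip_fps_15_stride_1_rename"  # folder from this README
files = sorted(glob.glob(f"{feat_dir}/**/*.npy", recursive=True))
assert files, f"No .npy files found under {feat_dir}"

feat = np.load(files[0])
print(files[0], feat.shape, feat.dtype)
```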
**For region-level features:** Use VinVL to extract region-level features. The region features of a video are saved to multiple `.npy` files, where each file contains the region features of one sampled frame. Merge the feature file paths into `video_to_frame_index.json` in the following format (a sketch of how to generate it follows below):

```
{
    "video_id": [
        "frame_1_feat.npy",
        "frame_2_feat.npy",
        ...
    ],
    ...
}
```

Then put this file under `data/features/`.
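Here is a minimal sketch of how the index could be generated. It assumes the per-frame `.npy` files of each video live in a subfolder named after the video id; that directory layout and the `vinvl_region_features` folder name are assumptions, while the output format matches the one above:

```python
# Sketch: build video_to_frame_index.json from per-frame VinVL features.
# Assumed layout: <region_feat_root>/<video_id>/*.npy, one file per frame.
import json
from pathlib import Path

region_feat_root = Path("data/features/vinvl_region_features")  # hypothetical path

index = {}
for video_dir in sorted(p for p in region_feat_root.iterdir() if p.is_dir()):
    # Sort frame files so the features stay in temporal order.
    index[video_dir.name] = sorted(str(p) for p in video_dir.glob("*.npy"))

with open("data/features/video_to_frame_index.json", "w") as f:
    json.dump(index, f, indent=2)
```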
## Training

First, set the configs in `train_configs/${NAME_OF_YOUR_CONFIG_FILE}.yaml`. Then run the script:

```
CUDA_VISIBLE_DEVICES=${YOUR_GPU_ID} python train.py \
    --cfg-path train_configs/${NAME_OF_YOUR_CONFIG_FILE}.yaml
```

The results can be found in `video_llama/output/`.
## Acknowledgements

We are grateful for the following awesome projects that LLMVA-GEBC builds on:
## Citation

If you find our code useful, please cite the repo as follows:

```
@article{tang2023llmva,
  title={LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning},
  author={Tang, Yunlong and Zhang, Jinrui and Wang, Xiangchen and Wang, Teng and Zheng, Feng},
  journal={arXiv preprint arXiv:2306.10354},
  year={2023}
}
```