foolwood/DRLPublic

NotificationsYou must be signed in to change notification settings
Fork5
Star97

[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

License

Apache-2.0 license

97 stars 5 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data/MSR-VTT/anns		data/MSR-VTT/anns
demo		demo
scripts		scripts
tvr		tvr
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Repository files navigation

Disentangled Representation Learning for Text-Video Retrieval

This is a PyTorch implementation of the paperDisentangled Representation Learning for Text-Video Retrieval:

@Article{DRLTVR2022,  author  = {Qiang Wang and Yanhao Zhang and Yun Zheng and Pan Pan and Xian-Sheng Hua},  journal = {arXiv:2203.07111},  title   = {Disentangled Representation Learning for Text-Video Retrieval},  year    = {2022},}

Catalog

Setup
Fine-tuning code
Visualization demo

Setup

Setup code environment

git clone https://github.com/foolwood/DRL.gitcd DRLconda create -n drl python=3.9conda activate drlpip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html

Download CLIP Model (as pretraining)

cd tvr/modelswget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt

Download Datasets

cd data/MSR-VTTwget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip; unzip MSRVTT.zipmv MSRVTT/videos/all ./videos; mv MSRVTT/annotation/MSR_VTT.json ./anns/MSRVTT_data.json

Fine-tuning code

Train on MSR-VTT 1k.

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \main.py --do_train 1 --workers 8 --n_display 50 \--epochs 5 --lr 1e-4 --coef_lr 1e-3 --batch_size 128 --batch_size_val 128 \--anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/videos --datatype msrvtt \--max_words 32 --max_frames 12 --video_framerate 1 \--base_encoder ViT-B/32 --agg_module seqTransf \--interaction wti --wti_arch 2 --cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001 \--output_dir ckpts/ckpt_msrvtt_wti_cdcr

Reproduce the ablation experimentsscripts

configs	feature	gpus	Text-Video					Video-Text					train time (h)
configs	feature	gpus	R@1	R@5	R@10	MdR	MnR	R@1	R@5	R@10	MdR	MnR	train time (h)
CLIP4Clip	ViT/B-32	4	42.8	72.1	81.4	2.0	16.3	44.1	70.5	80.5	2.0	11.8	10.5
zero-shot	ViT/B-32	4	31.1	53.7	63.4	4.0	41.6	26.5	50.1	61.7	5.0	39.9	-
Interaction
DP+None	ViT/B-32	4	42.9	70.6	81.4	2.0	15.4	43.0	71.1	81.1	2.0	11.8	2.5
DP+seqTransf	ViT/B-32	4	42.8	71.1	81.1	2.0	15.6	44.1	70.9	80.9	2.0	11.7	2.6
XTI+None	ViT/B-32	4	40.5	71.1	82.6	2.0	13.6	42.7	70.8	80.2	2.0	12.5	14.3
XTI+seqTransf	ViT/B-32	4	42.4	71.3	80.9	2.0	15.2	40.1	69.2	79.6	2.0	15.8	16.8
TI+seqTransf	ViT/B-32	4	44.8	73.0	82.2	2.0	13.4	42.6	72.7	82.8	2.0	9.1	2.6
WTI+seqTransf	ViT/B-32	4	46.6	73.4	83.5	2.0	13.0	45.4	73.4	81.9	2.0	9.2	2.6
Channel DeCorrelation Regularization
DP+seqTransf+CDCR	ViT/B-32	4	43.9	71.1	81.2	2.0	15.3	42.3	70.3	81.1	2.0	11.4	2.6
TI+seqTransf+CDCR	ViT/B-32	4	45.8	73.0	81.9	2.0	12.8	43.3	71.8	82.7	2.0	8.9	2.6
WTI+seqTransf+CDCR	ViT/B-32	4	47.6	73.4	83.3	2.0	12.8	45.1	72.9	83.5	2.0	9.2	2.6

Note: the performances are slight boosts due to new hyperparameters.

Visualization demo

Run our visualization demo usingmatplotlib (no GPU needed):

License

SeeLICENSE for details.

Acknowledgments

Our code is partly based onCLIP4Clip.

About

[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

Releases

No releases published

Packages

No packages published

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Disentangled Representation Learning for Text-Video Retrieval

Catalog

Setup

Setup code environment

Download CLIP Model (as pretraining)

Download Datasets

Fine-tuning code

Visualization demo

License

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Contributors3

Uh oh!

Languages

Movatterモバイル変換

License

foolwood/DRL

Folders and files

Latest commit

History

Repository files navigation

Disentangled Representation Learning for Text-Video Retrieval

Catalog

Setup

Setup code environment

Download CLIP Model (as pretraining)

Download Datasets

Fine-tuning code

Visualization demo

License

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Contributors3

Uh oh!

Languages

Packages