[arXiv22] Disentangled Representation Learning for Text-Video Retrieval


foolwood/DRL


This is a PyTorch implementation of the paper Disentangled Representation Learning for Text-Video Retrieval:

    @Article{DRLTVR2022,
      author  = {Qiang Wang and Yanhao Zhang and Yun Zheng and Pan Pan and Xian-Sheng Hua},
      journal = {arXiv:2203.07111},
      title   = {Disentangled Representation Learning for Text-Video Retrieval},
      year    = {2022},
    }

Catalog

  • Setup
  • Fine-tuning code
  • Visualization demo

Setup

Setup code environment

    git clone https://github.com/foolwood/DRL.git
    cd DRL
    conda create -n drl python=3.9
    conda activate drl
    pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
    pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html

Download CLIP Model (as pretraining)

    cd tvr/models
    wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
    # wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
    # wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
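A truncated download can be caught before training. The sketch below assumes that the long hexadecimal component in each CLIP URL is the SHA-256 digest of the checkpoint file, which is how the official CLIP loader treats it. The helper names (`expected_sha256_from_url`, `file_sha256`, `checkpoint_is_intact`) are hypothetical and not part of this repository.

```python
import hashlib
from pathlib import Path

# URL copied from the download step above.
VIT_B_32_URL = (
    "https://openaipublic.azureedge.net/clip/models/"
    "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
)


def expected_sha256_from_url(url: str) -> str:
    """Return the path component just before the file name.

    Assumption: this component is the SHA-256 digest of the file.
    """
    return url.rstrip("/").split("/")[-2]


def file_sha256(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_is_intact(path, url: str) -> bool:
    """Compare the digest of the local file with the digest embedded in the URL."""
    return file_sha256(path) == expected_sha256_from_url(url)


if __name__ == "__main__":
    ckpt = Path("tvr/models/ViT-B-32.pt")  # location used in the step above
    if ckpt.exists():
        print("checksum ok:", checkpoint_is_intact(ckpt, VIT_B_32_URL))
    else:
        print("checkpoint not found:", ckpt)
```

Run it from the repository root after the download; a `False` result suggests the file should be fetched again.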

Download Datasets

    cd data/MSR-VTT
    wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip; unzip MSRVTT.zip
    mv MSRVTT/videos/all ./videos; mv MSRVTT/annotation/MSR_VTT.json ./anns/MSRVTT_data.json

Fine-tuning code

  • Train on MSR-VTT 1k.
    CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
    main.py --do_train 1 --workers 8 --n_display 50 \
    --epochs 5 --lr 1e-4 --coef_lr 1e-3 --batch_size 128 --batch_size_val 128 \
    --anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/videos --datatype msrvtt \
    --max_words 32 --max_frames 12 --video_framerate 1 \
    --base_encoder ViT-B/32 --agg_module seqTransf \
    --interaction wti --wti_arch 2 --cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001 \
    --output_dir ckpts/ckpt_msrvtt_wti_cdcr

Reproduce the ablation experiments with the provided scripts.

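T2V = Text-Video retrieval, V2T = Video-Text retrieval.

| configs | feature | gpus | T2V R@1 | T2V R@5 | T2V R@10 | T2V MdR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MdR | V2T MnR | train time (h) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4Clip | ViT/B-32 | 4 | 42.8 | 72.1 | 81.4 | 2.0 | 16.3 | 44.1 | 70.5 | 80.5 | 2.0 | 11.8 | 10.5 |
| zero-shot | ViT/B-32 | 4 | 31.1 | 53.7 | 63.4 | 4.0 | 41.6 | 26.5 | 50.1 | 61.7 | 5.0 | 39.9 | - |
| Interaction | | | | | | | | | | | | | |
| DP+None | ViT/B-32 | 4 | 42.9 | 70.6 | 81.4 | 2.0 | 15.4 | 43.0 | 71.1 | 81.1 | 2.0 | 11.8 | 2.5 |
| DP+seqTransf | ViT/B-32 | 4 | 42.8 | 71.1 | 81.1 | 2.0 | 15.6 | 44.1 | 70.9 | 80.9 | 2.0 | 11.7 | 2.6 |
| XTI+None | ViT/B-32 | 4 | 40.5 | 71.1 | 82.6 | 2.0 | 13.6 | 42.7 | 70.8 | 80.2 | 2.0 | 12.5 | 14.3 |
| XTI+seqTransf | ViT/B-32 | 4 | 42.4 | 71.3 | 80.9 | 2.0 | 15.2 | 40.1 | 69.2 | 79.6 | 2.0 | 15.8 | 16.8 |
| TI+seqTransf | ViT/B-32 | 4 | 44.8 | 73.0 | 82.2 | 2.0 | 13.4 | 42.6 | 72.7 | 82.8 | 2.0 | 9.1 | 2.6 |
| WTI+seqTransf | ViT/B-32 | 4 | 46.6 | 73.4 | 83.5 | 2.0 | 13.0 | 45.4 | 73.4 | 81.9 | 2.0 | 9.2 | 2.6 |
| Channel DeCorrelation Regularization | | | | | | | | | | | | | |
| DP+seqTransf+CDCR | ViT/B-32 | 4 | 43.9 | 71.1 | 81.2 | 2.0 | 15.3 | 42.3 | 70.3 | 81.1 | 2.0 | 11.4 | 2.6 |
| TI+seqTransf+CDCR | ViT/B-32 | 4 | 45.8 | 73.0 | 81.9 | 2.0 | 12.8 | 43.3 | 71.8 | 82.7 | 2.0 | 8.9 | 2.6 |
| WTI+seqTransf+CDCR | ViT/B-32 | 4 | 47.6 | 73.4 | 83.3 | 2.0 | 12.8 | 45.1 | 72.9 | 83.5 | 2.0 | 9.2 | 2.6 |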

Note: performance is slightly improved due to new hyperparameters.
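For readers unfamiliar with the column names, here is a minimal sketch of how R@K, MdR (median rank) and MnR (mean rank) are commonly computed from a similarity matrix. It assumes a square matrix in which query i matches candidate i, so the ground truth lies on the diagonal. The function name `retrieval_metrics` is hypothetical, and this is not necessarily the exact evaluation code used in this repository.

```python
import numpy as np


def retrieval_metrics(sim: np.ndarray) -> dict:
    """Compute recall and rank statistics.

    sim[i, j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    # Score of the correct candidate for every query, as a column vector.
    gt = np.diag(sim)[:, None]
    # Rank = 1 + number of candidates that score strictly higher than the correct one.
    ranks = (sim > gt).sum(axis=1) + 1
    return {
        "R@1": 100.0 * float(np.mean(ranks <= 1)),
        "R@5": 100.0 * float(np.mean(ranks <= 5)),
        "R@10": 100.0 * float(np.mean(ranks <= 10)),
        "MdR": float(np.median(ranks)),
        "MnR": float(np.mean(ranks)),
    }


if __name__ == "__main__":
    # Toy text-by-video similarity matrix with three pairs.
    sim = np.array([
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.5],
        [0.1, 0.2, 0.7],
    ])
    print("Text-Video:", retrieval_metrics(sim))
    # Video-Text retrieval uses the transposed matrix.
    print("Video-Text:", retrieval_metrics(sim.T))
```

Higher is better for R@K; lower is better for MdR and MnR.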

Visualization demo

Run our visualization demo using matplotlib (no GPU needed).

License

See LICENSE for details.

Acknowledgments

Our code is partly based on CLIP4Clip.
