Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between the text data used during training and the images used during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further improves the accuracy of generated captions through a fusion module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap (**I**mage-like Retrieval and **F**requency-based Entity Filtering for Zero-shot **Cap**tioning). Extensive experiments demonstrate that our straightforward yet powerful approach outperforms state-of-the-art text-only-trained zero-shot captioning methods by a significant margin in both image captioning and video captioning.
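The sketch below is a minimal, self-contained illustration of the two ideas named in the abstract, written against plain NumPy rather than the authors' released code. The abstract does not spell out the mechanisms, so the noise-perturbed text query (to make a text feature behave more like an image-side feature), the retrieval size `k`, and the frequency threshold `min_count` are illustrative assumptions, not IFCap's actual implementation or hyperparameters.

```python
# Hedged sketch: "image-like" retrieval over precomputed CLIP-style text embeddings,
# followed by frequency-based filtering of entities found in the retrieved captions.
# All constants here are illustrative, not the paper's reported settings.
import numpy as np
from collections import Counter

def l2_normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def image_like_retrieval(query_text_emb, corpus_embs, k=5, noise_std=0.04, rng=None):
    """Retrieve the k captions closest to a noise-perturbed text embedding.
    The Gaussian perturbation is an assumed stand-in for making the text query
    resemble an image embedding at inference time (the 'modality gap')."""
    rng = np.random.default_rng() if rng is None else rng
    query = l2_normalize(query_text_emb + rng.normal(0.0, noise_std, query_text_emb.shape))
    sims = l2_normalize(corpus_embs) @ query      # cosine similarity to every corpus caption
    return np.argsort(-sims)[:k]                  # indices of the top-k retrieved captions

def frequency_based_entity_filter(retrieved_entities, min_count=2):
    """Keep only entities that occur in at least `min_count` of the retrieved captions."""
    counts = Counter(e for ents in retrieved_entities for e in set(ents))
    return [e for e, c in counts.items() if c >= min_count]

# Toy usage with random vectors standing in for CLIP features.
rng = np.random.default_rng(0)
corpus = l2_normalize(rng.normal(size=(1000, 512)))
query = corpus[42]                                # pretend this is the input's text feature
top_ids = image_like_retrieval(query, corpus, k=5, rng=rng)
entities = [["dog", "frisbee"], ["dog", "park"], ["dog"], ["cat"], ["park", "dog"]]
print(top_ids, frequency_based_entity_filter(entities, min_count=2))
```

In the full pipeline described by the abstract, the retrieved captions would additionally be fused with the input features by a learned fusion module before caption decoding; that component is omitted from this sketch.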
@inproceedings{lee-etal-2024-ifcap,
    title = "{IFC}ap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning",
    author = "Lee, Soeun and Kim, Si-Woo and Kim, Taewhan and Kim, Dong-Jin",
    editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.1153/",
    doi = "10.18653/v1/2024.emnlp-main.1153",
    pages = "20715--20727",
}
<?xml version="1.0" encoding="UTF-8"?><modsCollection xmlns="http://www.loc.gov/mods/v3"><mods ID="lee-etal-2024-ifcap"> <titleInfo> <title>IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning</title> </titleInfo> <name type="personal"> <namePart type="given">Soeun</namePart> <namePart type="family">Lee</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Si-Woo</namePart> <namePart type="family">Kim</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Taewhan</namePart> <namePart type="family">Kim</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Dong-Jin</namePart> <namePart type="family">Kim</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2024-11</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <relatedItem type="host"> <titleInfo> <title>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</title> </titleInfo> <name type="personal"> <namePart type="given">Yaser</namePart> <namePart type="family">Al-Onaizan</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Mohit</namePart> <namePart type="family">Bansal</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Yun-Nung</namePart> <namePart type="family">Chen</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>Association for Computational Linguistics</publisher> <place> <placeTerm type="text">Miami, Florida, USA</placeTerm> </place> </originInfo> <genre authority="marcgt">conference publication</genre> </relatedItem> <abstract>Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a fusion module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap (**I**mage-like Retrieval and **F**requency-based Entity Filtering for Zero-shot **Cap**tioning). 
Through extensive experimentation, our straightforward yet powerful approach has demonstrated its efficacy, outperforming the state-of-the-art methods by a significant margin in both image captioning and video captioning compared to zero-shot captioning based on text-only training.</abstract> <identifier type="citekey">lee-etal-2024-ifcap</identifier> <identifier type="doi">10.18653/v1/2024.emnlp-main.1153</identifier> <location> <url>https://aclanthology.org/2024.emnlp-main.1153/</url> </location> <part> <date>2024-11</date> <extent unit="page"> <start>20715</start> <end>20727</end> </extent> </part></mods></modsCollection>
[IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning](https://aclanthology.org/2024.emnlp-main.1153/) (Lee et al., EMNLP 2024)