A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive to tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require to work—incurring additional costs by orders of magnitude—which does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.
Alon Jacovi, Avi Caciularu, Jonathan Herzig, Roee Aharoni, Bernd Bohnet, and Mor Geva. 2023. A Comprehensive Evaluation of Tool-Assisted Generation Strategies. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13856–13878, Singapore. Association for Computational Linguistics.
@inproceedings{jacovi-etal-2023-comprehensive, title = "A Comprehensive Evaluation of Tool-Assisted Generation Strategies", author = "Jacovi, Alon and Caciularu, Avi and Herzig, Jonathan and Aharoni, Roee and Bohnet, Bernd and Geva, Mor", editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika", booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023", month = dec, year = "2023", address = "Singapore", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.findings-emnlp.926/", doi = "10.18653/v1/2023.findings-emnlp.926", pages = "13856--13878", abstract = "A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive to tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require to work{---}incurring additional costs by orders of magnitude{---}which does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*."}
<?xml version="1.0" encoding="UTF-8"?><modsCollection xmlns="http://www.loc.gov/mods/v3"><mods ID="jacovi-etal-2023-comprehensive"> <titleInfo> <title>A Comprehensive Evaluation of Tool-Assisted Generation Strategies</title> </titleInfo> <name type="personal"> <namePart type="given">Alon</namePart> <namePart type="family">Jacovi</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Avi</namePart> <namePart type="family">Caciularu</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Jonathan</namePart> <namePart type="family">Herzig</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Roee</namePart> <namePart type="family">Aharoni</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Bernd</namePart> <namePart type="family">Bohnet</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Mor</namePart> <namePart type="family">Geva</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2023-12</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <relatedItem type="host"> <titleInfo> <title>Findings of the Association for Computational Linguistics: EMNLP 2023</title> </titleInfo> <name type="personal"> <namePart type="given">Houda</namePart> <namePart type="family">Bouamor</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Juan</namePart> <namePart type="family">Pino</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> 
</name> <name type="personal"> <namePart type="given">Kalika</namePart> <namePart type="family">Bali</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>Association for Computational Linguistics</publisher> <place> <placeTerm type="text">Singapore</placeTerm> </place> </originInfo> <genre authority="marcgt">conference publication</genre> </relatedItem> <abstract>A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive to tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require to work—incurring additional costs by orders of magnitude—which does not translate into significant improvement in performance. 
Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.</abstract> <identifier type="citekey">jacovi-etal-2023-comprehensive</identifier> <identifier type="doi">10.18653/v1/2023.findings-emnlp.926</identifier> <location> <url>https://aclanthology.org/2023.findings-emnlp.926/</url> </location> <part> <date>2023-12</date> <extent unit="page"> <start>13856</start> <end>13878</end> </extent> </part></mods></modsCollection>
%0 Conference Proceedings
%T A Comprehensive Evaluation of Tool-Assisted Generation Strategies
%A Jacovi, Alon
%A Caciularu, Avi
%A Herzig, Jonathan
%A Aharoni, Roee
%A Bohnet, Bernd
%A Geva, Mor
%Y Bouamor, Houda
%Y Pino, Juan
%Y Bali, Kalika
%S Findings of the Association for Computational Linguistics: EMNLP 2023
%D 2023
%8 December
%I Association for Computational Linguistics
%C Singapore
%F jacovi-etal-2023-comprehensive
%X A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive to tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require to work—incurring additional costs by orders of magnitude—which does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.
%R 10.18653/v1/2023.findings-emnlp.926
%U https://aclanthology.org/2023.findings-emnlp.926/
%U https://doi.org/10.18653/v1/2023.findings-emnlp.926
%P 13856-13878