The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., “wine” (word-level) in English vs. “v i n” (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible – a desideratum that so far has been neglected in multilingual models.
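The paper's compatibility measure is not reproduced in this abstract; as an illustrative sketch only, the kind of mismatch it targets can be made concrete with a simple proxy: tokens-per-word ("fertility") for parallel word lists in two languages. The function names and the gap score below are hypothetical, not the paper's actual metric.

```python
# Illustrative sketch (NOT the paper's measure): compare how finely two
# tokenizers split parallel words, using average tokens per word
# ("fertility") as a rough proxy for tokenization granularity.

def fertility(tokenized_words):
    """Average number of subword tokens per word."""
    return sum(len(toks) for toks in tokenized_words) / len(tokenized_words)

def granularity_gap(tok_a, tok_b):
    """Absolute fertility difference between two tokenizations of
    parallel word lists. 0.0 means equally granular; larger values
    suggest less compatible tokenizations."""
    return abs(fertility(tok_a) - fertility(tok_b))

# The abstract's example: English "wine" kept whole (word-level)
# vs. French "vin" split into characters.
english = [["wine"]]            # 1 token per word
french = [["v", "i", "n"]]      # 3 tokens per word
print(granularity_gap(english, french))  # → 2.0
```

A gap of 0.0 would indicate the two vocabularies segment parallel text at comparable granularity, the desideratum the paper argues for.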
Antonis Maronikolakis, Philipp Dufter, and Hinrich Schütze. 2021. Wine is not v i n. On the Compatibility of Tokenizations across Languages. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2382–2399, Punta Cana, Dominican Republic. Association for Computational Linguistics.
@inproceedings{maronikolakis-etal-2021-wine-v, title = "Wine is not v i n. On the Compatibility of Tokenizations across Languages", author = {Maronikolakis, Antonis and Dufter, Philipp and Sch{\"u}tze, Hinrich}, editor = "Moens, Marie-Francine and Huang, Xuanjing and Specia, Lucia and Yih, Scott Wen-tau", booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021", month = nov, year = "2021", address = "Punta Cana, Dominican Republic", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.findings-emnlp.205/", doi = "10.18653/v1/2021.findings-emnlp.205", pages = "2382--2399", abstract = "The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., {\textquotedblleft}wine{\textquotedblright} (word-level) in English vs. {\textquotedblleft}v i n{\textquotedblright} (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible {--} a desideratum that so far has been neglected in multilingual models."}
<?xml version="1.0" encoding="UTF-8"?><modsCollection xmlns="http://www.loc.gov/mods/v3"><mods ID="maronikolakis-etal-2021-wine-v"> <titleInfo> <title>Wine is not v i n. On the Compatibility of Tokenizations across Languages</title> </titleInfo> <name type="personal"> <namePart type="given">Antonis</namePart> <namePart type="family">Maronikolakis</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Philipp</namePart> <namePart type="family">Dufter</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Hinrich</namePart> <namePart type="family">Schütze</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2021-11</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <relatedItem type="host"> <titleInfo> <title>Findings of the Association for Computational Linguistics: EMNLP 2021</title> </titleInfo> <name type="personal"> <namePart type="given">Marie-Francine</namePart> <namePart type="family">Moens</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Xuanjing</namePart> <namePart type="family">Huang</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Lucia</namePart> <namePart type="family">Specia</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Scott</namePart> <namePart type="given">Wen-tau</namePart> <namePart type="family">Yih</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>Association for Computational Linguistics</publisher> <place> <placeTerm 
type="text">Punta Cana, Dominican Republic</placeTerm> </place> </originInfo> <genre authority="marcgt">conference publication</genre> </relatedItem> <abstract>The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., “wine” (word-level) in English vs. “v i n” (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible – a desideratum that so far has been neglected in multilingual models.</abstract> <identifier type="citekey">maronikolakis-etal-2021-wine-v</identifier> <identifier type="doi">10.18653/v1/2021.findings-emnlp.205</identifier> <location> <url>https://aclanthology.org/2021.findings-emnlp.205/</url> </location> <part> <date>2021-11</date> <extent unit="page"> <start>2382</start> <end>2399</end> </extent> </part></mods></modsCollection>
%0 Conference Proceedings%T Wine is not v i n. On the Compatibility of Tokenizations across Languages%A Maronikolakis, Antonis%A Dufter, Philipp%A Schütze, Hinrich%Y Moens, Marie-Francine%Y Huang, Xuanjing%Y Specia, Lucia%Y Yih, Scott Wen-tau%S Findings of the Association for Computational Linguistics: EMNLP 2021%D 2021%8 November%I Association for Computational Linguistics%C Punta Cana, Dominican Republic%F maronikolakis-etal-2021-wine-v%X The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., “wine” (word-level) in English vs. “v i n” (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible – a desideratum that so far has been neglected in multilingual models.%R 10.18653/v1/2021.findings-emnlp.205%U https://aclanthology.org/2021.findings-emnlp.205/%U https://doi.org/10.18653/v1/2021.findings-emnlp.205%P 2382-2399
[Wine is not v i n. On the Compatibility of Tokenizations across Languages](https://aclanthology.org/2021.findings-emnlp.205/) (Maronikolakis et al., Findings 2021)