We present The Vault, an open-source dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We propose methods for thoroughly extracting samples, using both rules and deep learning, to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. We thoroughly evaluated this dataset and found that common code language models (such as CodeT5, CodeBERT, and CodeGen) trained on it outperform the same models trained on other datasets such as CodeSearchNet. These evaluations covered common coding tasks such as code generation, code summarization, and code search. Researchers and practitioners can use The Vault to train a wide range of large language models that understand code. Alternatively, researchers can use our data-cleaning methods and scripts to improve their own datasets. We anticipate that training large language models on The Vault will improve their ability to understand and generate code, propelling AI research and software development forward. We are releasing our source code and a framework to make it easier for others to replicate our results.
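For readers who want to experiment with the dataset, the sketch below shows one way to stream a split and apply a toy rule-based filter in the spirit of the cleaning the abstract describes. The Hugging Face dataset identifier (`Fsoft-AIC/the-vault-function`) and the field names (`code`, `docstring`) are assumptions based on the public release and may differ from the actual dataset card; this is a minimal illustration, not the authors' pipeline.

```python
# Minimal sketch: stream a Vault split and apply a toy rule-based filter.
# ASSUMPTIONS: the dataset id "Fsoft-AIC/the-vault-function" and the field
# names "code" / "docstring" are assumed to mirror the public release;
# verify against the dataset card (a language/config argument may also
# be required) before relying on them.
from datasets import load_dataset

# Stream to avoid downloading all 43M pairs at once.
ds = load_dataset(
    "Fsoft-AIC/the-vault-function",  # assumed dataset id
    split="train",
    streaming=True,
)

def keep(example: dict) -> bool:
    """Toy stand-in for rule-based cleaning: drop pairs whose docstring
    is empty, very short, or obvious placeholder text."""
    doc = (example.get("docstring") or "").strip()
    if len(doc.split()) < 3:  # too short to describe the code
        return False
    if doc.lower().startswith(("todo", "fixme")):  # placeholder text
        return False
    return bool(example.get("code"))

filtered = ds.filter(keep)
for pair in filtered.take(3):
    print(pair["docstring"][:80])
```

Streaming keeps memory use flat; to build a fine-tuning corpus one would materialize the filtered iterator and tokenize it with the target model's tokenizer.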
@inproceedings{nguyen-etal-2023-vault,
    title = "The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation",
    author = "Nguyen, Dung and Nam, Le and Dau, Anh and Nguyen, Anh and Nghiem, Khanh and Guo, Jin and Bui, Nghi",
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.316/",
    doi = "10.18653/v1/2023.findings-emnlp.316",
    pages = "4763--4788",
}
[The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation](https://aclanthology.org/2023.findings-emnlp.316/) (Nguyen et al., Findings 2023)