naist-nlp/multils-japanese
MultiLS-Japanese is a lexical complexity prediction (LCP) and lexical simplification (LS) dataset for Japanese.
This repository provides:
Additional data for the original annotation, which was used to evaluate the MLSP 2024 shared task:
- LCP and LS annotator profiles. Note that every instance in both the trial and test data was annotated by the same annotators.
- Unaggregated trial and test ratings for LCP that can be merged with the Japanese dataset using the `id` column.
- Empty Excel templates used for annotation, including our annotation guidelines and the exact questions we asked in the annotator profiles.
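As a minimal sketch of the merge on the `id` column, the snippet below joins per-annotator ratings onto aggregated instances with pandas. The two data frames are toy stand-ins built inline, and every column name other than `id` (`target`, `complexity`, `annotator`, `rating`), as well as the example values, is a hypothetical placeholder and not the actual schema of the released files.

```python
import pandas as pd

# Toy stand-in for the aggregated Japanese LCP data.
# Only the `id` column is taken from the README; the rest is hypothetical.
aggregated = pd.DataFrame({
    "id": ["ja_001", "ja_002"],
    "target": ["語彙", "平易"],
    "complexity": [0.25, 0.5],
})

# Toy stand-in for the unaggregated ratings: one row per (instance, annotator).
unaggregated = pd.DataFrame({
    "id": ["ja_001", "ja_001", "ja_002", "ja_002"],
    "annotator": ["A1", "A2", "A1", "A2"],
    "rating": [0.0, 0.5, 0.5, 0.5],
})

# Attach instance-level fields to every individual rating.
# validate="many_to_one" raises if `id` is not unique in `aggregated`.
merged = unaggregated.merge(aggregated, on="id", how="left", validate="many_to_one")

# Example use: recompute a per-instance mean from the individual ratings.
mean_by_id = merged.groupby("id")["rating"].mean()
```

A left join keeps every individual rating even if an `id` were missing from the aggregated side, which makes such mismatches visible as missing values instead of silently dropping rows.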
Non-Chinese/Korean L1 replication of the LCP trial set:
- Annotator profiles.
- Ratings (both unaggregated and aggregated values).
Chinese L1 reannotation of the LCP trial set:
- Annotator profiles.
- Ratings (both unaggregated and aggregated values).
The last two trial set annotations were used for analysis in “Difficult for Whom? A Study of Japanese Lexical Complexity” (Nohejl et al., 2024). Only the original data was used for the MLSP shared task (Shardlow et al., 2024).
MultiLS-Japanese (only Japanese language) with extended annotation fields on Hugging Face Hub.
Please get the data for all 10 languages, including Japanese (original annotation), from the MLSP2024 dataset on Hugging Face Hub. This `multils-japanese` repository only provides additional data specific to the Japanese subset of the MultiLS (MLSP2024) dataset.
All annotated data is available exclusively via Hugging Face so that it does not leak to the open web, where it could contaminate LLM training data.
The MultiLS-Japanese dataset was created by Adam Nohejl, Akio Hayakawa, and Yusuke Ide. A journal paper about the dataset is scheduled for publication in December 2025.
@inproceedings{nohejl-etal-2024-difficult, title = {Difficult for {{Whom}}? {{A Study}} of {{Japanese Lexical Complexity}}}, author = {Nohejl, Adam and Hayakawa, Akio and Ide, Yusuke and Watanabe, Taro}, booktitle = "Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)", year = {2024}, url = "https://aclanthology.org/2024.tsar-1.8",}
@inproceedings{shardlow2024bea, title={{The BEA 2024 Shared Task on the Multilingual Lexical Simplification Pipeline}}, author={Shardlow, Matthew and Alva-Manchego, Fernando and Batista-Navarro, Riza and Bott, Stefan and Calderon Ramirez, Saul and Cardon, Rémi and François, Thomas and Hayakawa, Akio and Horbach, Andrea and Huelsing, Anna and Ide, Yusuke and Imperial, Joseph Marvin and Nohejl, Adam and North, Kai and Occhipinti, Laura and Peréz Rojas, Nelson and Raihan, Nishat and Ranasinghe, Tharindu and Solis Salazar, Martin and \v{S}tajner, Sanja and Zampieri, Marcos and Saggion, Horacio}, booktitle={Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA)},year={2024}}
@inproceedings{shardlow2024readi, title={{An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework}}, author={Shardlow, Matthew and Alva-Manchego, Fernando and Batista-Navarro, Riza and Bott, Stefan and Calderon Ramirez, Saul and Cardon, Rémi and François, Thomas and Hayakawa, Akio and Horbach, Andrea and Huelsing, Anna and Ide, Yusuke and Imperial, Joseph Marvin and Nohejl, Adam and North, Kai and Occhipinti, Laura and Peréz Rojas, Nelson and Raihan, Nishat and Ranasinghe, Tharindu and Solis Salazar, Martin and Zampieri, Marcos and Saggion, Horacio}, booktitle={Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI)},year={2024}}
JaLeCoN, a Dataset of Japanese Lexical Complexity for Non-Native Readers
Paper. The annotation was done with a slightly different scale and in a dense setting.
@inproceedings{ide2023, title = "Japanese Lexical Complexity for Non-Native Readers: A New Dataset", author = "Ide, Yusuke and Mita, Masato and Nohejl, Adam and Ouchi, Hiroki and Watanabe, Taro", booktitle = "Proceedings of the Eighteenth Workshop on Innovative Use of {NLP} for Building Educational Applications", month = July, year = 2023, publisher = "Association for Computational Linguistics",}
@article{north2024multils, title={MultiLS: A Multi-task Lexical Simplification Framework}, author={North, Kai and Ranasinghe, Tharindu and Shardlow, Matthew and Zampieri, Marcos}, journal={arXiv preprint arXiv:2402.14972}, year={2024}}
@misc{bott2024multilsspca, title={MultiLS-SP/CA: Lexical Complexity Prediction and Lexical Simplification Resources for Catalan and Spanish}, author={Stefan Bott and Horacio Saggion and Nelson Peréz Rojas and Martin Solis Salazar and Saul Calderon Ramirez}, year={2024}, eprint={2404.07814}, archivePrefix={arXiv}, primaryClass={cs.CL}}
You may also be interested in Japanese lexical simplification datasets targeting native speakers (by different authors):
- Controlled and Balanced Dataset for Japanese Lexical Simplification (Kodaira et al., 2016): dataset
- Evaluation Dataset and System for Japanese Lexical Simplification (Kajiwara and Yamamoto, 2015): dataset
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Please cite our papers if you use the data.
See the sources and license information for the trial set and test set for details.