naist-nlp/multils-japanese

MultiLS-Japanese Lexical Complexity Prediction and Lexical Simplification Dataset for Japanese: annotator profiles, unaggregated annotation, and annotation guidelines.


MultiLS-Japanese

MultiLS-Japanese is a lexical complexity prediction (LCP) and lexical simplification (LS) dataset for Japanese.

This repository provides:

Additional data for the original annotation, which was used to evaluate the MLSP 2024 shared task:
- LCP and LS annotator profiles. Note that each instance in both the trial and test data was annotated by the same annotators.
- Unaggregated trial and test ratings for LCP, which can be merged with the Japanese dataset using the id column.
- Empty Excel templates used for annotation, including our annotation guidelines and the exact questions we asked in the annotator profiles.
Non-Chinese/Korean L1 replication of the LCP trial set:
- Annotator profiles.
- Ratings (both unaggregated and aggregated values).
Chinese L1 reannotation of the LCP trial set:
- Annotator profiles.
- Ratings (both unaggregated and aggregated values).
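
The merge on the id column mentioned above can be sketched as follows. This is only an illustration using the Python standard library: the inline tab-separated samples, the example ids, and every column name other than id (target, complexity, annotator, rating) are hypothetical placeholders, not the actual layout of the files in this repository or on Hugging Face Hub.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for the real files.
# Only the existence of a shared "id" column is stated in this README.
DATASET_TSV = "id\ttarget\tcomplexity\nja_001\t語彙\t0.25\nja_002\t平易\t0.50\n"
RATINGS_TSV = "id\tannotator\trating\nja_001\tA1\t1\nja_001\tA2\t2\nja_002\tA1\t3\n"


def read_rows(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_on_id(dataset_rows, rating_rows):
    """Attach each instance's unaggregated ratings to its dataset row via the shared id."""
    ratings_by_id = defaultdict(dict)
    for row in rating_rows:
        ratings_by_id[row["id"]][row["annotator"]] = int(row["rating"])
    # Instances without any rating row get an empty dict.
    return [
        {**row, "ratings": ratings_by_id.get(row["id"], {})}
        for row in dataset_rows
    ]


merged = merge_on_id(read_rows(DATASET_TSV), read_rows(RATINGS_TSV))
for row in merged:
    print(row["id"], row["target"], row["ratings"])
```

With the real files, replace the inline samples with the downloaded dataset and the unaggregated ratings, and adjust the delimiter and column names to match.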

The last two trial set annotations were used for analysis in “Difficult for Whom? A Study of Japanese Lexical Complexity” (Nohejl et al., 2024). Only the original data was used for the MLSP shared task (Shardlow et al., 2024).

The LS and LCP Data

MultiLS-Japanese (Japanese only), with extended annotation fields, is available on Hugging Face Hub.

Please get the data for all 10 languages, including Japanese (original annotation), from the MLSP2024 dataset on Hugging Face Hub. This multils-japanese repository only provides additional data specific to the Japanese subset of the MultiLS (MLSP2024) dataset.

All annotated data is available exclusively via Hugging Face to avoid leaking it to the open web, where it could contaminate LLM training data.

Papers

The MultiLS-Japanese dataset was created by Adam Nohejl, Akio Hayakawa, and Yusuke Ide. A journal paper about the dataset is scheduled for publication in December 2025.

MultiLS-Japanese: Analysis and Additional Annotation

@inproceedings{nohejl-etal-2024-difficult,
  title = {Difficult for {{Whom}}? {{A Study}} of {{Japanese Lexical Complexity}}},
  author = {Nohejl, Adam and Hayakawa, Akio and Ide, Yusuke and Watanabe, Taro},
  booktitle = "Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)",
  year = {2024},
  url = "https://aclanthology.org/2024.tsar-1.8",
}

MultiLS (all MLSP2024 data): Shared Task Report and Dataset

@inproceedings{shardlow2024bea,
  title={{The BEA 2024 Shared Task on the Multilingual Lexical Simplification Pipeline}},
  author={Shardlow, Matthew and Alva-Manchego, Fernando and Batista-Navarro, Riza and Bott, Stefan and Calderon Ramirez, Saul and Cardon, Rémi and François, Thomas and Hayakawa, Akio and Horbach, Andrea and Huelsing, Anna and Ide, Yusuke and Imperial, Joseph Marvin and Nohejl, Adam and North, Kai and Occhipinti, Laura and Peréz Rojas, Nelson and Raihan, Nishat and Ranasinghe, Tharindu and Solis Salazar, Martin and \v{S}tajner, Sanja and Zampieri, Marcos and Saggion, Horacio},
  booktitle={Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA)},
  year={2024}
}

MultiLS (all MLSP2024 data): Dataset Creation

@inproceedings{shardlow2024readi,
  title={{An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework}},
  author={Shardlow, Matthew and Alva-Manchego, Fernando and Batista-Navarro, Riza and Bott, Stefan and Calderon Ramirez, Saul and Cardon, Rémi and François, Thomas and Hayakawa, Akio and Horbach, Andrea and Huelsing, Anna and Ide, Yusuke and Imperial, Joseph Marvin and Nohejl, Adam and North, Kai and Occhipinti, Laura and Peréz Rojas, Nelson and Raihan, Nishat and Ranasinghe, Tharindu and Solis Salazar, Martin and Zampieri, Marcos and Saggion, Horacio},
  booktitle={Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI)},
  year={2024}
}

Related Work

JaLeCoN, a Dataset of Japanese Lexical Complexity for Non-Native Readers

Paper. The annotation was done with a slightly different scale and in a dense setting.

@inproceedings{ide2023,
  title     = "Japanese Lexical Complexity for Non-Native Readers: A New Dataset",
  author    = "Ide, Yusuke and Mita, Masato and Nohejl, Adam and Ouchi, Hiroki and Watanabe, Taro",
  booktitle = "Proceedings of the Eighteenth Workshop on Innovative Use of {NLP} for Building Educational Applications",
  month     = July,
  year      = 2023,
  publisher = "Association for Computational Linguistics",
}

MultiLS Framework

@article{north2024multils,
  title={MultiLS: A Multi-task Lexical Simplification Framework},
  author={North, Kai and Ranasinghe, Tharindu and Shardlow, Matthew and Zampieri, Marcos},
  journal={arXiv preprint arXiv:2402.14972},
  year={2024}
}

MultiLS-SP/CA: Spanish and Catalan Datasets

@misc{bott2024multilsspca,
  title={MultiLS-SP/CA: Lexical Complexity Prediction and Lexical Simplification Resources for Catalan and Spanish},
  author={Stefan Bott and Horacio Saggion and Nelson Peréz Rojas and Martin Solis Salazar and Saul Calderon Ramirez},
  year={2024},
  eprint={2404.07814},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

You may also be interested in Japanese lexical simplification datasets targeting native speakers (by different authors):

- Controlled and Balanced Dataset for Japanese Lexical Simplification (Kodaira et al., 2016): dataset
- Evaluation Dataset and System for Japanese Lexical Simplification (Kajiwara and Yamamoto, 2015): dataset

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Please cite our papers if you use the data.

See the sources and license information for the trial set and test set for details.
