verypluming/JSICK
JSICK is a Japanese NLI and STS dataset created by manually translating the English dataset SICK (Marelli et al., 2014) into Japanese. We hope that our dataset will be useful in research for realizing more advanced models that are capable of appropriately performing multilingual compositional inference. The JSICK dataset and the JSICK-stress test set are also available as a Hugging Face dataset.
The dataset is split into `train.tsv` and `test.tsv`.
Name | Description |
---|---|
pair_ID | IDs (the same as in the original SICK) |
sentence_A_En | first sentence in English |
sentence_B_En | second sentence in English |
entailment_label_En | original entailment label in English |
relatedness_score_En | original relatedness score in the range [1-5] in English |
corr_entailment_labelAB_En | corrected entailment label from A to B in English by (Kalouli et al., 2017) |
corr_entailment_labelBA_En | corrected entailment label from B to A in English by (Kalouli et al., 2017) |
sentence_A_Ja | first sentence in Japanese |
sentence_B_Ja | second sentence in Japanese |
entailment_label_Ja | entailment label in Japanese |
relatedness_score_Ja | relatedness score in the range [1-5] in Japanese |
image_ID | original image in the 8K ImageFlickr dataset |
original_caption | original caption in the 8K ImageFlickr dataset |
semtag_short | linguistic phenomena tags in Japanese |
semtag_long | details of linguistic phenomena tags in Japanese |
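
As a minimal sketch of how the columns above might be read, the snippet below parses a JSICK-style TSV with Python's standard `csv` module and summarizes the Japanese labels and relatedness scores. The helper names (`read_jsick`, `label_counts`, `mean_relatedness`), the two sample rows, and the label strings in them are invented for illustration and are not taken from the dataset; the snippet also assumes the file has a header row and no quoted fields.

```python
import csv
import io
from collections import Counter


def read_jsick(handle):
    """Read a JSICK-style TSV from an open text handle into a list of dicts."""
    # QUOTE_NONE treats quote characters as ordinary text (an assumption
    # about the file format, not something the README states).
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def label_counts(rows, column="entailment_label_Ja"):
    """Count how often each label occurs in the given column."""
    return Counter(row[column] for row in rows)


def mean_relatedness(rows, column="relatedness_score_Ja"):
    """Average the relatedness scores in the given column."""
    scores = [float(row[column]) for row in rows]
    return sum(scores) / len(scores)


# Made-up two-row sample with a subset of the columns; with the real data
# you would pass open("train.tsv", encoding="utf-8", newline="") instead.
SAMPLE = (
    "pair_ID\tsentence_A_Ja\tsentence_B_Ja\t"
    "entailment_label_Ja\trelatedness_score_Ja\n"
    "1\t男性がギターを弾いている\t人が楽器を演奏している\tentailment\t4.5\n"
    "2\t男性がギターを弾いている\t女性が歌っている\tneutral\t2.5\n"
)

rows = read_jsick(io.StringIO(SAMPLE))
print(label_counts(rows))
print(mean_relatedness(rows))
```

The same helpers accept any column name from the table, so passing `column="entailment_label_En"` would summarize the original English labels instead.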
The JSICK-stress test set is designed to investigate whether models capture word order and case particles in Japanese. It is built by transforming the syntactic structures of sentence pairs in JSICK, so that we can analyze whether models attend to word order and case particles when predicting entailment labels and similarity scores. The JSICK test set contains 1666, 797, and 1006 sentence pairs (A, B) whose premise sentences A (the column `sentence_A_Ja_origin`) include the basic word order involving ga-o (nominative-accusative), ga-ni (nominative-dative), and ga-de (nominative-instrumental/locative) relations, respectively. We provide the JSICK-stress test set by transforming the syntactic structures of these pairs in the following three ways:
- `scrum_ga_o`: a scrambled pair, where the word order of the premise sentence A is scrambled into o-ga, ni-ga, and de-ga order, respectively.
- `ex_ga_o`: a rephrased pair, where only the case particles (ga, o, ni, de) in the premise A are swapped.
- `del_ga_o`: a rephrased pair, where only the case particles (ga, o, ni) in the premise A are deleted.
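
To make the three transformations concrete, here is a toy sketch that applies them to a single hand-segmented sentence. It is only an illustration of the idea, not the procedure used to build JSICK-stress: the function names, the input representation (a list of noun/particle pairs plus a verb), and the example sentence are all assumptions, and a real pipeline would need a Japanese parser to find the arguments.

```python
def scramble(args, verb):
    """Reverse the order of the case-marked arguments (e.g. ga-o becomes o-ga).

    Each particle stays attached to its own noun.
    """
    return "".join(noun + particle for noun, particle in reversed(args)) + verb


def swap_particles(args, verb):
    """Keep the noun order but exchange the particles of the two arguments."""
    (noun1, particle1), (noun2, particle2) = args
    return noun1 + particle2 + noun2 + particle1 + verb


def delete_particles(args, verb, deletable=("が", "を", "に")):
    """Drop the case particles ga, o, and ni; other particles are kept."""
    return "".join(
        noun + ("" if particle in deletable else particle)
        for noun, particle in args
    ) + verb


# Invented example in basic ga-o order: "A man is playing a guitar."
args = [("男性", "が"), ("ギター", "を")]
verb = "弾いている"

print(scramble(args, verb))          # ギターを男性が弾いている
print(swap_particles(args, verb))    # 男性をギターが弾いている
print(delete_particles(args, verb))  # 男性ギター弾いている
```

Scrambling leaves each noun with its original particle, so the grammatical roles are unchanged, whereas swapping reassigns the particles and therefore changes who does what to whom. This difference is what lets the stress test probe whether a model actually reads the case particles.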
The file `jsick/jsick-all-annotations.tsv` contains the JSICK raw annotations, and the file `jsick-stress/jsick-stress-all-annotations.tsv` contains a subset of the JSICK-stress test set annotated with human judgements.
- Hitomi Yanaka, Koji Mineshima. Compositional Evaluation on Japanese Textual Entailment and Similarity. Transactions of the Association for Computational Linguistics, 2022. (TACL2022) [arXiv]
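- Hitomi Yanaka, Koji Mineshima. JSICK: Construction of a Japanese Compositional Inference and Similarity Dataset. The 35th Annual Conference of the Japanese Society for Artificial Intelligence, 2021. (in Japanese)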
If you use this dataset in any published research, please cite the following:
@article{yanaka-mineshima-2022-compositional,
    title = "Compositional Evaluation on {J}apanese Textual Entailment and Similarity",
    author = "Yanaka, Hitomi and Mineshima, Koji",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "10",
    year = "2022",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2022.tacl-1.73",
    doi = "10.1162/tacl_a_00518",
    pages = "1266--1284",
}
This work is licensed under a Creative Commons Attribution 4.0 International License.