The Business Scene Dialogue corpus

tsuruoka-lab/BSD

©2020, The University of Tokyo

Updates

November 10, 2021: Further fixes to the speaker information.
November 2, 2021: The data were updated to fix incorrect speaker information and some misspellings in the conversation text.

Corpus Description

The Business Scene Dialogue (BSD) corpus is a Japanese-English business conversation corpus. It was constructed in three steps: 1) selecting business scenes, 2) writing monolingual conversation scenarios according to the selected scenes, and 3) translating the scenarios into the other language. Half of the monolingual scenarios were written in Japanese and the other half were written in English. To ensure that the conversations are natural, the whole construction process was supervised by a person who satisfies the following conditions:

  • has the experience of being engaged in language learning programs, especially for business conversations
  • is able to smoothly communicate with others in various business scenes both in Japanese and English
  • has the experience of being involved in business

We provide balanced training, development and evaluation splits from the BSD corpus. The documents in these sets are balanced in terms of scenes and original languages. This repository publicly shares the full development and evaluation sets and a part of the training set.

|           | Training | Development | Evaluation |
|-----------|----------|-------------|------------|
| Sentences | 20,000   | 2,051       | 2,120      |
| Scenarios | 670      | 69          | 69         |

Corpus Statistics

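| Data Set    | Scene            | JA-EN Scenarios | JA-EN Sentences | EN-JA Scenarios | EN-JA Sentences |
|-------------|------------------|-----------------|-----------------|-----------------|-----------------|
| Training    | Face-to-face     | 122             | 3525            | 103             | 2986            |
|             | Phone call       | 68              | 1944            | 75              | 2175            |
|             | General chatting | 61              | 1915            | 72              | 1883            |
|             | Meeting          | 56              | 1964            | 58              | 1787            |
|             | Training         | 12              | 562             | 19              | 463             |
|             | Presentation     | 6               | 607             | 18              | 189             |
|             | Total            | 325             | 10,000          | 345             | 10,000          |
| Development | Face-to-face     | 11              | 319             | 12              | 314             |
|             | Phone call       | 6               | 176             | 7               | 185             |
|             | General chatting | 7               | 223             | 8               | 248             |
|             | Meeting          | 7               | 240             | 7               | 219             |
|             | Training         | 1               | 40              | 1               | 23              |
|             | Presentation     | 1               | 31              | 1               | 33              |
|             | Total            | 34              | 997             | 35              | 1054            |
| Evaluation  | Face-to-face     | 12              | 381             | 11              | 345             |
|             | Phone call       | 6               | 163             | 7               | 212             |
|             | General chatting | 7               | 211             | 8               | 212             |
|             | Meeting          | 7               | 228             | 7               | 229             |
|             | Training         | 1               | 38              | 1               | 30              |
|             | Presentation     | 1               | 31              | 1               | 40              |
|             | Total            | 34              | 1052            | 35              | 1068            |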
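Counts of this kind can be recomputed from the data itself. The following is a minimal sketch, not the authors' script: it assumes each document carries the tag, original_language and conversation fields described under Corpus Structure, and that the JA-EN and EN-JA columns correspond to original_language values "ja" and "en". The function name scene_statistics and the toy documents are hypothetical and only illustrate the grouping; the tag value "meeting" is likewise assumed.

```python
from collections import defaultdict


def scene_statistics(documents):
    """Count scenarios and sentences per (original_language, tag) pair."""
    stats = defaultdict(lambda: {"scenarios": 0, "sentences": 0})
    for doc in documents:
        key = (doc["original_language"], doc["tag"])
        stats[key]["scenarios"] += 1                          # one document = one scenario
        stats[key]["sentences"] += len(doc["conversation"])   # one entry = one sentence pair
    return dict(stats)


# Toy documents (NOT real corpus data); only the fields used above are filled in.
toy_documents = [
    {"tag": "training", "original_language": "en",
     "conversation": [{"no": 1}, {"no": 2}]},
    {"tag": "training", "original_language": "en",
     "conversation": [{"no": 1}]},
    {"tag": "meeting", "original_language": "ja",
     "conversation": [{"no": 1}]},
]

stats = scene_statistics(toy_documents)
for (language, tag), counts in sorted(stats.items()):
    print(language, tag, counts["scenarios"], counts["sentences"])
```

Running the same function on a parsed split would give one row per scene and original language, which can then be compared against the table.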

Corpus Structure

The corpus is stored in JSON format as a list of documents, each of which contains a list of sentence pairs (conversation). Each document has an ID (id), the scene of the scenario (tag), the title of the scenario (title), and the original language (original_language). Each sentence pair has a sentence number, the speaker name in English and Japanese, and the text in English and Japanese.

    [
        {
            "id": "190315_E001_17",
            "tag": "training",
            "title": "Training: How to do research",
            "original_language": "en",
            "conversation": [
                {
                    "no": 1,
                    "en_speaker": "Mr. Ben Sherman",
                    "ja_speaker": "ベン シャーマンさん",
                    "en_sentence": "I will be teaching you how to conduct research today.",
                    "ja_sentence": "今日は調査の進め方についてトレーニングします。"
                },
                ...
            ]
        },
        ...
    ]
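As an illustration of how this structure can be read, here is a minimal Python sketch that flattens documents into sentence pairs for translation experiments. It is an assumption-laden example rather than an official loader: the helper names load_corpus and iter_sentence_pairs are hypothetical, the output keys are chosen for illustration, and the inline sample contains only the single sentence pair shown above so that the snippet runs without any data file.

```python
import json

# One-document sample following the structure shown above (conversation shortened).
SAMPLE = """
[
  {
    "id": "190315_E001_17",
    "tag": "training",
    "title": "Training: How to do research",
    "original_language": "en",
    "conversation": [
      {
        "no": 1,
        "en_speaker": "Mr. Ben Sherman",
        "ja_speaker": "ベン シャーマンさん",
        "en_sentence": "I will be teaching you how to conduct research today.",
        "ja_sentence": "今日は調査の進め方についてトレーニングします。"
      }
    ]
  }
]
"""


def load_corpus(path):
    """Load one split from a JSON file (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_sentence_pairs(documents):
    """Yield one flat record per sentence pair, keeping document-level metadata."""
    for doc in documents:
        for turn in doc["conversation"]:
            yield {
                "doc_id": doc["id"],
                "scene": doc["tag"],
                "original_language": doc["original_language"],
                "no": turn["no"],
                "en": turn["en_sentence"],
                "ja": turn["ja_sentence"],
            }


documents = json.loads(SAMPLE)
pairs = list(iter_sentence_pairs(documents))
print(pairs[0]["en"])
print(pairs[0]["ja"])
```

To work with a real split, replace json.loads(SAMPLE) with load_corpus applied to the path of the JSON file in your checkout. Keeping doc_id and no on every record makes it possible to rebuild the conversation order later, which matters for context-aware translation.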

License

Our dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license.

Reference

If you use this dataset, please cite the following paper:

Matīss Rikters, Ryokan Ri, Tong Li, and Toshiaki Nakazawa (2019). "Designing the Business Conversation Corpus." In Proceedings of the 6th Workshop on Asian Translation, 2019.

    @inproceedings{rikters-etal-2019-designing,
        title = "Designing the Business Conversation Corpus",
        author = "Rikters, Mat{\=\i}ss  and
          Ri, Ryokan  and
          Li, Tong  and
          Nakazawa, Toshiaki",
        booktitle = "Proceedings of the 6th Workshop on Asian Translation",
        month = nov,
        year = "2019",
        address = "Hong Kong, China",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/D19-5204",
        doi = "10.18653/v1/D19-5204",
        pages = "54--61"
    }

Acknowledgements

This work was supported by "Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation", the Commissioned Research of National Institute of Information and Communications Technology (NICT), JAPAN.
