ku-nlp/speechBSDPublic

NotificationsYou must be signed in to change notification settings
Fork0
Star3

An extension of the BSD corpus with audio and speaker attribute information

License

View license

3 stars 0 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
txt		txt
LICENSE.md		LICENSE.md
README.md		README.md

Repository files navigation

SpeechBSD Dataset

This is an extension of theBSD corpus with audio files and speaker attribute information.

Download

To download from this repository:

git clone https://github.com/ku-nlp/speechBSD.gitcd speechBSDwget https://lotus.kuee.kyoto-u.ac.jp/~sshimizu/data/speechBSD_wav_20221026.tar.gztar zxvf speechBSD_wav_20221026.tar.gz

You can also download it viahuggingface.

Statistics

	Train	Dev.	Test
Scenarios	670	69	69
Sentences	20,000	2,051	2,120
En audio (h)	20.1	2.1	2.1
Ja audio (h)	25.3	2.7	2.7
En audio gender (male % / female %)	47.2 / 52.8	50.1 / 49.9	44.4 / 55.6
Ja audio gender (male % / female %)	68.0 / 32.0	62.3 / 37.7	69.0 / 31.0

Structure

wav directory contains wav files (16 kHz, mono channel), which are classified totrain,dev, andtest.
txt directory contains json files split totrain,dev, andtest.
- Each json file is a list of scenarios.
- Each scenario contains:
  - id,tag,title, andoriginal_language, which are identical to the ones of the BSD corpus, andconversation
- conversation is a list of utterances. Each utterance contains:
  - no,ja_speaker,en_speaker,ja_sentence,en_sentence that are identical to the ones of the BSD corpus
  - ja_spkid anden_spkid that show speaker IDs consistent throughout the conversation
  - ja_wav anden_wav that show the wavfile names located in thewav diretory
  - ja_spk_gender anden_spk_gender that show the gender of the speaker of the corresponding wav file
  - ja_spk_prefecture anden_spk_state that show where the speaker come from

[    {        "id": "190315_E001_17",        "tag": "training",        "title": "Training: How to do research",        "original_language": "en",        "conversation": [            {                "no": 1,                "en_speaker": "Mr. Ben Sherman",                "ja_speaker": "ベン シャーマンさん",                "en_sentence": "I will be teaching you how to conduct research today.",                "ja_sentence": "今日は調査の進め方についてトレーニングします。",                "ja_spkid": "190315_E001_17_spk0_ja",                "en_spkid": "190315_E001_17_spk0_en",                "ja_wav": "190315_E001_17_spk0_no1_ja.wav",                "en_wav": "190315_E001_17_spk0_no1_en.wav",                "ja_spk_gender": "M",                "en_spk_gender": "M",                "ja_spk_prefecture": "大阪",                "en_spk_state": "CA"            },

Notes

Gender is one of "M" or "F".
Speakers are different if speaker ID is different.For example, if a conversation is spoken by two speakers taking turns, there would be 4 speakers (2 Japanese speakers and 2 English speakers).However, it's possible that speakers with different speaker ID is actually spoken by the same person because of the way audio is collected.
Gender information of audio does not necessarily match with the one inferrable from text.For example, even if theen_speaker is "Mr. Sam Lee", the audio may contain female voice.This is because no explicit gender information is given in the original BSD corpus.
Japanese speech is collected from Japanese speakers who are from Japan.
- ja_spk_prefecture is one of the 47 prefectures or "不明" (unknown).
  - Prefectures that ends with "県" or "府" does not contain those characters (e.g., "神奈川", "京都").
  - Tokyo is "東京" without "都".
  - Hokkaido is "北海道".
English speech is collected from English speakers who are from the US.
- en_spk is one of the 50 states, written in postal abbreviation.

Citation

If you find the dataset useful, please cite our ACL 2023 Findings paper:Towards Speech Dialogue Translation Mediating Speakers of Different Languages.

@inproceedings{shimizu-etal-2023-towards,    title = "Towards Speech Dialogue Translation Mediating Speakers of Different Languages",    author = "Shimizu, Shuichiro  and      Chu, Chenhui  and      Li, Sheng  and      Kurohashi, Sadao",    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",    month = jul,    year = "2023",    address = "Toronto, Canada",    publisher = "Association for Computational Linguistics",    url = "https://aclanthology.org/2023.findings-acl.72",    pages = "1122--1134",    abstract = "We present a new task, speech dialogue translation mediating speakers of different languages. We construct the SpeechBSD dataset for the task and conduct baseline experiments. Furthermore, we consider context to be an important aspect that needs to be addressed in this task and propose two ways of utilizing context, namely monolingual context and bilingual context. We conduct cascaded speech translation experiments using Whisper and mBART, and show that bilingual context performs better in our settings.",}

License

This dataset is licensed underCC-BY-NC-SA 4.0.

About

An extension of the BSD corpus with audio and speaker attribute information

Releases

No releases published

Packages

No packages published

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

SpeechBSD Dataset

Download

Statistics

Structure

Notes

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Movatterモバイル変換

License

ku-nlp/speechBSD

Folders and files

Latest commit

History

Repository files navigation

SpeechBSD Dataset

Download

Statistics

Structure

Notes

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Packages