sarulab-speech/jtubespeech
This repository provides 1) a list of YouTube videos with Japanese subtitles (JTubeSpeech), 2) scripts for making new lists of new languages, and 3) tiny lists for other languages.
`data/{lang}/{YYYYMM}.csv` lists the videos as follows. See step 4 for how to download them.
| | videoid | auto | sub | channelid |
|---|---|---|---|---|
| 0 | 0017RsBbUHk | True | True | UCTW2tw0Mhho72MojB1L48IQ |
| 1 | 00PqfZgiboc | False | True | UCzoghTgl4dvIW9GZF6UC-BA |
| ... | ... | ... | ... | ... |
- `lang`: Language ID (ja [Japanese], en [English], ...)
- `YYYYMM`: Year and month when the data was collected
- `videoid`: YouTube video ID. Its YouTube page is `https://www.youtube.com/watch?v={videoid}`.
- `auto`: Whether the video has an automatic subtitle.
- `sub`: Whether the video has a manual (i.e., human-generated) subtitle.
- `channelid`: YouTube Channel ID. Its YouTube page is `https://www.youtube.com/channel/{channelid}`.
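The columns described above can be read with the Python standard library. The following is a minimal sketch, assuming the files are comma-separated with a header row and an unnamed index column (the on-disk layout is not stated here); `load_rows` and `video_url` are hypothetical helper names, and the two sample rows are the ones shown in the table above.

```python
import csv
import io

# Two sample rows copied from the table above. For a real file, replace the
# StringIO object with open("data/ja/202103.csv", newline="").
# Assumption: comma-separated, header row, unnamed first (index) column.
SAMPLE = """,videoid,auto,sub,channelid
0,0017RsBbUHk,True,True,UCTW2tw0Mhho72MojB1L48IQ
1,00PqfZgiboc,False,True,UCzoghTgl4dvIW9GZF6UC-BA
"""


def load_rows(fileobj):
    """Yield one dict per video, converting the True/False flags to bool."""
    for row in csv.DictReader(fileobj):
        yield {
            "videoid": row["videoid"],
            "auto": row["auto"] == "True",
            "sub": row["sub"] == "True",
            "channelid": row["channelid"],
        }


def video_url(videoid):
    """Build the YouTube page URL for a video ID."""
    return f"https://www.youtube.com/watch?v={videoid}"


rows = list(load_rows(io.StringIO(SAMPLE)))

# Videos that have a manual subtitle but no automatic one.
manual_only = [r for r in rows if r["sub"] and not r["auto"]]
for r in manual_only:
    print(video_url(r["videoid"]))
```

Filtering on `sub` alone would select every video with a human-generated subtitle, which is what step 4 downloads.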
| lang | filename (data/) | #videos-sub-true | #videos-auto-true |
|---|---|---|---|
| ja | ja/202103.csv | 110,000 (10,000 hours) | 4,960,000 |
| en | en/202108_middle.csv | 739543 | 667555 |
| | en/202108_tiny.csv | 74227 | 65570 |
| ru | ru/202203_middle.csv | 258222 | 349388 |
| | ru/202108_tiny.csv | 39890 | 46061 |
| de | de/202203_middle.csv | 194468 | 527993 |
| | de/202108_tiny.csv | 30727 | 66954 |
| fr | fr/202203_middle.csv | 164261 | 524261 |
| | fr/202108_tiny.csv | 25371 | 70466 |
| ar | ar/202203_middle.csv | 158568 | 311697 |
| | ar/202108_tiny.csv | 31993 | 42649 |
| th | th/202203_middle.csv | 154416 | 250417 |
| | th/202108_tiny.csv | 40886 | 26907 |
| tr | tr/202203_middle.csv | 154213 | 494187 |
| | tr/202108_tiny.csv | 27317 | 68079 |
| hi | hi/202203_middle.csv | 132175 | 172565 |
| | hi/202108_tiny.csv | 34034 | 31439 |
| zh | zh/202108_middle.csv | 126271 | 23387 |
| | zh/202108_tiny.csv | 63126 | 23387 |
| id | id/202203_middle.csv | 105334 | 447836 |
| | id/202108_tiny.csv | 18086 | 72760 |
| el | el/202203_middle.csv | 96436 | 156445 |
| | el/202108_tiny.csv | 25947 | 26735 |
| pt | pt/202203_middle.csv | 90600 | 436425 |
| | pt/202108_tiny.csv | 11692 | 48974 |
| da | da/202203_middle.csv | 86027 | 421190 |
| | da/202108_tiny.csv | 18779 | 62094 |
| bn | bn/202203_middle.csv | 75371 | 303335 |
| | bn/202108_tiny.csv | 16315 | 57112 |
| fi | fi/202203_middle.csv | 68571 | 347307 |
| | fi/202108_tiny.csv | 15561 | 50626 |
| ta | ta/202203_middle.csv | 66923 | 89209 |
| | ta/202108_tiny.csv | 21860 | 26120 |
| hu | hu/202203_middle.csv | 64792 | 351426 |
| | hu/202108_tiny.csv | 13154 | 49237 |
| uk | uk/202203_middle.csv | 55098 | 283741 |
| | uk/202108_tiny.csv | 9103 | 36392 |
| fa | fa/202203_middle.csv | 54165 | 203794 |
| | fa/202108_tiny.csv | 10482 | 24102 |
| ur | ur/202203_middle.csv | 47426 | 177232 |
| | ur/202108_tiny.csv | 10917 | 26503 |
| az | az/202203_middle.csv | 42906 | 272895 |
| | az/202108_tiny.csv | 11188 | 52025 |
| te | te/202203_middle.csv | 41478 | 110521 |
| | te/202108_tiny.csv | 11929 | 24444 |
| ka | ka/202203_middle.csv | 38199 | 158179 |
| | ka/202108_tiny.csv | 10395 | 23914 |
| ml | ml/202203_middle.csv | 35477 | 249624 |
| | ml/202108_tiny.csv | 9080 | 42359 |
| be | be/202203_middle.csv | 33935 | 227854 |
| | be/202108_tiny.csv | 7622 | 37739 |
| is | is/202203_middle.csv | 32272 | 159506 |
| | is/202108_tiny.csv | 10632 | 38268 |
| kk | kk/202203_middle.csv | 26021 | 148230 |
| | kk/202108_tiny.csv | 6917 | 26163 |
| ga | ga/202203_middle.csv | 22177 | 131863 |
| | ga/202108_tiny.csv | 9058 | 51411 |
| ky | ky/202203_middle.csv | 20583 | 150884 |
| | ky/202108_tiny.csv | 7241 | 42027 |
| tg | tg/202203_middle.csv | 15451 | 135276 |
| | tg/202108_tiny.csv | 5491 | 40244 |
- Shinnosuke Takamichi (The University of Tokyo, Japan) [main contributor]
- Ludwig Kürzinger (Technical University of Munich, Germany)
- Takaaki Saeki (The University of Tokyo, Japan)
- Sayaka Shiota (Tokyo Metropolitan University, Japan)
- Shinji Watanabe (Carnegie Mellon University, USA)
`scripts/*.py` are scripts for collecting data from YouTube. Because the scripts are language independent, users can collect data for any language they like. youtube-dl and ffmpeg are required.
Step 1: The script `scripts/make_search_word.py` downloads the Wikipedia dump file and extracts words to use for searching videos. `{lang}` is the language code, e.g., `ja` (Japanese) and `en` (English).
$ python scripts/make_search_word.py {lang}
Step 2: The script `scripts/obtain_video_id.py` obtains YouTube video IDs by searching with the words. `{filename_word_list}` is the word list file made in step 1. From this step onward, processing takes a long time, so it is recommended to split the input files (e.g., `{filename_word_list}`) and run them in parallel.
$ python scripts/obtain_video_id.py {lang} {filename_word_list}
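One way to prepare the split recommended above is to divide the word list into several smaller files. The sketch below assumes the word list is a UTF-8 text file with one word per line, which is not confirmed here; `split_word_list` and the `_partNN` naming are hypothetical and not part of the repository's scripts.

```python
from pathlib import Path
import tempfile


def split_word_list(path, n_parts):
    """Write n_parts files next to `path`, each holding a slice of its lines.

    Assumption: the word list has one search word per line.
    """
    path = Path(path)
    words = path.read_text(encoding="utf-8").splitlines()
    out_paths = []
    for i in range(n_parts):
        part = words[i::n_parts]  # round-robin so parts have similar sizes
        out = path.with_name(f"{path.stem}_part{i:02d}{path.suffix}")
        out.write_text("".join(w + "\n" for w in part), encoding="utf-8")
        out_paths.append(out)
    return out_paths


# Demonstration on a throwaway file with five made-up "words".
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "words.txt"
    src.write_text("a\nb\nc\nd\ne\n", encoding="utf-8")
    parts = split_word_list(src, 2)
    part_names = [p.name for p in parts]
    part_contents = [p.read_text(encoding="utf-8").split() for p in parts]

print(part_names)
print(part_contents)
```

Each resulting file can then be passed as `{filename_word_list}` to a separate `python scripts/obtain_video_id.py {lang} {filename_word_list}` process.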
Step 3: The script `scripts/retrieve_subtitle_exists.py` checks whether each video has subtitles. `{filename_videoid_list}` is the video ID list file made in step 2. This process produces a CSV file.
$ python scripts/retrieve_subtitle_exists.py {lang} {filename_videoid_list}
Step 4: The script `scripts/download_video.py` downloads audio and manual subtitles. Note that this process requires a very large amount of storage. `{filename_subtitle_list}` is the subtitle list file made in step 3. The audio and subtitles are saved in `video/{lang}/wav16k` and `video/{lang}/txt`, respectively.
$ python scripts/download_video.py {lang} {filename_subtitle_list}
Subtitles are not always correctly aligned with the audio, and in some cases the subtitles do not match the audio at all. The script `scripts/align.py` aligns subtitles and audio with CTC segmentation using an ESPnet 2 ASR model:
$ python scripts/align.py {asr_train_config} {asr_model_file} {wavdir} {txtdir} {output_dir}
The result is written to a segments file `segments.txt` and a log file `segments.log` in the output directory. Using the segments file, bad utterances or audio files can be filtered out:
min_confidence_score=-0.3
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${output_dir}/segments.txt
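The same threshold filter can be written in Python when the result is needed inside a larger pipeline. This is a sketch that mirrors the awk command: it keeps lines whose fifth whitespace-separated field is greater than the threshold. Only the position of the score (field 5) comes from the command above; the other fields in the sample lines are made up for illustration, and `filter_segments` is a hypothetical name.

```python
def filter_segments(lines, min_confidence_score=-0.3):
    """Keep lines whose fifth whitespace-separated field exceeds the threshold."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # no score field to compare, so the line is dropped
        if float(fields[4]) > min_confidence_score:
            kept.append(line.rstrip("\n"))
    return kept


# Made-up example lines; only the fifth field (the score) matters here.
sample = [
    "utt_0001 video_a 0.00 2.50 -0.1200 hello world",
    "utt_0002 video_a 2.50 5.10 -1.8700 badly aligned text",
]

good = filter_segments(sample)
print(good)
```

For a real run, the lines would come from `open(f"{output_dir}/segments.txt", encoding="utf-8")` instead of the sample list.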
There are three types of videos: text-to-speech (TTS) videos, single-speaker (i.e., monologue) videos, and multi-speaker (e.g., dialogue) videos. The script `scripts/xxx.py` computes speaker-variation scores within a video to classify videos into these three types.
$ python scripts/xxx.py
- coming soon
- Aug. 2021: first update (`{lang}/*_tiny.csv`)
- Jan. 2022: add mid-size data (`{lang}/*_middle.csv`)