sarulab-speech/jtubespeech


This repository provides 1) a list of YouTube videos with Japanese subtitles (JTubeSpeech), 2) scripts for building such lists for new languages, and 3) tiny lists for other languages.

Description

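Each file data/{lang}/{YYYYMM}.csv lists videos as follows. See step4 for how to download them.

|   | videoid | auto | sub | channelid |
|---|---|---|---|---|
| 0 | 0017RsBbUHk | True | True | UCTW2tw0Mhho72MojB1L48IQ |
| 1 | 00PqfZgiboc | False | True | UCzoghTgl4dvIW9GZF6UC-BA |
| --- | --- | --- | --- | --- |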

  • lang: Language ID (ja [Japanese], en [English], ...)
  • YYYYMM: Year and month when the data was collected
  • videoid: YouTube video ID. Its YouTube page is https://www.youtube.com/watch?v={videoid}.
  • auto: Whether the video has automatic subtitles.
  • sub: Whether the video has manual (i.e., human-generated) subtitles.
  • channelid: YouTube channel ID. Its YouTube page is https://www.youtube.com/channel/{channelid}.
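
The column descriptions above can be turned into a small loader. The following is a minimal sketch, assuming the CSV files are comma-separated with a header row `videoid,auto,sub,channelid` and that `auto`/`sub` are stored as the text `True`/`False`; the on-disk layout is an assumption, and `load_rows` and `video_url` are hypothetical helper names, not part of this repository. The sample rows are the two rows shown in the table.

```python
import csv
import io

# Sample in the ASSUMED on-disk format (comma-separated, header row).
SAMPLE = """videoid,auto,sub,channelid
0017RsBbUHk,True,True,UCTW2tw0Mhho72MojB1L48IQ
00PqfZgiboc,False,True,UCzoghTgl4dvIW9GZF6UC-BA
"""


def load_rows(fileobj):
    """Parse rows, converting the auto/sub columns from text to bool."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["auto"] = row["auto"] == "True"
        row["sub"] = row["sub"] == "True"
        rows.append(row)
    return rows


def video_url(videoid):
    """Build the YouTube page URL for a video ID."""
    return f"https://www.youtube.com/watch?v={videoid}"


rows = load_rows(io.StringIO(SAMPLE))
# Keep videos that have manual subtitles but no automatic ones.
manual_only = [r for r in rows if r["sub"] and not r["auto"]]
for r in manual_only:
    print(video_url(r["videoid"]))
```

To read a real list, pass an open file instead of the in-memory sample, e.g. `load_rows(open("data/ja/202103.csv", newline=""))`.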

Statistics

| lang | filename (data/) | #videos-sub-true | #videos-auto-true |
|---|---|---|---|
| ja | ja/202103.csv | 110,000 (10,000 hours) | 4,960,000 |
| en | en/202108_middle.csv | 739543 | 667555 |
|    | en/202108_tiny.csv | 74227 | 65570 |
| ru | ru/202203_middle.csv | 258222 | 349388 |
|    | ru/202108_tiny.csv | 39890 | 46061 |
| de | de/202203_middle.csv | 194468 | 527993 |
|    | de/202108_tiny.csv | 30727 | 66954 |
| fr | fr/202203_middle.csv | 164261 | 524261 |
|    | fr/202108_tiny.csv | 25371 | 70466 |
| ar | ar/202203_middle.csv | 158568 | 311697 |
|    | ar/202108_tiny.csv | 31993 | 42649 |
| th | th/202203_middle.csv | 154416 | 250417 |
|    | th/202108_tiny.csv | 40886 | 26907 |
| tr | tr/202203_middle.csv | 154213 | 494187 |
|    | tr/202108_tiny.csv | 27317 | 68079 |
| hi | hi/202203_middle.csv | 132175 | 172565 |
|    | hi/202108_tiny.csv | 34034 | 31439 |
| zh | zh/202108_middle.csv | 126271 | 23387 |
|    | zh/202108_tiny.csv | 63126 | 23387 |
| id | id/202203_middle.csv | 105334 | 447836 |
|    | id/202108_tiny.csv | 18086 | 72760 |
| el | el/202203_middle.csv | 96436 | 156445 |
|    | el/202108_tiny.csv | 25947 | 26735 |
| pt | pt/202203_middle.csv | 90600 | 436425 |
|    | pt/202108_tiny.csv | 11692 | 48974 |
| da | da/202203_middle.csv | 86027 | 421190 |
|    | da/202108_tiny.csv | 18779 | 62094 |
| bn | bn/202203_middle.csv | 75371 | 303335 |
|    | bn/202108_tiny.csv | 16315 | 57112 |
| fi | fi/202203_middle.csv | 68571 | 347307 |
|    | fi/202108_tiny.csv | 15561 | 50626 |
| ta | ta/202203_middle.csv | 66923 | 89209 |
|    | ta/202108_tiny.csv | 21860 | 26120 |
| hu | hu/202203_middle.csv | 64792 | 351426 |
|    | hu/202108_tiny.csv | 13154 | 49237 |
| uk | uk/202203_middle.csv | 55098 | 283741 |
|    | uk/202108_tiny.csv | 9103 | 36392 |
| fa | fa/202203_middle.csv | 54165 | 203794 |
|    | fa/202108_tiny.csv | 10482 | 24102 |
| ur | ur/202203_middle.csv | 47426 | 177232 |
|    | ur/202108_tiny.csv | 10917 | 26503 |
| az | az/202203_middle.csv | 42906 | 272895 |
|    | az/202108_tiny.csv | 11188 | 52025 |
| te | te/202203_middle.csv | 41478 | 110521 |
|    | te/202108_tiny.csv | 11929 | 24444 |
| ka | ka/202203_middle.csv | 38199 | 158179 |
|    | ka/202108_tiny.csv | 10395 | 23914 |
| ml | ml/202203_middle.csv | 35477 | 249624 |
|    | ml/202108_tiny.csv | 9080 | 42359 |
| be | be/202203_middle.csv | 33935 | 227854 |
|    | be/202108_tiny.csv | 7622 | 37739 |
| is | is/202203_middle.csv | 32272 | 159506 |
|    | is/202108_tiny.csv | 10632 | 38268 |
| kk | kk/202203_middle.csv | 26021 | 148230 |
|    | kk/202108_tiny.csv | 6917 | 26163 |
| ga | ga/202203_middle.csv | 22177 | 131863 |
|    | ga/202108_tiny.csv | 9058 | 51411 |
| ky | ky/202203_middle.csv | 20583 | 150884 |
|    | ky/202108_tiny.csv | 7241 | 42027 |
| tg | tg/202203_middle.csv | 15451 | 135276 |
|    | tg/202108_tiny.csv | 5491 | 40244 |

Contributors

Scripts for data collection

scripts/*.py are scripts for collecting data from YouTube. Since the scripts are language-independent, users can collect data for any language they choose. youtube-dl and ffmpeg are required.

step1: making search words

The script scripts/make_search_word.py downloads the Wikipedia dump file and extracts words to use for searching videos. {lang} is the language code, e.g., ja (Japanese) or en (English).

$ python scripts/make_search_word.py {lang}

step2: obtaining video IDs

The script scripts/obtain_video_id.py obtains YouTube video IDs by searching for the words. {filename_word_list} is a word list file made in step1. From this step onward, processing takes a long time. It is recommended to split the input files (e.g., {filename_word_list}) and run the script on the parts in parallel.

$ python scripts/obtain_video_id.py {lang} {filename_word_list}
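
The recommended split can be done with a few lines of Python. This is a minimal sketch, assuming the word list holds one search word per line (an assumption about the step1 output format); `split_word_list` and the `.partNN` naming are hypothetical and not part of this repository.

```python
import tempfile
from pathlib import Path


def split_word_list(path, n_parts):
    """Split a one-word-per-line file into n_parts files, round-robin."""
    path = Path(path)
    words = [w for w in path.read_text(encoding="utf-8").splitlines() if w.strip()]
    out_paths = []
    for i in range(n_parts):
        part = words[i::n_parts]  # every n_parts-th word, starting at offset i
        out = path.with_name(f"{path.stem}.part{i:02d}{path.suffix}")
        out.write_text("\n".join(part) + ("\n" if part else ""), encoding="utf-8")
        out_paths.append(out)
    return out_paths


if __name__ == "__main__":
    # Demo on a throwaway file so the sketch runs without step1 output.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "words.txt"
        demo.write_text("a\nb\nc\nd\ne\n", encoding="utf-8")
        for p in split_word_list(demo, 2):
            print(p.name, p.read_text(encoding="utf-8").split())
```

Each resulting part can then be passed as {filename_word_list} to a separate scripts/obtain_video_id.py process.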

step3: checking if subtitles are available

The script scripts/retrieve_subtitle_exists.py checks whether each video has subtitles. {filename_videoid_list} is a video ID list file made in step2. This process produces a CSV file.

$ python scripts/retrieve_subtitle_exists.py {lang} {filename_videoid_list}

step4: downloading videos with manual subtitles

The script scripts/download_video.py downloads audio and manual subtitles. Note that this process requires a very large amount of storage. {filename_subtitle_list} is a subtitle list file made in step3. The audio and subtitles will be saved in video/{lang}/wav16k and video/{lang}/txt, respectively.

$ python scripts/download_video.py {lang} {filename_subtitle_list}

step5 (ASR): alignment and scoring

Subtitles are not always correctly aligned with the audio, and in some cases the subtitles do not match the audio at all. The script scripts/align.py aligns subtitles and audio with CTC segmentation using an ESPnet 2 ASR model:

$ python scripts/align.py {asr_train_config} {asr_model_file} {wavdir} {txtdir} {output_dir}

The result is written to a segments file segments.txt and a log file segments.log in the output directory. Using the segments file, bad utterances or audio files can be filtered out:

min_confidence_score=-0.3
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${output_dir}/segments.txt
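
The same filter can be written in Python when the segments need further processing. This is a minimal sketch, assuming (as the awk command implies) that the confidence score is the fifth whitespace-separated field of each line in segments.txt; `filter_segments` is a hypothetical helper, and unlike awk it skips lines that have fewer than five fields or a non-numeric score.

```python
def filter_segments(lines, min_confidence_score=-0.3):
    """Yield lines whose 5th whitespace-separated field exceeds the threshold."""
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # malformed line: no score field
        try:
            score = float(fields[4])
        except ValueError:
            continue  # score field is not a number
        if score > min_confidence_score:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Made-up example lines; only the position of the score (5th field) matters.
    example = [
        "vid_0000 vid 0.00 1.50 -0.10 first utterance\n",
        "vid_0001 vid 1.50 3.20 -0.75 second utterance\n",
    ]
    for kept in filter_segments(example):
        print(kept)
```

For a real run, iterate over the file instead of the example list, e.g. `filter_segments(open("segments.txt", encoding="utf-8"))`.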

step5 (ASV): speaker variation scoring

There are three types of videos: text-to-speech (TTS) videos, single-speaker (i.e., monologue) videos, and multi-speaker (e.g., dialogue) videos. The script scripts/xxx.py computes speaker-variation scores within a video, which are used to classify videos into these three types.

$ python scripts/xxx.py

Reference

  • coming soon

Link

Update

  • Aug. 2021: first update ({lang}/*_tiny.csv)
  • Jan. 2022: add mid-size data ({lang}/*_middle.csv)
