Speaker Deck

Programming Java by Voice

Breandan Considine
March 10, 2016


Although high-quality, general-purpose dictation remains just out of reach, modern speech recognition is well adapted to small-vocabulary, structured grammars such as programming languages and voice user interfaces (VUIs). By providing alternative input mechanisms to traditional IDEs, we can improve accessibility for visually impaired programmers and free developers from the paradigm of menu- and button-based navigation. In this presentation, we will demonstrate a tool that can navigate code, recognize simple commands, and help you write Java, just by listening to your voice. Written in Java and built on open-source libraries, it shows how you too can integrate speech recognition into an IDE or desktop application of your choice using a few simple recipes. Join us to learn more!


Transcript

  1. Programming Java By Voice Breandan Considine EclipseCon 2016

  2. Automatic speech recognition in 2011

  3. None
  4. Traditional Automatic Speech Recognition (ASR)
    • Requires lots of hand-crafted feature engineering
    • Poor results: >25% WER for HMM architectures
  5. Xuedong Huang, James Baker, and Raj Reddy. A historical perspective of speech recognition. Commun. ACM, 57(1):94–103, January 2014.
  6. What happened?
    • Bigger data
    • Faster hardware
    • Smarter algorithms
  7. State of the art ASR
    • <10% average word error on large datasets
    • Deep neural nets: RNNs, CNNs, RBMs, LSTMs
    • Trained on 1k+ hours of transcribed speech
    • Takes time (days) and energy (kWh) to train
    • Difficult to adapt without prior experience
  8. FOSS Speech Recognition
    • Deep learning libraries
      • C/C++: Caffe, Kaldi
      • Python: Theano, Caffe
      • Lua: Torch
      • Java: dl4j, H2O
    • Open source datasets
      • LibriSpeech – 1000 hours of LibriVox audiobooks
    • Experience is required
  9. Let’s think for a moment…
    • What if speech recognition were perfect?
    • ASR is just a fancy input method
    • How can ASR improve user productivity?
    • What are the user’s expectations?
      • Behavior is predictable and deterministic
      • Control interface is simple and intuitive
      • Recognition is fast and accurate
  10. Online Speech Recognition
    • Google, Nuance, AT&T, WIT.ai/Facebook, IBM Watson
    curl -X POST \
      --header 'Content-Type: audio/x-flac; rate=44100;' \
      --data-binary @speech.flac \
      'https://www.google.com/speech-api/v2/recognize?lang=en-us&key=<KEY>'
  11. Why offline?
    • Latency – many applications need fast local recognition
    • Mobility – users do not always have an internet connection
    • Privacy – data is recorded and analyzed completely offline
    • Flexibility – configurable API, language, vocabulary, grammar
  12. Introduction
    • What techniques do modern ASR systems use?
    • How do I build a speech recognition application?
    • Is speech recognition accessible to developers?
    • What libraries and frameworks exist for speech?
  13. None
  14. Maven Dependencies
    <dependency>
      <groupId>edu.cmu.sphinx</groupId>
      <artifactId>sphinx4-core</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>edu.cmu.sphinx</groupId>
      <artifactId>sphinx4-data</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>
  15. • Recording at 16 kHz, 16-bit depth, mono (single channel)
    • 16,000 samples per second × 2 bytes per sample = 32 KB/s
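The arithmetic above can be sanity-checked with the JDK's javax.sound.sampled API, which Sphinx-4 also uses for audio capture. A minimal sketch (the class name is illustrative):

```java
import javax.sound.sampled.AudioFormat;

public class FormatCheck {
    // Bytes per second for an uncompressed PCM format:
    // sample rate × bytes per sample × channel count.
    static int bytesPerSecond(AudioFormat fmt) {
        return (int) fmt.getSampleRate()
                * (fmt.getSampleSizeInBits() / 8)
                * fmt.getChannels();
    }

    public static void main(String[] args) {
        // 16 kHz, 16-bit, mono, signed PCM, little-endian -- the format
        // the slide describes and the default Sphinx-4 models expect.
        AudioFormat fmt = new AudioFormat(16000f, 16, 1, true, false);
        System.out.println(bytesPerSecond(fmt)); // prints 32000
    }
}
```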
  16. • Typically 13 features per frame (MFCC or PLP)
    • Plus delta and delta-delta features (39 coefficients total)
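The delta and delta-delta features mentioned above are time derivatives of the static coefficients. A toy sketch, assuming a simple halved-difference estimate (real front ends such as Sphinx's use a regression over a wider window):

```java
public class DeltaFeatures {
    // First-order delta: half the difference between the following and
    // preceding frames, with edge frames clamped to themselves.
    static double[][] deltas(double[][] frames) {
        int n = frames.length, d = frames[0].length;
        double[][] out = new double[n][d];
        for (int t = 0; t < n; t++) {
            double[] prev = frames[Math.max(t - 1, 0)];
            double[] next = frames[Math.min(t + 1, n - 1)];
            for (int i = 0; i < d; i++)
                out[t][i] = (next[i] - prev[i]) / 2.0;
        }
        return out;
    }

    public static void main(String[] args) {
        // Three toy "frames" of 13 static coefficients, rising over time.
        double[][] mfcc = new double[3][13];
        for (int t = 0; t < 3; t++)
            for (int i = 0; i < 13; i++)
                mfcc[t][i] = t;

        double[][] d1 = deltas(mfcc); // delta features
        double[][] d2 = deltas(d1);   // delta-delta features
        // Full vector per frame: 13 static + 13 delta + 13 delta-delta.
        System.out.println(13 + d1[0].length + d2[0].length); // prints 39
    }
}
```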
  17. None
  18. None
  19. None
  20. None
  21. None
  22. Step 1. Acoustic Model
    • Acoustic model training is very time-consuming (months)
    • Pretrained models are available for many languages
    config.setAcousticModelPath("resource:<directory>");
  23. Step 2. Phonetic Model
    • Maps words to phoneme sequences
    • Word error rate increases with dictionary size
    • Pronunciation aided by g2p labeling
    • CMU Sphinx has tools to generate dictionaries
    config.setDictionaryPath("resource:<language>.dict");
  24. Step 2. Phonetic Model
    autonomous AO T AA N AH M AH S
    autonomously AO T AA N OW M AH S L IY
    autonomy AO T AA N AH M IY
    autonomy(2) AH T AA N AH M IY
    autopacific AO T OW P AH S IH F IH K
    autopart AO T OW P AA R T
    autoparts AO T OW P AA R T S
    autopilot AO T OW P AY L AH T
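The dictionary entries above are plain "word PHONEME PHONEME …" records, with "(2)" marking an alternate pronunciation. A hypothetical parsing helper (illustrative only, not part of Sphinx) might look like:

```java
import java.util.*;

public class PhoneticDict {
    // Parses one CMUdict-style line ("word PH ON EM ES") into a
    // word -> phoneme-list entry; variant markers like "(2)" are stripped
    // so both pronunciations map to the same headword.
    static Map.Entry<String, List<String>> parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        String word = parts[0].replaceAll("\\(\\d+\\)$", "");
        List<String> phones = Arrays.asList(parts).subList(1, parts.length);
        return Map.entry(word, phones);
    }

    public static void main(String[] args) {
        var e = parseLine("autonomy(2) AH T AA N AH M IY");
        System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```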
  25. 3a. Language Model vs. 3b. Grammar Model
    Language model:
    • Needs ~100k sentences
    • Some tools: Logios (model generation), lmtool (CMU Sphinx), IRSTLM, MITLM
    • Appropriate for transcription, voice typing
    Grammar model:
    • More rigid structure
    • Suitable for commands
    • Much smaller state space
    • Competitive with DNN accuracy for small vocabularies
    • Easy to configure for UX
  26. Step 3a. Language Model
    <s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times </s>
    <s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>
    <s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze </s>
    <s> cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be light and patchy but heavier rain may develop in the west later </s>
  27. None
  28. None
  29. Step 3b. Grammar Model
    • JSpeech Grammar Format
    <size> = /10/ small | /2/ medium | /1/ large;
    <color> = /0.5/ red | /0.1/ blue | /0.2/ green;
    <action> = please (/20/ save files | /1/ delete files);
    <place> = /20/ <city> | /5/ <country>;
    public <command> = <size> | <color> | <action> | <place>;
    config.setGrammarPath("resource:<grammar>.gram");
  30. Step 3b: Grammar Format
    public <number> = <hundreds> | <tens> | <teens> | <ones>;
    <hundreds> = <ones> hundred [<tens> | <teens> | <ones>];
    <tens> = (twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety) [<ones>];
    <teens> = ten | eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen;
    <ones> = one | two | three | four | five | six | seven | eight | nine;
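A recognizer matching the <number> rule returns words, not digits, so a post-processing step is still needed. A hypothetical sketch (the parse helper is illustrative, not part of Sphinx):

```java
import java.util.*;

public class NumberWords {
    static final Map<String, Integer> VALUES = new HashMap<>();
    static {
        String[] ones = {"one","two","three","four","five","six","seven","eight","nine"};
        for (int i = 0; i < ones.length; i++) VALUES.put(ones[i], i + 1);
        String[] teens = {"ten","eleven","twelve","thirteen","fourteen",
                          "fifteen","sixteen","seventeen","eighteen","nineteen"};
        for (int i = 0; i < teens.length; i++) VALUES.put(teens[i], i + 10);
        String[] tens = {"twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety"};
        for (int i = 0; i < tens.length; i++) VALUES.put(tens[i], (i + 2) * 10);
    }

    // Converts a phrase matched by the <number> rule to an int.
    // Assumes the input conforms to the grammar; unknown words throw.
    static int parse(String phrase) {
        int total = 0;
        for (String w : phrase.toLowerCase().split("\\s+")) {
            if (w.equals("hundred")) total *= 100;
            else total += VALUES.get(w);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(parse("one hundred twenty three")); // prints 123
        System.out.println(parse("seventeen"));                // prints 17
    }
}
```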
  31. Configuring Sphinx-4
    Configuration config = new Configuration();
    config.setAcousticModelPath(AM_PATH);
    config.setDictionaryPath(DICT_PATH);
    config.setLanguageModelPath(LM_PATH);
    config.setGrammarPath(GRAMMAR_PATH);
    // config.setSampleRate(8000);
  32. Live Speech Recognizer
    LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
    recognizer.startRecognition(true);
    …
    recognizer.stopRecognition();
  33. Live Speech Recognizer
    while (…) {
      // This blocks on a recognition result
      SpeechResult sr = recognizer.getResult();
      String h = sr.getHypothesis();
      Collection<String> hs = sr.getNbest(3);
      …
    }
  34. Stream Speech Recognizer
    StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
    recognizer.startRecognition(new FileInputStream("speech.wav"));
    SpeechResult result = recognizer.getResult();
    recognizer.stopRecognition();
  35. Improving recognition accuracy
    • Using context-dependent cues
    • Structuring commands to reduce phonetic similarity
    • Disabling the microphone
    • Grammar swapping
  36. None
  37. public <refactor_m> = override methods | implement methods | delegate methods | generate | surround with | unwrap | comment | …
  38. Grammar Swapping
    public static void swapGrammar(String newGram)
        throws PropertyException, InstantiationException, IOException {
      Linguist linguist = (Linguist) cm.lookup("flatLinguist");
      linguist.deallocate();
      cm.setProperty("jsgfGrammar", "oldGram", newGram);
      linguist.allocate();
    }
  39. Step 4. Audio User Interface
    • Important mechanism for accessibility
    • Communication via text-to-speech and audio feedback
    • Short cues prompt an action and announce a result
    • Provides a familiar feedback mechanism for users
    • Playing audio usually blocks speech recognition
  40. MaryTTS: Initializing
    maryTTS = new LocalMaryInterface();
    Locale systemLocale = Locale.getDefault();
    if (maryTTS.getAvailableLocales().contains(systemLocale)) {
      Voice voice = Voice.getDefaultVoice(systemLocale);
      maryTTS.setLocale(voice.getLocale());
      maryTTS.setVoice(voice.getName());
    }
  41. MaryTTS: Generating Speech
    try {
      AudioInputStream audio = maryTTS.generateAudio(text);
      AudioPlayer player = new AudioPlayer(audio);
      player.start();
      player.join();
    } catch (SynthesisException | InterruptedException e) { … }
  42. Resources
    • CMUSphinx, http://cmusphinx.sourceforge.net/wiki/
    • MaryTTS, http://mary.dfki.de/
    • JSpeech Grammar Format, http://www.w3.org/TR/jsgf/
    • LibriSpeech ASR Corpus, http://www.openslr.org/12/
    • ARPA format for N-gram backoff (Doug Paul), http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
    • LM Tool, http://www.speech.cs.cmu.edu/tools/lmtool.html
  43. Further Research
    • Accurate and Compact Large Vocabulary Speech Recognition on Mobile Devices, research.google.com/pubs/archive/41176.pdf
    • Comparing Open-Source Speech Recognition Toolkits, http://suendermann.com/su/pdf/oasis2014.pdf
    • Tuning Sphinx to Outperform Google's Speech Recognition API, http://suendermann.com/su/pdf/essv2014.pdf
    • Deep Neural Networks for Acoustic Modeling in Speech Recognition, research.google.com/pubs/archive/38131.pdf
    • Deep Speech: Scaling up end-to-end speech recognition, http://arxiv.org/pdf/1412.5567v2.pdf
  44. Online Resources
    • WER progress: https://github.com/syhw/wer_are_we
    • Kaldi Speech Recognition Library, http://kaldi-asr.org/doc/
    • J.A.R.V.I.S (API), https://github.com/lkuza2/java-speech-api
    • OpenEars, http://www.politepix.com/openears
    • PocketSphinx, https://github.com/cmusphinx/pocketsphinx
    • AT&T Speech, http://developer.att.com/apis/speech/docs
    • Chromium API keys, https://www.chromium.org/developers/how-tos/api-keys
  45. Special Thanks
    https://github.com/breandan/idear
    • Alexey Kudinkin (@alexeykudinkin)
    • Yaroslav Lepenkin (@lepenkinya)
    • CMU Sphinx (@cmuspeechgroup)
  46. Evaluate the Sessions: sign in and vote at eclipsecon.org (−1 / +1 / 0)
