Speaker Deck

Programming Java by Voice

Breandan Considine
March 10, 2016


Although high-quality, general-purpose dictation remains just out of reach, modern speech recognition is well adapted to small-vocabulary, structured grammars such as programming languages and voice user interfaces (VUIs). By providing alternative input mechanisms to traditional IDEs, we can improve accessibility for visually impaired programmers and free developers from the paradigm of menu- and button-based navigation. In this presentation, we will demonstrate a tool that can navigate code, recognize simple commands, and help you write Java, just by listening to your voice. Written in Java and built on open-source libraries, it shows how you too can integrate speech recognition into an IDE or desktop application of your choice using a few simple recipes. Join us to learn more!


Transcript

  1. Programming Java By Voice Breandan Considine EclipseCon 2016

  2. Automatic speech recognition in 2011

  3. None
  4. Traditional Automatic Speech Recognition (ASR)
    • Requires lots of hand-crafted feature engineering
    • Poor results: >25% WER for HMM architectures
  5. Xuedong Huang, James Baker, and Raj Reddy. A historical perspective of speech recognition. Commun. ACM, 57(1):94–103, January 2014.
  6. What happened?
    • Bigger data
    • Faster hardware
    • Smarter algorithms
  7. State of the art ASR
    • <10% average word error on large datasets
    • Deep neural nets: RNNs, CNNs, RBMs, LSTMs
    • Trained on 1k+ hours of transcribed speech
    • Takes time (days) and energy (kWh) to train
    • Difficult to adapt without prior experience
  8. FOSS Speech Recognition
    • Deep learning libraries
      • C/C++: Caffe, Kaldi
      • Python: Theano, Caffe
      • Lua: Torch
      • Java: dl4j, H2O
    • Open source datasets
      • LibriSpeech – 1000 hours of LibriVox audiobooks
    • Experience is required
  9. Let’s think for a moment…
    • What if speech recognition were perfect?
    • ASR is just a fancy input method
    • How can ASR improve user productivity?
    • What are the user’s expectations?
      • Behavior is predictable and deterministic
      • Control interface is simple and intuitive
      • Recognition is fast and accurate
  10. Online Speech Recognition
    • Google, Nuance, AT&T, WIT.ai/Facebook, IBM Watson
    curl -X POST \
      --header 'Content-Type: audio/x-flac; rate=44100;' \
      --data-binary @speech.flac \
      'https://www.google.com/speech-api/v2/recognize?lang=en-us&key=<KEY>'
  11. Why offline?
    • Latency – many applications need fast local recognition
    • Mobility – users do not always have an internet connection
    • Privacy – data is recorded and analyzed completely offline
    • Flexibility – configurable API, language, vocabulary, grammar
  12. Introduction
    • What techniques do modern ASR systems use?
    • How do I build a speech recognition application?
    • Is speech recognition accessible to developers?
    • What libraries and frameworks exist for speech?
  13. None
  14. Maven Dependencies
    <dependency>
      <groupId>edu.cmu.sphinx</groupId>
      <artifactId>sphinx4-core</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>edu.cmu.sphinx</groupId>
      <artifactId>sphinx4-data</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>
  15. • Recording at 16 kHz, 16-bit depth, mono (single channel)
    • 16,000 samples per second × 2 bytes per sample = 32 KB/s
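The arithmetic above can be sanity-checked with the JDK's javax.sound.sampled API, which Sphinx-4 also uses for audio capture. A minimal sketch (the class name is illustrative):

```java
import javax.sound.sampled.AudioFormat;

public class FormatCheck {
    // Bytes per second for an uncompressed PCM format:
    // sample rate × bytes per sample × channel count.
    static int bytesPerSecond(AudioFormat fmt) {
        return (int) fmt.getSampleRate()
                * (fmt.getSampleSizeInBits() / 8)
                * fmt.getChannels();
    }

    public static void main(String[] args) {
        // 16 kHz, 16-bit, mono, signed PCM, little-endian -- the format
        // the slide describes and the default Sphinx-4 models expect.
        AudioFormat fmt = new AudioFormat(16000f, 16, 1, true, false);
        System.out.println(bytesPerSecond(fmt)); // prints 32000
    }
}
```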
  16. • Typically 13 features per frame (MFCC or PLP)
    • Plus delta and delta-delta features (39 coefficients total)
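The delta and delta-delta features mentioned above are time derivatives of the static coefficients. A toy sketch, assuming a simple halved-difference estimate (real front ends such as Sphinx's use a regression over a wider window):

```java
public class DeltaFeatures {
    // First-order delta: half the difference between the following and
    // preceding frames, with edge frames clamped to themselves.
    static double[][] deltas(double[][] frames) {
        int n = frames.length, d = frames[0].length;
        double[][] out = new double[n][d];
        for (int t = 0; t < n; t++) {
            double[] prev = frames[Math.max(t - 1, 0)];
            double[] next = frames[Math.min(t + 1, n - 1)];
            for (int i = 0; i < d; i++)
                out[t][i] = (next[i] - prev[i]) / 2.0;
        }
        return out;
    }

    public static void main(String[] args) {
        // Three toy "frames" of 13 static coefficients, rising over time.
        double[][] mfcc = new double[3][13];
        for (int t = 0; t < 3; t++)
            for (int i = 0; i < 13; i++)
                mfcc[t][i] = t;

        double[][] d1 = deltas(mfcc); // delta features
        double[][] d2 = deltas(d1);   // delta-delta features
        // Full vector per frame: 13 static + 13 delta + 13 delta-delta.
        System.out.println(13 + d1[0].length + d2[0].length); // prints 39
    }
}
```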
  17. None
  18. None
  19. None
  20. None
  21. None
  22. Step 1. Acoustic Model
    • Acoustic model training is very time-consuming (months)
    • Pretrained models are available for many languages
    config.setAcousticModelPath("resource:<directory>");
  23. Step 2. Phonetic Model
    • Maps words to phoneme sequences
    • Word error rate increases with dictionary size
    • Pronunciation aided by g2p labeling
    • CMU Sphinx has tools to generate dictionaries
    config.setDictionaryPath("resource:<language>.dict");
  24. Step 2. Phonetic Model
    autonomous AO T AA N AH M AH S
    autonomously AO T AA N OW M AH S L IY
    autonomy AO T AA N AH M IY
    autonomy(2) AH T AA N AH M IY
    autopacific AO T OW P AH S IH F IH K
    autopart AO T OW P AA R T
    autoparts AO T OW P AA R T S
    autopilot AO T OW P AY L AH T
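The dictionary entries above are plain "word PHONEME PHONEME …" records, with "(2)" marking an alternate pronunciation. A hypothetical parsing helper (illustrative only, not part of Sphinx) might look like:

```java
import java.util.*;

public class PhoneticDict {
    // Parses one CMUdict-style line ("word PH ON EM ES") into a
    // word -> phoneme-list entry; variant markers like "(2)" are stripped
    // so both pronunciations map to the same headword.
    static Map.Entry<String, List<String>> parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        String word = parts[0].replaceAll("\\(\\d+\\)$", "");
        List<String> phones = Arrays.asList(parts).subList(1, parts.length);
        return Map.entry(word, phones);
    }

    public static void main(String[] args) {
        var e = parseLine("autonomy(2) AH T AA N AH M IY");
        System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```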
  25. 3a. Language Model vs. 3b. Grammar Model
    Language model:
    • Needs ~100k sentences
    • Some tools: Logios (model generation), lmtool (CMU Sphinx), IRSTLM, MITLM
    • Appropriate for transcription, voice typing
    Grammar model:
    • More rigid structure
    • Suitable for commands
    • Much smaller state space
    • Competitive with DNN accuracy for small vocabularies
    • Easy to configure for UX
  26. Step 3a. Language Model
    <s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times </s>
    <s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>
    <s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze </s>
    <s> cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be light and patchy but heavier rain may develop in the west later </s>
  27. None
  28. None
  29. Step 3b. Grammar Model
    • JSpeech Grammar Format
    <size> = /10/ small | /2/ medium | /1/ large;
    <color> = /0.5/ red | /0.1/ blue | /0.2/ green;
    <action> = please (/20/ save files | /1/ delete files);
    <place> = /20/ <city> | /5/ <country>;
    public <command> = <size> | <color> | <action> | <place>;
    config.setGrammarPath("resource:<grammar>.gram");
  30. Step 3b: Grammar Format
    public <number> = <hundreds> | <tens> | <teens> | <ones>;
    <hundreds> = <ones> hundred [<tens> | <teens> | <ones>];
    <tens> = (twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety) [<ones>];
    <teens> = ten | eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen;
    <ones> = one | two | three | four | five | six | seven | eight | nine;
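A recognizer matching the <number> rule returns words, not digits, so a post-processing step is still needed. A hypothetical sketch (the parse helper is illustrative, not part of Sphinx):

```java
import java.util.*;

public class NumberWords {
    static final Map<String, Integer> VALUES = new HashMap<>();
    static {
        String[] ones = {"one","two","three","four","five","six","seven","eight","nine"};
        for (int i = 0; i < ones.length; i++) VALUES.put(ones[i], i + 1);
        String[] teens = {"ten","eleven","twelve","thirteen","fourteen",
                          "fifteen","sixteen","seventeen","eighteen","nineteen"};
        for (int i = 0; i < teens.length; i++) VALUES.put(teens[i], i + 10);
        String[] tens = {"twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety"};
        for (int i = 0; i < tens.length; i++) VALUES.put(tens[i], (i + 2) * 10);
    }

    // Converts a phrase matched by the <number> rule to an int.
    // Assumes the input conforms to the grammar; unknown words throw.
    static int parse(String phrase) {
        int total = 0;
        for (String w : phrase.toLowerCase().split("\\s+")) {
            if (w.equals("hundred")) total *= 100;
            else total += VALUES.get(w);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(parse("one hundred twenty three")); // prints 123
        System.out.println(parse("seventeen"));                // prints 17
    }
}
```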
  31. Configuring Sphinx-4
    Configuration config = new Configuration();
    config.setAcousticModelPath(AM_PATH);
    config.setDictionaryPath(DICT_PATH);
    config.setLanguageModelPath(LM_PATH);
    config.setGrammarPath(GRAMMAR_PATH);
    // config.setSampleRate(8000);
  32. Live Speech Recognizer
    LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
    recognizer.startRecognition(true);
    …
    recognizer.stopRecognition();
  33. Live Speech Recognizer
    while (…) {
      // This blocks on a recognition result
      SpeechResult sr = recognizer.getResult();
      String h = sr.getHypothesis();
      Collection<String> hs = sr.getNbest(3);
      …
    }
  34. Stream Speech Recognizer
    StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
    recognizer.startRecognition(new FileInputStream("speech.wav"));
    SpeechResult result = recognizer.getResult();
    recognizer.stopRecognition();
  35. Improving recognition accuracy
    • Using context-dependent cues
    • Structuring commands to reduce phonetic similarity
    • Disabling the microphone
    • Grammar swapping
  36. None
  37. public <refactor_m> = override methods | implement methods | delegate methods | generate | surround with | unwrap | comment | …
  38. Grammar Swapping
    public static void swapGrammar(String newGram)
        throws PropertyException, InstantiationException, IOException {
      Linguist linguist = (Linguist) cm.lookup("flatLinguist");
      linguist.deallocate();
      cm.setProperty("jsgfGrammar", "oldGram", newGram);
      linguist.allocate();
    }
  39. Step 4. Audio User Interface
    • Important mechanism for accessibility
    • Communication via text-to-speech and audio feedback
    • Short cues prompt an action and announce a result
    • Provides a familiar feedback mechanism for users
    • Playing audio usually blocks speech recognition
  40. MaryTTS: Initializing
    maryTTS = new LocalMaryInterface();
    Locale systemLocale = Locale.getDefault();
    if (maryTTS.getAvailableLocales().contains(systemLocale)) {
      Voice voice = Voice.getDefaultVoice(systemLocale);
      maryTTS.setLocale(voice.getLocale());
      maryTTS.setVoice(voice.getName());
    }
  41. MaryTTS: Generating Speech
    try {
      AudioInputStream audio = maryTTS.generateAudio(text);
      AudioPlayer player = new AudioPlayer(audio);
      player.start();
      player.join();
    } catch (SynthesisException | InterruptedException e) { … }
  42. Resources
    • CMUSphinx, http://cmusphinx.sourceforge.net/wiki/
    • MaryTTS, http://mary.dfki.de/
    • JSpeech Grammar Format, http://www.w3.org/TR/jsgf/
    • LibriSpeech ASR Corpus, http://www.openslr.org/12/
    • ARPA format for N-gram backoff (Doug Paul), http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
    • LM Tool, http://www.speech.cs.cmu.edu/tools/lmtool.html
  43. Further Research
    • Accurate and Compact Large Vocabulary Speech Recognition on Mobile Devices, research.google.com/pubs/archive/41176.pdf
    • Comparing Open-Source Speech Recognition Toolkits, http://suendermann.com/su/pdf/oasis2014.pdf
    • Tuning Sphinx to Outperform Google's Speech Recognition API, http://suendermann.com/su/pdf/essv2014.pdf
    • Deep Neural Networks for Acoustic Modeling in Speech Recognition, research.google.com/pubs/archive/38131.pdf
    • Deep Speech: Scaling up end-to-end speech recognition, http://arxiv.org/pdf/1412.5567v2.pdf
  44. Online Resources
    • WER progress: https://github.com/syhw/wer_are_we
    • Kaldi Speech Recognition Library, http://kaldi-asr.org/doc/
    • J.A.R.V.I.S (API), https://github.com/lkuza2/java-speech-api
    • OpenEars, http://www.politepix.com/openears
    • PocketSphinx, https://github.com/cmusphinx/pocketsphinx
    • AT&T Speech, http://developer.att.com/apis/speech/docs
    • Chromium API keys, https://www.chromium.org/developers/how-tos/api-keys
  45. Special Thanks
    https://github.com/breandan/idear
    • Alexey Kudinkin (@alexeykudinkin)
    • Yaroslav Lepenkin (@lepenkinya)
    • CMU Sphinx (@cmuspeechgroup)
  46. Evaluate the Sessions: sign in and vote at eclipsecon.org (−1 / +1 / 0)
