A tokenizer, text cleaner, and phonemizer for many human languages.
rhasspy/gruut
A tokenizer, text cleaner, and IPA phonemizer for several human languages that supports SSML.

    from gruut import sentences

    text = 'He wound it around the wound, saying "I read it was $10 to read."'

    for sent in sentences(text, lang="en-us"):
        for word in sent:
            if word.phonemes:
                print(word.text, *word.phonemes)

which outputs:

    He h ˈi
    wound w ˈaʊ n d
    it ˈɪ t
    around ɚ ˈaʊ n d
    the ð ə
    wound w ˈu n d
    , |
    saying s ˈeɪ ɪ ŋ
    I ˈaɪ
    read ɹ ˈɛ d
    it ˈɪ t
    was w ə z
    ten t ˈɛ n
    dollars d ˈɑ l ɚ z
    to t ə
    read ɹ ˈi d
    . ‖

Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.
A subset of SSML is also supported:

    from gruut import sentences

    ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
    <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
        xml:lang="en-US">
    <s>Today at 4pm, 2/1/2000.</s>
    <s xml:lang="it">Un mese fà, 2/1/2000.</s>
    </speak>"""

    for sent in sentences(ssml_text, ssml=True):
        for word in sent:
            if word.phonemes:
                print(sent.idx, word.lang, word.text, *word.phonemes)

with the output:

    0 en-US Today t ə d ˈeɪ
    0 en-US at ˈæ t
    0 en-US four f ˈɔ ɹ
    0 en-US P p ˈi
    0 en-US M ˈɛ m
    0 en-US , |
    0 en-US February f ˈɛ b j u ˌɛ ɹ i
    0 en-US first f ˈɚ s t
    0 en-US , |
    0 en-US two t ˈu
    0 en-US thousand θ ˈaʊ z ə n d
    0 en-US . ‖
    1 it Un u n
    1 it mese ˈm e s e
    1 it fà f a
    1 it , |
    1 it due d j u
    1 it gennaio d͡ʒ e n n ˈa j o
    1 it duemila d u e ˈm i l a
    1 it . ‖

See the documentation for more details.
pip install gruut
Languages besides English can be added during installation. For example, with French and Italian support:
pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]
The extra pip repo is needed for an updated num2words fork that includes support for more languages.
You may also manually download language files and put them in `$XDG_CONFIG_HOME/gruut/` (`$HOME/.config/gruut` by default).
gruut will look for language files in the directory `$XDG_CONFIG_HOME/gruut/<lang>/` if the corresponding Python package is not installed. Note that `<lang>` here is the full language name, e.g. `de-de` instead of just `de`.
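The lookup location described above can be sketched in a few lines of Python. This is only an illustration of the documented path layout, not gruut's actual implementation; the function name `user_lang_dir` and its `env` parameter are made up for this example.

```python
import os
from pathlib import Path


def user_lang_dir(lang, env=os.environ):
    """Return the per-user directory for a full language name such as "de-de".

    Hypothetical helper: mirrors the documented layout
    $XDG_CONFIG_HOME/gruut/<lang>/ with $HOME/.config as the default.
    """
    # Fall back to $HOME/.config when XDG_CONFIG_HOME is unset or empty
    config_home = env.get("XDG_CONFIG_HOME") or os.path.join(
        env.get("HOME", "~"), ".config"
    )
    return Path(config_home).expanduser() / "gruut" / lang


if __name__ == "__main__":
    print(user_lang_dir("de-de"))
```

Passing a plain dict as `env` makes the helper easy to try without touching the real environment.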
gruut currently supports:
- Arabic (`ar`)
- Czech (`cs` or `cs-cz`)
- German (`de` or `de-de`)
- English (`en` or `en-us`)
- Spanish (`es` or `es-es`)
- Farsi/Persian (`fa`)
- French (`fr` or `fr-fr`)
- Italian (`it` or `it-it`)
- Luxembourgish (`lb`)
- Dutch (`nl`)
- Russian (`ru` or `ru-ru`)
- Swedish (`sv` or `sv-se`)
- Swahili (`sw`)
The goal is to support all of voice2json's languages
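Because the language file directory expects the full language name, it can help to normalize short codes first. The sketch below builds an alias table from the list above; `LANG_ALIASES` and `full_lang_name` are hypothetical names for this example, and gruut's own language resolution may accept more spellings than shown here.

```python
# Short code -> full language name, taken from the supported language list.
# Languages listed with a single code (ar, fa, lb, nl, sw) map to themselves.
LANG_ALIASES = {
    "cs": "cs-cz",
    "de": "de-de",
    "en": "en-us",
    "es": "es-es",
    "fr": "fr-fr",
    "it": "it-it",
    "ru": "ru-ru",
    "sv": "sv-se",
}


def full_lang_name(lang):
    """Normalize case and separators, then expand a short code if one is known."""
    normalized = lang.strip().lower().replace("_", "-")
    return LANG_ALIASES.get(normalized, normalized)


if __name__ == "__main__":
    for code in ("de", "en-US", "nl"):
        print(code, "->", full_lang_name(code))
```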
- Python 3.7 or higher
- Linux
  - Tested on Debian Bullseye
- num2words fork and Babel
  - Currency/number handling
  - num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
- gruut-ipa
  - IPA pronunciation manipulation
- pycrfsuite
  - Part of speech tagging and grapheme to phoneme models
- pydateparser
  - Date parsing for multiple languages
gruut can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., `<s lang="...">`).
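To see why the language matters for parsing, here is a small standard-library sketch. It is not how gruut parses dates (the dependency list names pydateparser for that); the `DATE_FORMATS` table and `parse_numeric_date` function are assumptions made for illustration.

```python
from datetime import datetime

# Assumed field order per language, for illustration only
DATE_FORMATS = {
    "en-us": "%m/%d/%Y",  # month/day/year
    "it-it": "%d/%m/%Y",  # day/month/year
}


def parse_numeric_date(text, lang):
    """Parse a numeric date using the field order assumed for `lang`."""
    return datetime.strptime(text, DATE_FORMATS[lang]).date()


if __name__ == "__main__":
    print(parse_numeric_date("2/1/2000", "en-us"))  # 2000-02-01
    print(parse_numeric_date("2/1/2000", "it-it"))  # 2000-01-02
```

This matches the SSML example earlier in this README, where "2/1/2000" is read as "February first" in the English sentence and as "due gennaio" in the Italian one.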
The following types of expressions can be automatically expanded into words by gruut:
- Numbers - "123" to "one hundred and twenty three" (disable with `verbalize_numbers=False` or `--no-numbers`)
  - Relies on `Babel` for parsing and `num2words` for verbalization
- Dates - "1/1/2020" to "January first, twenty twenty" (disable with `verbalize_dates=False` or `--no-dates`)
  - Relies on `pydateparser` for parsing and both `Babel` and `num2words` for verbalization
- Currency - "$10" to "ten dollars" (disable with `verbalize_currency=False` or `--no-currency`)
  - Relies on `Babel` for parsing and both `Babel` and `num2words` for verbalization
- Times - "12:01am" to "twelve oh one A M" (disable with `verbalize_times=False` or `--no-times`)
  - English only
  - Relies on `num2words` for verbalization
The `gruut` module can be executed with `python3 -m gruut --language <LANGUAGE> <TEXT>` or with the `gruut` command (from `setup.py`).
The `gruut` command is line-oriented, consuming text and producing JSONL. You will probably want to install jq to manipulate the JSONL output from `gruut`.
Takes raw text and outputs JSONL with cleaned words/tokens.

    echo 'This, right here, is some "RAW" text!' \
      | gruut --language en-us \
      | jq --raw-output '.words[].text'
    This
    ,
    right
    here
    ,
    is
    some
    "
    RAW
    "
    text
    !

More information is available in the full JSON output:
gruut --language en-us 'More text.' | jq .
Output:

    {
      "idx": 0,
      "text": "More text.",
      "text_with_ws": "More text.",
      "text_spoken": "More text",
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "words": [
        {
          "idx": 0,
          "text": "More",
          "text_with_ws": "More",
          "leading_ws": "",
          "training_ws": "",
          "sent_idx": 0,
          "par_idx": 0,
          "lang": "en-us",
          "voice": "",
          "pos": "JJR",
          "phonemes": [
            "m",
            "ˈɔ",
            "ɹ"
          ],
          "is_major_break": false,
          "is_minor_break": false,
          "is_punctuation": false,
          "is_break": false,
          "is_spoken": true,
          "pause_before_ms": 0,
          "pause_after_ms": 0
        },
        {
          "idx": 1,
          "text": "text",
          "text_with_ws": "text",
          "leading_ws": "",
          "training_ws": "",
          "sent_idx": 0,
          "par_idx": 0,
          "lang": "en-us",
          "voice": "",
          "pos": "NN",
          "phonemes": [
            "t",
            "ˈɛ",
            "k",
            "s",
            "t"
          ],
          "is_major_break": false,
          "is_minor_break": false,
          "is_punctuation": false,
          "is_break": false,
          "is_spoken": true,
          "pause_before_ms": 0,
          "pause_after_ms": 0
        },
        {
          "idx": 2,
          "text": ".",
          "text_with_ws": ".",
          "leading_ws": "",
          "training_ws": "",
          "sent_idx": 0,
          "par_idx": 0,
          "lang": "en-us",
          "voice": "",
          "pos": null,
          "phonemes": [
            "‖"
          ],
          "is_major_break": true,
          "is_minor_break": false,
          "is_punctuation": false,
          "is_break": true,
          "is_spoken": false,
          "pause_before_ms": 0,
          "pause_after_ms": 0
        }
      ],
      "pause_before_ms": 0,
      "pause_after_ms": 0
    }

For the whole input line and each word, the `text` property contains the processed input text with normalized whitespace while `text_with_ws` retains the original whitespace. The `text_spoken` property only contains words that are spoken, so punctuation and breaks are excluded.
Within each word, there is:
- `idx` - zero-based index of the word in the sentence
- `sent_idx` - zero-based index of the sentence in the input text
- `pos` - part of speech tag (if available)
- `phonemes` - list of IPA phonemes for the word (if available)
- `is_minor_break` - `true` if "word" separates phrases (comma, semicolon, etc.)
- `is_major_break` - `true` if "word" separates sentences (period, question mark, etc.)
- `is_break` - `true` if "word" is a major or minor break
- `is_punctuation` - `true` if "word" is a surrounding punctuation mark (quote, bracket, etc.)
- `is_spoken` - `true` if not a break or punctuation
See `python3 -m gruut <LANGUAGE> --help` for more options.
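If jq is not available, the same fields can be read with Python's standard `json` module. The sketch below works on one abbreviated output line (reduced to the fields it uses, with values taken from the example output above); `spoken_pronunciations` is a hypothetical helper name, not part of gruut.

```python
import json

# One abbreviated gruut output line, reduced to the fields used here
line = json.dumps(
    {
        "text": "More text.",
        "words": [
            {"text": "More", "phonemes": ["m", "ˈɔ", "ɹ"], "is_spoken": True},
            {"text": "text", "phonemes": ["t", "ˈɛ", "k", "s", "t"], "is_spoken": True},
            {"text": ".", "phonemes": ["‖"], "is_spoken": False},
        ],
    },
    ensure_ascii=False,
)


def spoken_pronunciations(jsonl_line):
    """Return (word text, phoneme string) pairs for the spoken words of one line."""
    sentence = json.loads(jsonl_line)
    return [
        (word["text"], " ".join(word.get("phonemes") or []))
        for word in sentence["words"]
        if word.get("is_spoken")
    ]


if __name__ == "__main__":
    for text, phonemes in spoken_pronunciations(line):
        print(text, phonemes)
```

In a real pipeline the lines would come from the `gruut` command's standard output, one JSON object per line.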
A subset of SSML is supported:
- `<speak>` - wrap around SSML text
  - `lang` - set language for document
- `<p>` - paragraph
  - `lang` - set language for paragraph
- `<s>` - sentence (disables automatic sentence breaking)
  - `lang` - set language for sentence
- `<w>` / `<token>` - word (disables automatic tokenization)
  - `lang` - set language for word
  - `role` - set word role (see word roles)
- `<lang lang="...">` - set language of inner text
- `<voice name="...">` - set voice of inner text
- `<say-as interpret-as="">` - force interpretation of inner text
  - `interpret-as` - one of "spell-out", "date", "number", "time", or "currency"
  - `format` - way to format text depending on `interpret-as`
    - number - one of "cardinal", "ordinal", "digits", "year"
    - date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
- `<break time="">` - pause for given amount of time
  - `time` - seconds ("123s") or milliseconds ("123ms")
- `<mark name="">` - user-defined mark (`marks_before` and `marks_after` attributes of words/sentences)
  - `name` - name of mark
- `<sub alias="">` - substitute `alias` for inner text
- `<phoneme ph="...">` - supply phonemes for inner text
  - `ph` - phonemes for each word of inner text, separated by whitespace
- `<lexicon>` - inline or external pronunciation lexicon
  - `id` - unique id of lexicon (used in `<lookup ref="...">`)
  - `uri` - if empty or missing, lexicon is inline
  - One or more `<lexeme>` child elements with:
    - Optional `role="..."` (word roles separated by whitespace)
    - `<grapheme>WORD</grapheme>` - word text
    - `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
- `<lookup ref="...">` - use pronunciation lexicon for child elements
  - `ref` - id from a `<lexicon>`
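As an illustration of the two `<break>` time formats listed above, the following sketch converts a `time` value to milliseconds. The regular expression and the `break_time_ms` name are assumptions for this example; gruut's own parsing may differ in the details it accepts.

```python
import re

# Number followed by "ms" or "s", e.g. "123ms" or "123s"
_BREAK_TIME = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(ms|s)\s*$")


def break_time_ms(value):
    """Convert an SSML-style break time string to whole milliseconds."""
    match = _BREAK_TIME.match(value)
    if match is None:
        raise ValueError(f"Unsupported break time: {value!r}")
    amount = float(match.group(1))
    if match.group(2) == "s":
        amount *= 1000  # seconds to milliseconds
    return int(round(amount))


if __name__ == "__main__":
    print(break_time_ms("123s"))   # 123000
    print(break_time_ms("123ms"))  # 123
```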
During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that e.g., "a" should be spoken as `/eɪ/` instead of `/ə/`.
For `en-us`, the following additional roles are available from the part-of-speech tagger:
- `gruut:CD` - number
- `gruut:DT` - determiner
- `gruut:IN` - preposition or subordinating conjunction
- `gruut:JJ` - adjective
- `gruut:NN` - noun
- `gruut:PRP` - personal pronoun
- `gruut:RB` - adverb
- `gruut:VB` - verb
- `gruut:VBD` - verb (past tense)
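A role-aware lookup can be sketched as a nested dictionary. This is a toy model, not gruut's lexicon format: `LEXICON`, `role_for`, and `lookup` are hypothetical names, the pronunciations come from the examples in this README, and the tags assigned to the two readings of "wound" are assumptions (`VBD` is the Penn Treebank tag for a past tense verb).

```python
# Toy lexicon: word -> {role -> phonemes}; "" marks the default role
LEXICON = {
    "wound": {
        "gruut:NN": ["w", "ˈu", "n", "d"],    # the injury (tag assumed)
        "gruut:VBD": ["w", "ˈaʊ", "n", "d"],  # past tense of "wind" (tag assumed)
    },
    "a": {
        "": ["ə"],
        "gruut:letter": ["eɪ"],
    },
}


def role_for(pos_tag):
    """Derive a role from a part of speech tag as gruut:<TAG>."""
    return f"gruut:{pos_tag}" if pos_tag else ""


def lookup(word, role=""):
    """Return phonemes for the word and role, or None if the word is unknown."""
    by_role = LEXICON.get(word.lower())
    if not by_role:
        return None
    if role in by_role:
        return by_role[role]
    # Unknown role: prefer the default entry, else the first listed pronunciation
    return by_role.get("", next(iter(by_role.values())))


if __name__ == "__main__":
    print(lookup("wound", role_for("VBD")))
    print(lookup("wound", role_for("NN")))
    print(lookup("a", "gruut:letter"))
```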
Inline pronunciation lexicons are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the SSML standard here by allowing lexicons to be defined within the SSML document itself (`uri` is blank or missing). Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.
For example, the following document will yield three different pronunciations for the word "tomato":

    <?xml version="1.0"?>
    <speak version="1.1"
           xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
           xml:lang="en-US">

      <lexicon xml:id="test" alphabet="ipa">
        <lexeme>
          <grapheme>
            tomato
          </grapheme>
          <phoneme>
            <!-- Individual phonemes are separated by whitespace -->
            t ə m ˈɑ t oʊ
          </phoneme>
        </lexeme>
        <lexeme>
          <grapheme role="fake-role">
            tomato
          </grapheme>
          <phoneme>
            <!-- Made up pronunciation for fake word role -->
            t ə m ˈi t oʊ
          </phoneme>
        </lexeme>
      </lexicon>

      <w>tomato</w>
      <lookup ref="test">
        <w>tomato</w>
        <w role="fake-role">tomato</w>
      </lookup>
    </speak>

The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a role attached (selecting a made up pronunciation in this case).
Even further from the SSML standard, gruut allows you to leave off the `<lexicon>` id entirely. With no `id`, a `<lookup>` tag is no longer needed, allowing you to override the pronunciation of any word in the document:

    <?xml version="1.0"?>
    <speak version="1.1"
           xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
           xml:lang="en-US">

      <!-- No id means change all words without a lookup -->
      <lexicon>
        <lexeme>
          <grapheme>
            tomato
          </grapheme>
          <phoneme>
            t ə m ˈɑ t oʊ
          </phoneme>
        </lexeme>
      </lexicon>

      <w>tomato</w>
    </speak>

This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they have a `<lookup>`).
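To make the lexicon structure concrete, here is a standard-library sketch that reads a namespace-free `<lexicon>` fragment into a dictionary keyed by word and role. It is not gruut's parser; `load_inline_lexicon` is a hypothetical helper, and it accepts `role` on either `<lexeme>` or `<grapheme>` because both placements appear in this README.

```python
import xml.etree.ElementTree as ET

LEXICON_XML = """
<lexicon alphabet="ipa">
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>t ə m ˈɑ t oʊ</phoneme>
  </lexeme>
  <lexeme>
    <grapheme role="fake-role">tomato</grapheme>
    <phoneme>t ə m ˈi t oʊ</phoneme>
  </lexeme>
</lexicon>
"""


def load_inline_lexicon(xml_text):
    """Map (word, role) to a list of phonemes; role "" means no role given."""
    lexicon = {}
    for lexeme in ET.fromstring(xml_text.strip()).iter("lexeme"):
        grapheme = lexeme.find("grapheme")
        phoneme = lexeme.find("phoneme")
        if grapheme is None or phoneme is None:
            continue  # skip incomplete entries
        word = "".join(grapheme.itertext()).strip()
        # Phonemes are separated by whitespace
        phonemes = "".join(phoneme.itertext()).split()
        # Roles are separated by whitespace; no role means the default entry
        roles = (lexeme.get("role") or grapheme.get("role") or "").split() or [""]
        for role in roles:
            lexicon[(word, role)] = phonemes
    return lexicon


if __name__ == "__main__":
    for key, phonemes in load_inline_lexicon(LEXICON_XML).items():
        print(key, " ".join(phonemes))
```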
gruut is useful for transforming raw text into phonetic pronunciations, similar to phonemizer. Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a carefully chosen inventory.
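The lookup-then-guess strategy described above can be sketched as follows. Everything here is a stand-in: `pronounce`, `toy_lexicon`, and `toy_guess` are hypothetical, and `toy_guess` simply returns one symbol per letter where a real system would call a trained grapheme-to-phoneme model.

```python
def pronounce(word, lexicon, guess):
    """Return phonemes from the lexicon, falling back to a guesser for unknown words."""
    phonemes = lexicon.get(word.lower())
    if phonemes is not None:
        return phonemes
    return guess(word)


# Toy pronunciation dictionary (entry taken from the example output in this README)
toy_lexicon = {"text": ["t", "ˈɛ", "k", "s", "t"]}


def toy_guess(word):
    """Placeholder guesser: one symbol per letter, NOT a real g2p model."""
    return list(word.lower())


if __name__ == "__main__":
    print(pronounce("Text", toy_lexicon, toy_guess))
    print(pronounce("zyx", toy_lexicon, toy_guess))
```

Keeping the guesser as a plain function argument makes it easy to swap in a different model without touching the lookup logic.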
For each supported language, gruut includes:
- A word pronunciation lexicon built from open source data
  - See pron_dict
- A pre-trained grapheme-to-phoneme model for guessing word pronunciations
Some languages also include:
- A pre-trained part of speech tagger built from open source data: