rhasspy/gruut

A tokenizer, text cleaner, and phonemizer for many human languages.

License

MIT license

Gruut

A tokenizer, text cleaner, and IPA phonemizer for several human languages that supports SSML.

    from gruut import sentences

    text = 'He wound it around the wound, saying "I read it was $10 to read."'

    for sent in sentences(text, lang="en-us"):
        for word in sent:
            if word.phonemes:
                print(word.text, *word.phonemes)

which outputs:

    He h ˈi
    wound w ˈaʊ n d
    it ˈɪ t
    around ɚ ˈaʊ n d
    the ð ə
    wound w ˈu n d
    , |
    saying s ˈeɪ ɪ ŋ
    I ˈaɪ
    read ɹ ˈɛ d
    it ˈɪ t
    was w ə z
    ten t ˈɛ n
    dollars d ˈɑ l ɚ z
    to t ə
    read ɹ ˈi d
    . ‖

Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.

A subset of SSML is also supported:

    from gruut import sentences

    ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
    <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                    http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
        xml:lang="en-US">
    <s>Today at 4pm, 2/1/2000.</s>
    <s xml:lang="it">Un mese fà, 2/1/2000.</s>
    </speak>"""

    for sent in sentences(ssml_text, ssml=True):
        for word in sent:
            if word.phonemes:
                print(sent.idx, word.lang, word.text, *word.phonemes)

with the output:

    0 en-US Today t ə d ˈeɪ
    0 en-US at ˈæ t
    0 en-US four f ˈɔ ɹ
    0 en-US P p ˈi
    0 en-US M ˈɛ m
    0 en-US , |
    0 en-US February f ˈɛ b j u ˌɛ ɹ i
    0 en-US first f ˈɚ s t
    0 en-US , |
    0 en-US two t ˈu
    0 en-US thousand θ ˈaʊ z ə n d
    0 en-US . ‖
    1 it Un u n
    1 it mese ˈm e s e
    1 it fà f a
    1 it , |
    1 it due d j u
    1 it gennaio d͡ʒ e n n ˈa j o
    1 it duemila d u e ˈm i l a
    1 it . ‖

See the documentation for more details.

Installation

pip install gruut

Languages besides English can be added during installation. For example, with French and Italian support:

pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]

The extra pip repo is needed for an updated num2words fork that includes support for more languages.

You may also manually download language files and put them in $XDG_CONFIG_HOME/gruut/ ($HOME/.config/gruut by default).

gruut will look for language files in the directory $XDG_CONFIG_HOME/gruut/<lang>/ if the corresponding Python package is not installed. Note that <lang> here is the full language name, e.g. de-de instead of just de.
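
The lookup directory described above can be computed with the standard library alone. This is a minimal sketch, not gruut's own code: the function name `gruut_lang_dir` and its `env` parameter are hypothetical and exist only to make the path rule concrete.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def gruut_lang_dir(lang: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where manually downloaded files for `lang` would go.

    `lang` must be the full language name, e.g. "de-de" rather than "de".
    This helper is hypothetical; it only mirrors the rule stated in the text.
    """
    env = os.environ if env is None else env
    config_home = env.get("XDG_CONFIG_HOME")
    if config_home:
        base = Path(config_home)
    else:
        # Default when XDG_CONFIG_HOME is unset: $HOME/.config
        home = Path(env["HOME"]) if "HOME" in env else Path.home()
        base = home / ".config"
    return base / "gruut" / lang


print(gruut_lang_dir("de-de"))
```

Passing a custom `env` mapping makes it easy to check both the explicit and the default case without touching the real environment.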

Supported Languages

gruut currently supports:

Arabic (ar)
Czech (cs or cs-cz)
German (de or de-de)
English (en or en-us)
Spanish (es or es-es)
Farsi/Persian (fa)
French (fr or fr-fr)
Italian (it or it-it)
Luxembourgish (lb)
Dutch (nl)
Russian (ru or ru-ru)
Swedish (sv or sv-se)
Swahili (sw)

The goal is to support all of voice2json's languages.

Dependencies

Python 3.7 or higher
Linux
- Tested on Debian Bullseye
num2words fork and Babel
- Currency/number handling
- num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
gruut-ipa
- IPA pronunciation manipulation
pycrfsuite
- Part of speech tagging and grapheme to phoneme models
pydateparser
- Date parsing for multiple languages

Numbers, Dates, and More

gruut can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., <s lang="...">).

The following types of expressions can be automatically expanded into words by gruut:

Numbers - "123" to "one hundred and twenty three" (disable with verbalize_numbers=False or --no-numbers)
- Relies on Babel for parsing and num2words for verbalization
Dates - "1/1/2020" to "January first, twenty twenty" (disable with verbalize_dates=False or --no-dates)
- Relies on pydateparser for parsing and both Babel and num2words for verbalization
Currency - "$10" to "ten dollars" (disable with verbalize_currency=False or --no-currency)
- Relies on Babel for parsing and both Babel and num2words for verbalization
Times - "12:01am" to "twelve oh one A M" (disable with verbalize_times=False or --no-times)
- English only
- Relies on num2words for verbalization
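
To see why the language matters for dates, here is a small standard-library illustration. It is not how gruut parses dates (gruut relies on pydateparser and Babel); the `DATE_FORMATS` table and `parse_numeric_date` function are assumptions made for this sketch, covering only the two languages used in the SSML example earlier in this README.

```python
from datetime import date, datetime

# Assumed field order per language (illustrative only).
DATE_FORMATS = {
    "en-us": "%m/%d/%Y",  # month first
    "it": "%d/%m/%Y",     # day first
}


def parse_numeric_date(text: str, lang: str) -> date:
    """Parse a numeric date using the field order assumed for `lang`."""
    return datetime.strptime(text, DATE_FORMATS[lang]).date()


# The same string yields two different dates.
print(parse_numeric_date("2/1/2000", "en-us"))
print(parse_numeric_date("2/1/2000", "it"))
```

This matches the earlier SSML output, where "2/1/2000" is read as "February first" in the en-US sentence and as "due gennaio" in the Italian one.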

Command-Line Usage

The gruut module can be executed with python3 -m gruut --language <LANGUAGE> <TEXT> or with the gruut command (from setup.py).

The gruut command is line-oriented, consuming text and producing JSONL. You will probably want to install jq to manipulate the JSONL output from gruut.

Plain Text

Takes raw text and outputs JSONL with cleaned words/tokens.

    echo 'This, right here, is some "RAW" text!' \
      | gruut --language en-us \
      | jq --raw-output '.words[].text'
    This
    ,
    right
    here
    ,
    is
    some
    "
    RAW
    "
    text
    !

More information is available in the full JSON output:

    gruut --language en-us 'More  text.' | jq .

Output:

    {
      "idx": 0,
      "text": "More text.",
      "text_with_ws": "More text.",
      "text_spoken": "More text",
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "words": [
        {
          "idx": 0,
          "text": "More",
          "text_with_ws": "More",
          "leading_ws": "",
          "training_ws": "",
          "sent_idx": 0,
          "par_idx": 0,
          "lang": "en-us",
          "voice": "",
          "pos": "JJR",
          "phonemes": ["m", "ˈɔ", "ɹ"],
          "is_major_break": false,
          "is_minor_break": false,
          "is_punctuation": false,
          "is_break": false,
          "is_spoken": true,
          "pause_before_ms": 0,
          "pause_after_ms": 0
        },
        {
          "idx": 1,
          "text": "text",
          "text_with_ws": "text",
          "leading_ws": "",
          "training_ws": "",
          "sent_idx": 0,
          "par_idx": 0,
          "lang": "en-us",
          "voice": "",
          "pos": "NN",
          "phonemes": ["t", "ˈɛ", "k", "s", "t"],
          "is_major_break": false,
          "is_minor_break": false,
          "is_punctuation": false,
          "is_break": false,
          "is_spoken": true,
          "pause_before_ms": 0,
          "pause_after_ms": 0
        },
        {
          "idx": 2,
          "text": ".",
          "text_with_ws": ".",
          "leading_ws": "",
          "training_ws": "",
          "sent_idx": 0,
          "par_idx": 0,
          "lang": "en-us",
          "voice": "",
          "pos": null,
          "phonemes": ["‖"],
          "is_major_break": true,
          "is_minor_break": false,
          "is_punctuation": false,
          "is_break": true,
          "is_spoken": false,
          "pause_before_ms": 0,
          "pause_after_ms": 0
        }
      ],
      "pause_before_ms": 0,
      "pause_after_ms": 0
    }

For the whole input line and each word, the text property contains the processed input text with normalized whitespace while text_with_ws retains the original whitespace. The text_spoken property only contains words that are spoken, so punctuation and breaks are excluded.

Within each word, there is:

idx - zero-based index of the word in the sentence
sent_idx - zero-based index of the sentence in the input text
pos - part of speech tag (if available)
phonemes - list of IPA phonemes for the word (if available)
is_minor_break - true if "word" separates phrases (comma, semicolon, etc.)
is_major_break - true if "word" separates sentences (period, question mark, etc.)
is_break - true if "word" is a major or minor break
is_punctuation - true if "word" is a surrounding punctuation mark (quote, bracket, etc.)
is_spoken - true if not a break or punctuation
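
As an alternative to jq, the JSONL output can be consumed from Python with the standard json module. The sketch below assumes one JSON object per line, as in the example output above; the sample line is trimmed to a few of the fields listed here, and the `spoken_words` helper is hypothetical.

```python
import json

# One JSONL line, trimmed from the example output above (most fields omitted).
LINE = (
    '{"idx": 0, "text": "More text.", "words": ['
    '{"text": "More", "phonemes": ["m", "ˈɔ", "ɹ"], "is_spoken": true},'
    '{"text": "text", "phonemes": ["t", "ˈɛ", "k", "s", "t"], "is_spoken": true},'
    '{"text": ".", "phonemes": ["‖"], "is_spoken": false}'
    ']}'
)


def spoken_words(jsonl_line: str) -> list:
    """Return (text, phoneme string) pairs for words flagged as spoken."""
    sentence = json.loads(jsonl_line)
    return [
        (word["text"], " ".join(word["phonemes"]))
        for word in sentence["words"]
        if word["is_spoken"]
    ]


for text, phonemes in spoken_words(LINE):
    print(text, phonemes)
```

Filtering on is_spoken drops the sentence-final break, which carries the "‖" phoneme but is not a spoken word.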

See python3 -m gruut <LANGUAGE> --help for more options.

SSML

A subset of SSML is supported:

<speak> - wrap around SSML text
- lang - set language for document
<p> - paragraph
- lang - set language for paragraph
<s> - sentence (disables automatic sentence breaking)
- lang - set language for sentence
<w> / <token> - word (disables automatic tokenization)
- lang - set language for word
- role - set word role (see word roles)
<lang lang="..."> - set language of inner text
<voice name="..."> - set voice of inner text
<say-as interpret-as=""> - force interpretation of inner text
- interpret-as - one of "spell-out", "date", "number", "time", or "currency"
- format - way to format text depending on interpret-as
  - number - one of "cardinal", "ordinal", "digits", "year"
  - date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
<break time=""> - pause for given amount of time
- time - seconds ("123s") or milliseconds ("123ms")
<mark name=""> - user-defined mark (marks_before and marks_after attributes of words/sentences)
- name - name of mark
<sub alias=""> - substitute alias for inner text
<phoneme ph="..."> - supply phonemes for inner text
- ph - phonemes for each word of inner text, separated by whitespace
<lexicon> - inline or external pronunciation lexicon
- id - unique id of lexicon (used in <lookup ref="...">)
- uri - if empty or missing, lexicon is inline
- One or more <lexeme> child elements with:
  - Optional role="..." (word roles separated by whitespace)
  - <grapheme>WORD</grapheme> - word text
  - <phoneme>P H O N E M E S</phoneme> - word pronunciation (phonemes separated by whitespace)
<lookup ref="..."> - use pronunciation lexicon for child elements
- ref - id from a <lexicon>
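
The two time formats accepted by <break> can be normalized to milliseconds with a few lines of Python. This is a sketch of the format described above, not gruut's parser; the function name `break_time_ms` is hypothetical, and accepting fractional values such as "0.5s" is an assumption.

```python
def break_time_ms(time: str) -> int:
    """Convert a <break time="..."> value such as "123s" or "123ms" to milliseconds."""
    value = time.strip().lower()
    # Check the "ms" suffix first, because it also ends with "s".
    if value.endswith("ms"):
        return round(float(value[:-2]))
    if value.endswith("s"):
        return round(float(value[:-1]) * 1000)
    raise ValueError(f"Unsupported break time: {time!r}")


print(break_time_ms("123s"))
print(break_time_ms("123ms"))
```

Milliseconds are a natural target unit here because the JSON output reports pauses as pause_before_ms and pause_after_ms.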

Word Roles

During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as gruut:<TAG>. For initialisms and spell-out, the role gruut:letter is used to indicate that e.g., "a" should be spoken as /eɪ/ instead of /ə/.

For en-us, the following additional roles are available from the part-of-speech tagger:

gruut:CD - number
gruut:DT - determiner
gruut:IN - preposition or subordinating conjunction
gruut:JJ - adjective
gruut:NN - noun
gruut:PRP - personal pronoun
gruut:RB - adverb
gruut:VB - verb
gruut:VBD - verb (past tense)
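
The idea of role-based disambiguation can be sketched with a plain dictionary. Everything below is illustrative: the `LEXICON` layout, the `lookup` function, and the fallback to the first listed pronunciation are assumptions rather than gruut's internal data structures. The two pronunciations of "wound" are taken from the example output at the top of this README, and VBD is assumed to be the tag for the past-tense verb.

```python
from typing import Dict, List, Optional

# Hypothetical role-keyed lexicon: word -> role -> phonemes.
LEXICON: Dict[str, Dict[str, List[str]]] = {
    "wound": {
        "gruut:NN": ["w", "ˈu", "n", "d"],    # noun, as in "the wound"
        "gruut:VBD": ["w", "ˈaʊ", "n", "d"],  # past-tense verb, as in "he wound it"
    },
}


def role_for(pos_tag: str) -> str:
    """Derive a role from a part of speech tag as gruut:<TAG>."""
    return f"gruut:{pos_tag}"


def lookup(word: str, pos_tag: Optional[str] = None, role: Optional[str] = None) -> Optional[List[str]]:
    """Pick a pronunciation by explicit role, else by the role derived from the tag."""
    entries = LEXICON.get(word.lower())
    if not entries:
        return None
    chosen = role or (role_for(pos_tag) if pos_tag else None)
    if chosen in entries:
        return entries[chosen]
    # Assumed fallback: first listed pronunciation when no role matches.
    return next(iter(entries.values()))


print(" ".join(lookup("wound", pos_tag="VBD")))
print(" ".join(lookup("wound", pos_tag="NN")))
```

An explicitly supplied role takes precedence over the tag, mirroring the "unless manually specified" rule above.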

Inline Lexicons

Inline pronunciation lexicons are supported via the <lexicon> and <lookup> tags. gruut diverges slightly from the SSML standard here by allowing lexicons to be defined within the SSML document itself (uri is blank or missing). Additionally, the id attribute of the <lexicon> element can be left off to indicate a "default" inline lexicon that does not require a corresponding <lookup> tag.

For example, the following document will yield three different pronunciations for the word "tomato":

    <?xml version="1.0"?>
    <speak version="1.1"
           xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                     http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
           xml:lang="en-US">

      <lexicon xml:id="test" alphabet="ipa">
        <lexeme>
          <grapheme>
            tomato
          </grapheme>
          <phoneme>
            <!-- Individual phonemes are separated by whitespace -->
            t ə m ˈɑ t oʊ
          </phoneme>
        </lexeme>
        <lexeme>
          <grapheme role="fake-role">
            tomato
          </grapheme>
          <phoneme>
            <!-- Made up pronunciation for fake word role -->
            t ə m ˈi t oʊ
          </phoneme>
        </lexeme>
      </lexicon>

      <w>tomato</w>
      <lookup ref="test">
        <w>tomato</w>
        <w role="fake-role">tomato</w>
      </lookup>
    </speak>

The first "tomato" will be looked up in the U.S. English lexicon (/t ə m ˈeɪ t oʊ/). Within the <lookup> tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a role attached (selecting a made up pronunciation in this case).

Even further from the SSML standard, gruut allows you to leave off the <lexicon> id entirely. With no id, a <lookup> tag is no longer needed, allowing you to override the pronunciation of any word in the document:

    <?xml version="1.0"?>
    <speak version="1.1"
           xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                     http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
           xml:lang="en-US">

      <!-- No id means change all words without a lookup -->
      <lexicon>
        <lexeme>
          <grapheme>
            tomato
          </grapheme>
          <phoneme>
            t ə m ˈɑ t oʊ
          </phoneme>
        </lexeme>
      </lexicon>

      <w>tomato</w>
    </speak>

This will yield a pronunciation of /t ə m ˈɑ t oʊ/ for all instances of "tomato" in the document (unless they have a <lookup>).
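
To make the lexicon structure concrete, the following sketch reads inline lexicons out of an SSML string with Python's xml.etree.ElementTree. It is not gruut's SSML parser: the `read_lexicons` function is hypothetical, and reading the role from either the <lexeme> or the <grapheme> element is an assumption based on the list and example above. Lexicons without an id are stored under None, standing in for the "default" lexicon.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SSML = """
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈɑ t oʊ</phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">tomato</grapheme>
      <phoneme>t ə m ˈi t oʊ</phoneme>
    </lexeme>
  </lexicon>
  <w>tomato</w>
</speak>
"""


def read_lexicons(ssml: str) -> dict:
    """Map lexicon id -> {(word, role): [phonemes]} for every inline lexicon."""
    root = ET.fromstring(ssml.strip())
    lexicons = {}
    for lexicon in root.iter(f"{{{SSML_NS}}}lexicon"):
        lexicon_id = lexicon.get(XML_ID) or lexicon.get("id")
        entries = lexicons.setdefault(lexicon_id, {})
        for lexeme in lexicon.iter(f"{{{SSML_NS}}}lexeme"):
            grapheme = lexeme.find(f"{{{SSML_NS}}}grapheme")
            phoneme = lexeme.find(f"{{{SSML_NS}}}phoneme")
            if grapheme is None or phoneme is None:
                continue
            role = lexeme.get("role") or grapheme.get("role")
            word = (grapheme.text or "").strip()
            # Phonemes are separated by whitespace.
            entries[(word, role)] = (phoneme.text or "").split()
    return lexicons


for (word, role), phonemes in read_lexicons(SSML)["test"].items():
    print(word, role, " ".join(phonemes))
```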

Intended Audience

gruut is useful for transforming raw text into phonetic pronunciations, similar to phonemizer. Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a carefully chosen inventory.
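
The lexicon-first, guess-second strategy can be sketched as follows. This is a toy illustration and not gruut's implementation: `phonemize_word` is a hypothetical name, and `placeholder_guess` merely stands in for a trained grapheme-to-phoneme model by returning the word's letters.

```python
from typing import Callable, Dict, List


def phonemize_word(
    word: str,
    lexicon: Dict[str, List[str]],
    guess: Callable[[str], List[str]],
) -> List[str]:
    """Return the lexicon pronunciation if present, otherwise a guessed one."""
    pronunciation = lexicon.get(word.lower())
    if pronunciation is not None:
        return pronunciation
    return guess(word)


def placeholder_guess(word: str) -> List[str]:
    # Stand-in for a grapheme-to-phoneme model: one symbol per letter.
    return list(word.lower())


lexicon = {"text": ["t", "ˈɛ", "k", "s", "t"]}
print(phonemize_word("text", lexicon, placeholder_guess))
print(phonemize_word("gruut", lexicon, placeholder_guess))
```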

For each supported language, gruut includes:

A word pronunciation lexicon built from open source data
- See pron_dict
A pre-trained grapheme-to-phoneme model for guessing word pronunciations

Some languages also include:

A pre-trained part of speech tagger built from open source data
- See universal dependencies

Releases

Latest release: v2.4.0 (Jul 3, 2024). 15 releases in total.
