Improve text normalize to keep original timestamps #264

Draft: fondoger wants to merge 5 commits into remsky:master from fondoger:fondoger/fix-normalize

Conversation

fondoger commented Mar 30, 2025 (edited)

Currently, the text normalization step simply replaces the original text with the normalized text. As a result, the generated timestamps no longer align with the words of the original text.

Kokoro supports embedding phonemes in the text, and the token timestamps are based on the original text.

  • Original Input Text: [Misaki](/misˈɑki/) is a G2P engine designed for [Kokoro](/kˈOkəɹO/) models.
  • Text For Timestamps: Misaki is a G2P engine designed for Kokoro models.
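
To make the relationship between the two lines above concrete, here is a minimal sketch of how the text for timestamps could be recovered from input that uses the `[text](/phonemes/)` markup. The function name and the regular expression are assumptions for illustration only; this is not the code used by Kokoro or Kokoro-FastAPI.

```python
import re

# Matches the [text](/phonemes/) markup shown in the example above.
# The pattern is an illustrative assumption, not the project's own regex.
CUSTOM_PHONEME = re.compile(r"\[([^\]]+)\]\(/[^/)]*/\)")


def text_for_timestamps(text: str) -> str:
    """Replace each [text](/phonemes/) span with its visible text."""
    return CUSTOM_PHONEME.sub(r"\1", text)


original = "[Misaki](/misˈɑki/) is a G2P engine designed for [Kokoro](/kˈOkəɹO/) models."
print(text_for_timestamps(original))
# Misaki is a G2P engine designed for Kokoro models.
```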

Before this PR:

Text:  The price will be $100 after 9:30PM.

word       start_time              end_time
The        0.0005416666666666625   0.07554166666666667
price      0.07554166666666667     0.3880416666666666
will       0.3880416666666666      0.4880416666666667
be         0.4880416666666667      0.6380416666666666
one        0.6380416666666666      0.8255416666666666
hundred    0.8255416666666666      1.1255416666666667
dollars    1.1255416666666667      1.8505416666666668
after      1.8505416666666668      2.188041666666667
nine       2.188041666666667       2.5255416666666664
thirtyPM   2.5255416666666664      3.5255416666666664
.          3.5255416666666664      3.6755416666666667

Note that $100 is mistakenly shown as one hundred dollars, and 9:30PM is shown as nine thirtyPM.

After this PR:

Text:  The price will be $100 after 9:30PM.

word       start_time              end_time
The        0.0005416666666666625   0.07554166666666667
price      0.07554166666666667     0.3880416666666666
will       0.3880416666666666      0.4880416666666667
be         0.4880416666666667      0.6380416666666666
$100       0.6380416666666666      1.8505416666666668
after      1.8505416666666668      2.188041666666667
9:30PM     2.188041666666667       3.5255416666666664
.          3.5255416666666664      3.6755416666666667

Note that both $100 and 9:30PM are now correct.
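
The difference between the two tables amounts to collapsing the timestamps of the normalized words back onto the original token: the merged entry takes the start time of the first word and the end time of the last. A rough sketch of that step follows. The `WordTimestamp` class and `merge_span` helper are hypothetical names, and the times are rounded from the table above; the PR's actual implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class WordTimestamp:
    word: str
    start_time: float
    end_time: float


def merge_span(words, first, last, original_text):
    """Collapse words[first..last] (inclusive) into one entry that carries
    the original text, the first start time and the last end time."""
    merged = WordTimestamp(original_text, words[first].start_time, words[last].end_time)
    return words[:first] + [merged] + words[last + 1:]


# Times rounded from the "Before this PR" table for readability.
normalized = [
    WordTimestamp("be", 0.488, 0.638),
    WordTimestamp("one", 0.638, 0.826),
    WordTimestamp("hundred", 0.826, 1.126),
    WordTimestamp("dollars", 1.126, 1.851),
    WordTimestamp("after", 1.851, 2.188),
]

restored = merge_span(normalized, 1, 3, "$100")
for w in restored:
    print(w.word, w.start_time, w.end_time)
```

The hard part, which this sketch leaves out, is knowing which normalized words came from which original token; that mapping has to be recorded while normalizing.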

fondoger (Author) commented Mar 30, 2025 (edited)

@remsky, @fireblade2534 Please review this PR. I tested it locally and the results look good.

fireblade2534 (Collaborator) commented

I can't test it right now, but I'll test it tomorrow.

fireblade2534 (Collaborator) left a comment (edited)

This PR looks great in concept, but there are a few problematic inputs that I want to highlight:

  • Running on localhost:7860 -> Running on [localhost:[7860](/sˈɛvənti ˈeɪt sˈɪksti/)](/lˈoʊkɐlhˌoʊst kˈoʊlən sˈɛvən θˈaʊzənd ˈeɪt hˈʌndɹɪd sˈɪksti/)
  • Email me at user@example.com -> Email me at [user@[example-com](/ɛɡzˈæmpəl dˈɑːt kˈɑːm/)](/jˈuːzɚɹ æɾ ɛɡzˈæmpəl dˈɑːt kˈɑːm/)
  • Oh yeah I have $500.60 in my bank account -> Oh ye'a I have [$[500.60](/fˈaɪv hˈʌndɹɪd pˈɔɪnt sˈɪks zˈiəɹoʊ/)](/fˈaɪv hˈʌndɹɪd ænd wˈʌn dˈɑːlɚz ænd sˈɪksti sˈɛnts/) in my bank account

What happens in all of these cases (and will happen in more) is that one pass normalizes, for example, localhost:7860, but because the text is still present inside the brackets as [localhost:7860], the number normalizer then comes along and normalizes the number again. This is an inherent issue with the way the normalizer and your code work together. The existing code does handle custom phonemes; see text_processor.py:handle_custom_phonemes and get_sentence_info.
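
The double wrapping described above can be reproduced with two toy passes that run one after the other. Both functions and the `/URL/` and `/NUM/` placeholder phonemes are made up for illustration; they only mimic the shape of the problem and are not the repository's normalizer.

```python
import re


def wrap_urls(text: str) -> str:
    # Hypothetical URL pass: wrap host:port in [text](/phonemes/) markup.
    return re.sub(r"\b\w+:\d+\b", lambda m: f"[{m.group(0)}](/URL/)", text)


def wrap_numbers(text: str) -> str:
    # Hypothetical number pass: it knows nothing about existing markup,
    # so it also rewrites digits that sit inside an earlier [ ... ] span.
    return re.sub(r"\d+", lambda m: f"[{m.group(0)}](/NUM/)", text)


once = wrap_urls("Running on localhost:7860")
twice = wrap_numbers(once)
print(once)   # Running on [localhost:7860](/URL/)
print(twice)  # Running on [localhost:[7860](/NUM/)](/URL/)
```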

fondoger (Author) commented

Thanks for the review. I'll check if I can think of better solutions to handle these cases.

fondoger (Author) commented

I just found out that the original Kokoro can already handle some basic normalization by itself.

Try it here: https://hexgrad-kokoro-tts.hf.space

  • Email me at user@example.com -> ˈimˌAl mˌi æt jˈuzəɹ æt ɪɡzˈæmpəl dˌɑt kˈɑm
  • Oh yeah I have $500.60 in my bank account -> ˈO jˈɛə ˌI hæv fˈIv hˈʌndɹəd dˈɑləɹz ænd sˈɪksti sˈɛnts ɪn mI bˈæŋk əkˈWnt

Maybe we can simply disable normalization in Kokoro-FastAPI.

fireblade2534 (Collaborator) commented

Disabling normalization in Kokoro-FastAPI has always been an option. The README has a section on how to do it.

fireblade2534 (Collaborator) commented

> Thanks for the review. I'll check if I can think of better solutions to handle these cases.

I would suggest hijacking the current system for preserving custom phonemes.
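
One way to read this suggestion is to swap already-annotated spans for placeholders before the later passes run, then restore them afterwards. The sketch below is only a guess at that idea: every name in it is hypothetical, and the repository's handle_custom_phonemes may work quite differently. The placeholders are private-use Unicode characters so that a digit-matching pass cannot touch them.

```python
import re

# Matches a non-nested [text](/phonemes/) span (illustrative pattern).
MARKUP = re.compile(r"\[[^\[\]]+\]\(/[^/)]*/\)")
PLACEHOLDER_BASE = 0xE000  # start of the Unicode private use area


def protect(text):
    """Swap each annotated span for a single placeholder character."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return chr(PLACEHOLDER_BASE + len(saved) - 1)

    return MARKUP.sub(stash, text), saved


def restore(text, saved):
    """Put the stashed spans back in place of their placeholders."""
    for i, span in enumerate(saved):
        text = text.replace(chr(PLACEHOLDER_BASE + i), span)
    return text


def wrap_numbers(text):
    # Stand-in for a later normalization pass (hypothetical).
    return re.sub(r"\d+", lambda m: f"[{m.group(0)}](/NUM/)", text)


text = "Running on [localhost:7860](/URL/) with 2 workers"
masked, saved = protect(text)
result = restore(wrap_numbers(masked), saved)
print(result)
# Running on [localhost:7860](/URL/) with [2](/NUM/) workers
```

With this ordering the number pass still handles the free-standing `2`, but the digits inside the protected span are left alone.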

remsky reacted with thumbs up emoji

fondoger marked this pull request as draft April 3, 2025 06:39
Reviewers: fireblade2534 requested changes