e2k

A tool for automatic English to Katakana conversion

e2k is a Python library that translates English to Katakana. It's based on an RNN model trained on a dictionary extracted from Wiktionary and JMdict / EDICT. It only requires numpy as a dependency.

We also provide an English-to-Katakana dictionary in the releases (not available in the PyPI package).

Usage

e2k is available on PyPI.

pip install e2k

Usage:

Two types of models are provided: one converts phonemes to Katakana, the other converts characters to Katakana. Choose the one that fits your use case.

from e2k import P2K, C2K
from g2p_en import G2p  # any g2p library with CMUdict will work

# cmudict phonemes to katakana
p2k = P2K()
g2p = G2p()

word = "vordhosbn"  # track 2 from Aphex Twin's "Drukqs"

katakana = p2k(g2p(word))
print(katakana)  # "ボードヒッチン"

# characters directly to katakana
c2k = C2K()
katakana = c2k(word)
print(katakana)  # "ボードホスン"

# we provide top_k and top_p decoding strategies
katakana = c2k(word, "top_k", k=5)  # top_k sampling
katakana = c2k(word, "top_p", p=0.9, t=2)  # top_p sampling
# see https://huggingface.co/docs/transformers/en/generation_strategies
# for more details

# you can check the accepted symbols using
in_table = c2k.in_table  # `c2k` accepts lowercase characters, space and apostrophe
in_table = p2k.in_table  # `p2k` accepts phonemes from the CMUdict and space
# and the output symbols
out_table = c2k.out_table
out_table = p2k.out_table

Pitch Accent Prediction

We also provide an RNN model for pitch accent prediction. It's trained on about 700k entries from Unidic. You can use it independently on any Katakana sequence.

from e2k import AccentPredictor as Ap
from e2k import C2K

c2k = C2K()
ap = Ap()

word = "geogaddi"
katakana = c2k(word)
accent = ap(katakana)
print(f"Katakana: {katakana}, Accent: {accent}")
# Katakana: ジオガディ, Accent: 3

# you can also check its in-table
in_table = ap.in_table  # it's katakana without special tokens

N-Gram Model

We also provide an N-Gram model to check whether an English word is suitable for pronunciation (for example, not an abbreviation like MVP or USSR). In such cases you may want to spell it out as-is.

from e2k import NGram, C2K

ngram = NGram()
c2k = C2K()

def is_valid(word):
    valid = ngram(word)
    print(f"Word: {word}, {'Valid' if valid else 'Invalid'}")

is_valid("ussr")   # invalid
is_valid("doggy")  # valid

# we also provide a util function to spell as-is
word = "ussr"
print(ngram.as_is(word))  # ユーエスエスアール

# A common practice is to convert the word when valid and spell it as-is when invalid.
# The example below will print
# `ユーエスエスアール` for `ussr` instead of `アサー`, and
# `ドギー` for `doggy` instead of `ディーオージージーワイ`
if ngram(word):
    print(c2k(word))
else:
    print(ngram.as_is(word))

in_table = ngram.in_table  # check the in_table

# you can also check the raw score
score = ngram.score(word)  # negative value, the higher the better

Warning

The model will ignore any symbols not in the in_table, which may produce unexpected results.

Note

The model lowercases the input word automatically.
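If you'd rather catch stray symbols before inference, you can filter the input against the in_table yourself. The sketch below is an illustration, not part of the library; it assumes in_table is iterable over the accepted symbols (for c2k: lowercase characters, space and apostrophe).

```python
from e2k import C2K

c2k = C2K()
allowed = set(c2k.in_table)  # assumption: in_table iterates over accepted symbols

def sanitize(word: str) -> str:
    # lowercase first (the model does this anyway), then drop any symbol
    # the model would silently ignore
    word = word.lower()
    return "".join(ch for ch in word if ch in allowed)

print(sanitize("Vordhosbn!"))  # "vordhosbn" - the "!" is dropped
```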

Performance

Katakana Prediction

We evaluate the BLEU score on 10% of the dataset.

Model     BLEU Score ↑
P2K       0.89
C2K       0.92
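As a rough illustration (not the project's eval.py), a character-level BLEU score over a held-out split could be computed with nltk; the held-out pairs and smoothing choice below are placeholders.

```python
# hypothetical sketch: character-level BLEU on a held-out split, using nltk;
# the project's eval.py may compute the score differently
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from e2k import C2K

c2k = C2K()

# placeholder held-out pairs; in practice, take 10% of katakana_dict.jsonl
held_out = [("doggy", "ドギー"), ("geogaddi", "ジオガディ")]

references = [[list(kata)] for _, kata in held_out]     # one reference per word
hypotheses = [list(c2k(word)) for word, _ in held_out]  # model predictions

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")
```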

Accent Prediction

We evaluate accuracy on 10% of the dataset.

Model     Accuracy ↑
Default   88.4%

N-Gram Model

I don't know how to evaluate the n-gram model.

Katakana Dictionary

We train the model on a dictionary extracted from Wiktionary and JMdict / EDICT. The dictionary contains 30k entries; you can also find it in the releases.

Note

The dictionary is not included in the PyPI package. Either download it from the releases or create it yourself following the instructions below.
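If you grab the dictionary from the releases, it can be read like any JSONL file. The loader below is only a sketch; the per-line schema is an assumption, so print one entry to see the actual structure.

```python
import json

# minimal JSONL loader; the per-line schema (word/reading fields) is an
# assumption - print one entry to check the actual structure
def load_katakana_dict(path="vendor/katakana_dict.jsonl"):
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries

entries = load_katakana_dict()
print(len(entries))  # roughly 30k entries
print(entries[0])    # inspect the schema of a single entry
```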

Dependencies

The extraction script has zero dependencies; as long as you have a Python 3 interpreter, it should work.

However, it's not included in the PyPI package, so you need to clone this repository to use it.

git clone https://github.com/Patchethium/e2k.git

Download data

Wiktionary

Download the raw dump of the Japanese Wiktionary from https://kaikki.org/dictionary/rawdata.html; they kindly provide the parsed data in JSONL format.

Look for the Japanese ja-extract.jsonl.gz (compressed 37.5MB) entry and download it. If you prefer the command line, use

curl -O https://kaikki.org/dictionary/downloads/ja/ja-extract.jsonl.gz

JMdict / EDICT

Download JMdict and EDICT from https://www.edrdg.org/wiki/index.php/JMdict-EDICT_Dictionary_Project.

Look for edict2.gz and download it. Or from the command line:

curl -O http://ftp.edrdg.org/pub/Nihongo/edict2.gz

Extract both files into the vendor folder.

On Linux, you can use

gzip -d ja-extract.jsonl.gz
gzip -d edict2.gz

Run the extraction

python extract.py
# if you have another name for the file
python extract.py --path /path/to/your_file.jsonl

By default, a katakana_dict.jsonl file will be created in the vendor folder.

Accent Dictionary

Go to Unidic's homepage, look for the entry unidic-mecab-2.1.2_src.zip, and download it.

Or from the command line: wget https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/unidic-mecab-2.1.2_src.zip

Extraction

Extract the zip file and place lex.csv into vendor.

Training

To train the accent predictor, simply run

python accent.py

N-Gram Model

Download cmudict, or run wget https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict.

Training

python ngram.py

Development

Install the dependencies

I use uv to manage the dependencies and publish the package.

uv sync

Then activate the virtual environment with source .venv/bin/activate, or prefix the commands with uv run.

Benchmark

The scores in Performance are obtained using the eval.py script.

# --p2k for phoneme to katakana; if not provided, it will be character to katakana
python eval.py --data ./vendor/katakana_dict.jsonl --model /path/to/your/model.pth --p2k

Train

After installing the dependencies, torch will be available as a development dependency. You can train the model using

python train.py --data ./vendor/katakana_dict.jsonl

It takes around 10 minutes on a desktop CPU. The model will be saved as vendor/model-{p2k/c2k}-e{epoch}.pth.

Also, you'll need to either download katakana_dict.jsonl from the releases or create it yourself using the extract.py script.

CUDA

The CPU version is capable of training this little model. If you prefer to use a GPU, use --extra to install torch with CUDA support,

uv sync --extra cu124
# or
uv sync --extra cu121

depending on your CUDA version.

Export

The model should be exported to numpy format for production use.

# --p2k for phoneme to katakana; if not provided, it will be character to katakana
# --fp32 for full (fp32) precision; by default we use fp16 to save space
# --output to specify the output file, in this project it's `model-{p2k/c2k}.npz`
# --safetenors to use safetensors, for easier binding in some languages
# --accent to export the accent predictor, in this project the model name is `accent.npz`
python export.py --model /path/to/your/model.pth --p2k --output /path/to/your/model.npz
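To sanity-check an exported file, you can open it with numpy and list the stored arrays. The file name below stands in for whatever you passed to --output; the array names are read from the file rather than assumed.

```python
import numpy as np

# inspect an exported model; "model-c2k.npz" stands in for your --output path
weights = np.load("model-c2k.npz")
for name in weights.files:
    arr = weights[name]
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")
```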

Note

The pretrained weights are not included in the Git repository; you can find them in the releases.

License

  • The code is released under the Unlicense (Public Domain).
  • The dictionary follows Wikimedia's license and the JMdict / EDICT copyright license.
    • In short, they both fall under CC-BY-SA.
    • The model weights are trained on the dictionary. I am not a lawyer; whether machine learning weights count as a derivative work is up to you.
  • The accent predictor model is trained on Unidic; it can be used under GPLv2.0, LGPLv2.1, or Modified BSD at your choice. See their page for further information.
  • The n-gram model is trained on CMUdict, which is under the BSD 2-Clause License.

Credits
