olsgaard/Japanese_nlp_scripts


Folders and files

README.md
corpora_software_lit-review.md
extract_ling.py
get_char_type.py
gutenberg_jp.txt
helpers.py
hiragana_katakana_translitteration.py
jp_regex.py
jp_regex_py2x.py
license.txt
matplotlib_fontcheck.py
mecab_examples.py
tinysegmenter.py
wikipedia_jp.txt


Japanese_nlp_scripts

Small example scripts for working with Japanese texts in Python.

jp_regex.py

This is a library of functions and variables that are handy when manipulating Japanese text in Python. It is written for Python 3.x and takes advantage of the fact that all strings are Unicode.

It mainly stores regexes for finding hiragana, katakana, kanji and other types of characters in a string, as well as some easy-to-use shortcut functions.

The Unicode blocks used in the regular expressions were collected from Mark Rogoyski.

    hiragana_full = r'[ぁ-ゟ]'
    katakana_full = r'[゠-ヿ]'
    kanji = r'[㐀-䶵一-鿋豈-頻]'
    radicals = r'[⺀-⿕]'
    katakana_half_width = r'[｟-ﾟ]'
    alphanum_full = r'[！-～]'
    symbols_punct = r'[、-〿]'
    misc_symbols = r'[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]'
    ascii_char = r'[ -~]'
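As a rough illustration of how such character classes can be used with the standard `re` module, here is a minimal sketch. The helper names `extract_chars` and `remove_chars` are hypothetical and not taken from jp_regex.py, and the ranges are written as `\u` escapes that are assumed to match the classes listed above.

```python
import re

# Ranges written as \u escapes; assumed to correspond to hiragana_full,
# katakana_full and kanji from the list above.
hiragana_full = r'[\u3041-\u309F]'
katakana_full = r'[\u30A0-\u30FF]'
kanji = r'[\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A]'


def extract_chars(pattern, text):
    """Return a list of every character in `text` matched by `pattern`."""
    return re.findall(pattern, text)


def remove_chars(pattern, text):
    """Return `text` with every character matched by `pattern` removed."""
    return re.sub(pattern, '', text)


sample = 'きゃりーぱみゅぱみゅは日本の歌手です。'
kanji_only = ''.join(extract_chars(kanji, sample))
without_kanji = remove_chars(kanji, sample)
```

Note that the long vowel mark ー sits in the katakana block, so it is matched by `katakana_full` even when it appears inside a hiragana word.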

hiragana_katakana_translitteration.py

This is a quick script for hiragana <-> katakana transliteration in just 4 lines of Python.

If you don't need romaji transliteration and want to reduce your script's dependencies, you can forgo pip installing some surprisingly large libraries just to convert from hiragana to katakana. Simply copy and paste the 4 lines below (preferably with a link to my homepage or GitHub) and you are good to go.

Tested in Python 3.x; it doesn't seem to work in Python 2.7. Download it from my GitHub here.

How it works

I use the built-in string method translate, which converts characters to the corresponding characters in a translation table. The table is easily created with another string method, maketrans. See the documentation here.

We simply create our hiragana and katakana translation tables and let str.translate() do the heavy lifting.

I've used Mark Rogoyski's list of hiragana and katakana Unicode code points and removed the characters I don't want transliterated. For example, I want to be able to convert コーヒ to hiragana and back. If I had naively used the full table, then ー would be converted into ゜, which wouldn't make any sense.
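To make the ー problem concrete, here is a small sketch of the naive approach, which shifts every character in the katakana block by the fixed distance between the two blocks. The function `naive_kat2hir` is a hypothetical illustration, not code from the repository.

```python
# Distance between a katakana code point and its hiragana counterpart.
KANA_OFFSET = ord('ア') - ord('あ')  # 0x60


def naive_kat2hir(text):
    """Shift every character in U+30A1..U+30FF down by KANA_OFFSET.

    This is the naive conversion: it also shifts ー (U+30FC), which has
    no hiragana counterpart, and turns it into U+309C.
    """
    return ''.join(
        chr(ord(ch) - KANA_OFFSET) if '\u30a1' <= ch <= '\u30ff' else ch
        for ch in text
    )


broken = naive_kat2hir('コーヒ')
```

The result has こ and ひ as expected, but the middle character is the sound mark U+309C rather than ー, which is why the translation tables leave that character out.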

The magic happens in these 4 lines of code:

    katakana_chart = "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヽヾ"
    hiragana_chart = "ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖゝゞ"
    hir2kat = str.maketrans(hiragana_chart, katakana_chart)
    kat2hir = str.maketrans(katakana_chart, hiragana_chart)

And it is used like so:

    mixed = 'きゃりーぱみゅぱみゅは日本の歌手です。'
    print(mixed.translate(hir2kat))
    # out: キャリーパミュパミュハ日本ノ歌手デス。

    # transliterate back and forth
    print(mixed.translate(hir2kat).translate(kat2hir))
    # out: きゃりーぱみゅぱみゅは日本の歌手です。

Notice how kanji and special characters are left alone.
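If you would rather not paste the long chart strings, the same kind of table can probably be built from code point ranges, since `str.maketrans` also accepts a dict. This is only a sketch: it assumes the charts cover the contiguous range U+3041..U+3096 plus the iteration marks U+309D and U+309E, and the names `to_katakana` and `to_hiragana` are hypothetical wrappers.

```python
# Assumption: the hiragana chart is U+3041..U+3096 plus ゝ (U+309D) and ゞ (U+309E).
HIRAGANA_POINTS = list(range(0x3041, 0x3097)) + [0x309D, 0x309E]
KANA_OFFSET = 0x60  # distance from a hiragana code point to its katakana twin

# str.maketrans accepts a dict that maps code points to code points.
hir2kat = str.maketrans({h: h + KANA_OFFSET for h in HIRAGANA_POINTS})
kat2hir = str.maketrans({h + KANA_OFFSET: h for h in HIRAGANA_POINTS})


def to_katakana(text):
    """Convert hiragana in `text` to katakana; everything else is untouched."""
    return text.translate(hir2kat)


def to_hiragana(text):
    """Convert katakana in `text` to hiragana; everything else is untouched."""
    return text.translate(kat2hir)
```

Because ー is not among the mapped code points, `to_hiragana('コーヒ')` keeps it as is, and converting back with `to_katakana` restores the original string.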

get_char_type.py

Function to determine the character class of a single character in a Japanese text. It distinguishes between 6 classes: OTHER, ROMAJI, HIRAGANA, KATAKANA, DIGIT and KANJI.

These classes can be useful as features in a machine learning classifier.
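The following is a minimal sketch of how such a classifier could look. It is not the code in get_char_type.py: the function name `char_type_sketch`, the exact code point ranges and the order of the checks are all assumptions made for illustration.

```python
def char_type_sketch(ch):
    """Return one of six class names for a single character.

    Hypothetical illustration; ranges are assumptions, not the repo's.
    """
    if len(ch) != 1:
        raise ValueError('expected a single character')
    if '0' <= ch <= '9':
        return 'DIGIT'
    if 'a' <= ch <= 'z' or 'A' <= ch <= 'Z':
        return 'ROMAJI'
    if '\u3041' <= ch <= '\u309f':
        return 'HIRAGANA'
    if '\u30a0' <= ch <= '\u30ff':
        return 'KATAKANA'
    if '\u3400' <= ch <= '\u4db5' or '\u4e00' <= ch <= '\u9fcb':
        return 'KANJI'
    return 'OTHER'


# One class label per character, e.g. as input features for a classifier.
features = [char_type_sketch(ch) for ch in 'aあア日1。']
```

A sequence of labels like `features` can then be fed to a model alongside the characters themselves.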

matplotlib_fontcheck.py

If you are working with any kind of NLP in Python that involves Japanese, it is paramount to be able to view summary statistics as graphs that in one way or another include Japanese characters.

On most systems the default font can't show kanji, kana and ASCII together, and many fonts (at least on Ubuntu) can only show either CJK script or the Latin alphabet, which is a real pain in the ass.

This script helps you figure out which font to use.

On Ubuntu the output to STDOUT should be the following:

    Droid Sans /usr/share/fonts/truetype/droid/DroidSans.ttf
    Vera /home/supermads/anaconda3/lib/python3.4/site-packages/matplotlib/mpl-data/fonts/ttf/Vera.ttf
    TakaoGothic /usr/share/fonts/truetype/takao-gothic/TakaoGothic.ttf
    TakaoPGothic /usr/share/fonts/truetype/takao-gothic/TakaoPGothic.ttf
    Liberation Sans /usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf
    ubuntu /usr/share/fonts/truetype/ubuntu-font-family/Ubuntu-R.ttf
    FreeSans /usr/share/fonts/truetype/freefont/FreeSans.ttf
    Droid Sans Japanese /usr/share/fonts/truetype/droid/DroidSansJapanese.ttf
    DejaVu Sans /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf

As you can see, I'm running Anaconda Python 3, and if Anaconda can't find a font it will fall back to its own folder and load the Vera font.

The script will also draw a plot in matplotlib, where you can see how the different fonts handle CJK characters and the alphabet.

I've found that the simplest way of changing fonts in matplotlib is to use matplotlib.rc:

    import matplotlib
    matplotlib.rc('font', family='Monospace')
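To pick a family name to pass to matplotlib.rc without eyeballing a plot, one option is to check which fonts have glyphs for both Japanese and ASCII characters. The sketch below is not what matplotlib_fontcheck.py does; the helper names are hypothetical, and `load_charmaps` assumes that `matplotlib.font_manager.fontManager.ttflist` and `matplotlib.ft2font.FT2Font.get_charmap()` behave as they do in recent matplotlib versions.

```python
def covers(charmap, text):
    """True if every character of `text` has a code point in `charmap`."""
    return all(ord(ch) in charmap for ch in text)


def fonts_covering(charmaps, text):
    """Return the sorted names of the fonts whose charmap covers `text`.

    `charmaps` maps a font name to a set of supported code points.
    """
    return sorted(name for name, cmap in charmaps.items() if covers(cmap, text))


def load_charmaps():
    """Collect {font name: set of code points} for fonts matplotlib knows.

    matplotlib is imported lazily so the helpers above work without it.
    """
    from matplotlib import font_manager, ft2font

    charmaps = {}
    for entry in font_manager.fontManager.ttflist:
        try:
            font = ft2font.FT2Font(entry.fname)
            charmaps[entry.name] = set(font.get_charmap())
        except Exception:
            continue  # skip fonts that cannot be read
    return charmaps
```

Calling `fonts_covering(load_charmaps(), 'aあア日')` should then list the font families that can render Latin letters, kana and kanji together.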
