Japanese text normalizer that resolves spelling inconsistencies.
sea-turt1e/yurenizer
This is a Japanese text normalizer that resolves spelling inconsistencies.
A Japanese README is available here (日本語のREADMEはこちら):
https://github.com/sea-turt1e/yurenizer/blob/main/README_ja.md
yurenizer is a tool for detecting and unifying variations in Japanese text notation.
For example, it can unify variations like "パソコン" (pasokon), "パーソナル・コンピュータ" (personal computer), and "パーソナルコンピュータ" into "パーソナルコンピューター".
These rules follow the Sudachi Synonym Dictionary.
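To make the idea of unification concrete, here is a toy sketch based on plain dictionary lookup. It is not how yurenizer works internally (the library relies on SudachiPy morphological analysis and the Sudachi Synonym Dictionary); the `GROUPS` table and the `build_unifier` helper are hypothetical names used only for illustration.

```python
import re

# Hypothetical synonym table: representative word -> variants to unify.
GROUPS = {
    "パーソナルコンピューター": [
        "パソコン",
        "パーソナル・コンピュータ",
        "パーソナルコンピュータ",
    ],
}


def build_unifier(groups):
    """Return a function that rewrites every known variant to its representative."""
    variant_to_rep = {}
    for representative, variants in groups.items():
        # The representative maps to itself so that it is matched as a whole
        # and a shorter variant cannot match inside it.
        variant_to_rep[representative] = representative
        for variant in variants:
            variant_to_rep[variant] = representative

    # Try longer alternatives first so the longest variant wins at each position.
    ordered = sorted(variant_to_rep, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(word) for word in ordered))
    return lambda text: pattern.sub(lambda m: variant_to_rep[m.group(0)], text)


unify = build_unifier(GROUPS)
print(unify("パソコンとパーソナル・コンピュータ"))
```

Plain string matching like this ignores word boundaries and parts of speech, which is one reason a tokenizer-based approach is preferable for real text.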
A web-based demo (yurenizer Web-demo) was previously available, but its publication has been stopped.
pip install yurenizer
curl -L -o synonyms.txt https://raw.githubusercontent.com/WorksApplications/SudachiDict/refs/heads/develop/src/main/text/synonyms.txt
from yurenizer import SynonymNormalizer, NormalizerConfig

normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
text = "「パソコン」は「パーソナルコンピュータ」の「synonym」で、「パーソナル・コンピュータ」と表記することもあります。"
print(normalizer.normalize(text))
# Output: 「パーソナルコンピューター」は「パーソナルコンピューター」の「シノニム」で、「パーソナルコンピューター」と表記することもあります。
You can control normalization by passing a NormalizerConfig as an argument to the normalize function.
from yurenizer import SynonymNormalizer, NormalizerConfig

normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
text = "「東日本旅客鉄道」は「JR東」や「JR-East」とも呼ばれます"
config = NormalizerConfig(
    taigen=True,
    yougen=False,
    expansion="from_another",
    unify_level="lexeme",
    other_language=False,
    alias=False,
    old_name=False,
    misuse=False,
    alphabetic_abbreviation=True,  # Normalize only alphabetic abbreviations
    non_alphabetic_abbreviation=False,
    alphabet=False,
    orthographic_variation=False,
    misspelling=False,
)
print(f"Output: {normalizer.normalize(text, config)}")
# Output: 「東日本旅客鉄道」は「JR東」や「東日本旅客鉄道」とも呼ばれます
The settings in yurenizer are organized hierarchically, allowing you to control the scope and target of normalization.
Use the taigen and yougen flags to control which parts of speech are included in the normalization.
Setting | Default Value | Description |
---|---|---|
taigen | True | Includes nouns and other substantives in the normalization. Set to False to exclude them. |
yougen | False | Includes verbs and other predicates in the normalization. Set to True to include them (normalized to their lemma). |
The expansion flag determines how synonyms are expanded based on the synonym dictionary's internal control flags.
Value | Description |
---|---|
from_another | Expands only the synonyms with a control flag value of 0 in the synonym dictionary. |
any | Expands all synonyms regardless of their control flag value. |
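As a rough illustration of what the expansion setting filters on, the sketch below parses a few rows laid out like the Sudachi Synonym Dictionary and picks the headwords that would be expanded under each value. The column positions (expansion control flag in column 2, headword in column 8) are an assumption about the dictionary format and should be checked against the Sudachi documentation. The sample rows and the `expandable_headwords` helper are made up for illustration and are not part of yurenizer.

```python
import csv
import io

# Made-up rows in the assumed layout. Only columns 0 (group ID),
# 2 (expansion control flag), and 8 (headword) matter for this sketch;
# the remaining columns are placeholders.
SAMPLE_ROWS = """\
000001,1,0,1,0,0,0,(),パーソナルコンピューター,,
000001,1,0,1,0,2,0,(),パソコン,,
000001,1,2,1,0,0,2,(),パーソナルコンピュータ,,
"""

EXPANSION_FLAG_COL = 2  # assumed position of the expansion control flag
HEADWORD_COL = 8        # assumed position of the headword


def expandable_headwords(rows_text, expansion="from_another"):
    """Return the headwords selected by the given expansion setting."""
    if expansion not in ("from_another", "any"):
        raise ValueError(f"unsupported expansion value: {expansion!r}")
    selected = []
    for row in csv.reader(io.StringIO(rows_text)):
        if not row:
            continue  # skip blank lines
        # "any" keeps every row; "from_another" keeps only flag value 0.
        if expansion == "any" or row[EXPANSION_FLAG_COL] == "0":
            selected.append(row[HEADWORD_COL])
    return selected


print(expandable_headwords(SAMPLE_ROWS, "from_another"))
print(expandable_headwords(SAMPLE_ROWS, "any"))
```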
Specify the level of normalization with the unify_level parameter.
Value | Description |
---|---|
lexeme | Performs the most comprehensive normalization, targeting all groups (a, b, c) mentioned below. |
word_form | Normalizes by word form, targeting groups b and c. |
abbreviation | Normalizes by abbreviation, targeting group c only. |
Group a: controls normalization based on vocabulary and semantics using the following settings:
Setting | Default Value | Description |
---|---|---|
other_language | True | Normalizes non-Japanese terms (e.g., English) to Japanese. Set to False to disable this feature. |
alias | True | Normalizes aliases. Set to False to disable this feature. |
old_name | True | Normalizes old names. Set to False to disable this feature. |
misuse | True | Normalizes misused terms. Set to False to disable this feature. |
Group b: controls normalization of abbreviations using the following settings:
Setting | Default Value | Description |
---|---|---|
alphabetic_abbreviation | True | Normalizes abbreviations written in alphabetic characters. Set to False to disable this feature. |
non_alphabetic_abbreviation | True | Normalizes abbreviations written in non-alphabetic characters (e.g., Japanese). Set to False to disable this feature. |
Group c: controls normalization of orthographic variations and errors using the following settings:
Setting | Default Value | Description |
---|---|---|
alphabet | True | Normalizes alphabetic variations. Set to False to disable this feature. |
orthographic_variation | True | Normalizes orthographic variations. Set to False to disable this feature. |
misspelling | True | Normalizes misspellings. Set to False to disable this feature. |
If you want to use a custom dictionary, control its behavior with the following setting:
Setting | Default Value | Description |
---|---|---|
custom_synonym | True | Enables the use of a custom dictionary. Set to False to disable it. |
This hierarchical configuration allows for flexible normalization by defining the scope and target in detail.
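One way to work with the fine-grained flags is to build the keyword arguments programmatically. The sketch below is plain Python with no yurenizer import: `FINE_GRAINED_FLAGS` lists the flag names from the tables above, and `only` is a hypothetical helper that switches every flag off except the ones named. The resulting dictionary is meant to be passed as `NormalizerConfig(**kwargs)`, on the assumption that the class accepts these names as keyword arguments in the same way as the earlier example.

```python
# Flag names taken from the settings tables for the three groups.
FINE_GRAINED_FLAGS = (
    "other_language",
    "alias",
    "old_name",
    "misuse",
    "alphabetic_abbreviation",
    "non_alphabetic_abbreviation",
    "alphabet",
    "orthographic_variation",
    "misspelling",
)


def only(*enabled):
    """Return keyword arguments with just the named flags set to True."""
    unknown = set(enabled) - set(FINE_GRAINED_FLAGS)
    if unknown:
        raise ValueError(f"unknown flag(s): {sorted(unknown)}")
    return {name: name in enabled for name in FINE_GRAINED_FLAGS}


kwargs = only("alphabetic_abbreviation")
print(kwargs)
# Intended use (requires yurenizer):
#   config = NormalizerConfig(unify_level="lexeme", **kwargs)
```

Rejecting unknown names up front catches typos such as `alphabet_abbreviation`, which would otherwise silently leave every flag disabled.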
You can specify your own custom dictionary.
If the same word exists in both the custom dictionary and Sudachi synonym dictionary, the custom dictionary takes precedence.
The custom dictionary file should be in JSON, CSV, or TSV format.
- JSON file
{
  "Representative word 1": ["Synonym 1_1", "Synonym 1_2", ...],
  "Representative word 2": ["Synonym 2_1", "Synonym 2_2", ...]
}
- CSV file
Representative word 1,Synonym 1_1,Synonym 1_2,...
Representative word 2,Synonym 2_1,Synonym 2_2,...
- TSV file (tab-separated)
Representative word 1	Synonym 1_1	Synonym 1_2	...
Representative word 2	Synonym 2_1	Synonym 2_2	...
...
If you create a file like the one below, "幽白", "ゆうはく", and "幽☆遊☆白書" will be normalized to "幽遊白書".
- JSON file
{
  "幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]
}
- CSV file
幽遊白書,幽白,ゆうはく,幽☆遊☆白書
- TSV file (tab-separated)
幽遊白書	幽白	ゆうはく	幽☆遊☆白書
normalizer = SynonymNormalizer(custom_synonyms_file="path/to/custom_dict_file")
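If the synonym groups already live in Python, the three file formats can be produced with the standard library. This sketch assumes the layouts shown above (a JSON object mapping each representative word to its synonyms, or one row per group with the representative word first). `write_custom_dict` is a hypothetical helper, not part of yurenizer, and choosing the format from the file extension is a convention of this sketch only.

```python
import csv
import json
from pathlib import Path


def write_custom_dict(groups, path):
    """Write {representative: [synonyms, ...]} as JSON, CSV, or TSV."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        # ensure_ascii=False keeps Japanese text readable in the file.
        path.write_text(
            json.dumps(groups, ensure_ascii=False, indent=2), encoding="utf-8"
        )
    elif suffix in (".csv", ".tsv"):
        delimiter = "," if suffix == ".csv" else "\t"
        with path.open("w", encoding="utf-8", newline="") as f:
            writer = csv.writer(f, delimiter=delimiter)
            for representative, synonyms in groups.items():
                # Representative word first, then its synonyms.
                writer.writerow([representative, *synonyms])
    else:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return path


EXAMPLE_GROUPS = {"幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]}
```

Calling `write_custom_dict(EXAMPLE_GROUPS, "custom_dict.tsv")` would then produce a file whose path can be given as `custom_synonyms_file`.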
You can also normalize text from a CSV file. For example, given an input.csv that contains:
JR東日本
JR東
JR-East
Normalize using CsvSynonymNormalizer as shown below.
from yurenizer import CsvSynonymNormalizer

input_file_path = "input.csv"
output_file_path = "output.csv"
csv_normalizer = CsvSynonymNormalizer(synonym_file_path="synonyms.txt")
csv_normalizer.normalize_csv(input_file_path, output_file_path)
The output.csv file will be written as follows.
raw,normalized
JR東日本,東日本旅客鉄道
JR東,東日本旅客鉄道
JR-East,東日本旅客鉄道
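To review what a run changed, the output can be read back with the standard `csv` module. The sketch below assumes the `raw,normalized` header shown above and uses an in-memory copy of that sample in place of the real file; `changed_rows` is a hypothetical helper.

```python
import csv
import io

# In-memory stand-in for output.csv, copied from the sample above.
SAMPLE_OUTPUT = """\
raw,normalized
JR東日本,東日本旅客鉄道
JR東,東日本旅客鉄道
JR-East,東日本旅客鉄道
"""


def changed_rows(csv_file):
    """Return (raw, normalized) pairs where normalization altered the text."""
    return [
        (row["raw"], row["normalized"])
        for row in csv.DictReader(csv_file)
        if row["raw"] != row["normalized"]
    ]


changes = changed_rows(io.StringIO(SAMPLE_OUTPUT))
for raw, normalized in changes:
    print(f"{raw} -> {normalized}")
```

With a real file, pass `open("output.csv", encoding="utf-8", newline="")` instead of the `io.StringIO` object.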
The length of the segments that text is split into varies depending on the type of SudachiDict. The default is "full", but you can specify "small" or "core".
To use "small" or "core", install the corresponding package and specify it in the SynonymNormalizer() arguments:
pip install sudachidict_small
# or
pip install sudachidict_core
normalizer = SynonymNormalizer(sudachi_dict="small")
# or
normalizer = SynonymNormalizer(sudachi_dict="core")
※ Please refer to the SudachiDict documentation for details.
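Before choosing a value for `sudachi_dict`, it can help to check which dictionary packages are actually installed. The sketch below uses only the standard library and assumes the import names `sudachidict_small`, `sudachidict_core`, and `sudachidict_full`, matching the pip package names; `installed_sudachi_dicts` is a hypothetical helper.

```python
from importlib.util import find_spec


def installed_sudachi_dicts(kinds=("small", "core", "full")):
    """Map each dictionary kind to whether its package can be imported."""
    # find_spec returns None when a top-level module is not installed.
    return {kind: find_spec(f"sudachidict_{kind}") is not None for kind in kinds}


available = installed_sudachi_dicts()
print(available)
```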
This project is licensed under the Apache License 2.0.
- Sudachi Synonym Dictionary: Apache License 2.0
- SudachiPy: Apache License 2.0
- SudachiDict: Apache License 2.0
This library uses SudachiPy and its dictionary SudachiDict for morphological analysis. These are also distributed under the Apache License 2.0.
For detailed license information, please check the LICENSE files of each project:
- Sudachi Synonym Dictionary LICENSE (※ Provided under the same license as the Sudachi dictionary.)
- SudachiPy LICENSE
- SudachiDict LICENSE