sea-turt1e/yurenizer

Japanese text normalizer that resolves spelling inconsistencies.

License

Apache-2.0 license

yurenizer

This is a Japanese text normalizer that resolves spelling inconsistencies.

A Japanese README is available here (日本語のREADMEはこちら):
https://github.com/sea-turt1e/yurenizer/blob/main/README_ja.md

Overview

yurenizer is a tool for detecting and unifying variations in Japanese text notation.
For example, it can unify variations like "パソコン" (pasokon), "パーソナル・コンピュータ" (personal computer), and "パーソナルコンピュータ" into "パーソナルコンピューター".
These rules follow the Sudachi Synonym Dictionary.
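The idea of unifying variants can be illustrated with a minimal, standard-library-only sketch. This is not yurenizer's implementation (which relies on SudachiPy for morphological analysis); the table and the function name `unify_variants` are hypothetical and exist only to show the mapping from variants to a representative form.

```python
# Toy variant table: each variant maps to its representative form.
# Hypothetical data for illustration; the real tool uses the Sudachi Synonym Dictionary.
VARIANT_TO_REPRESENTATIVE = {
    "パソコン": "パーソナルコンピューター",
    "パーソナル・コンピュータ": "パーソナルコンピューター",
    "パーソナルコンピュータ": "パーソナルコンピューター",
}


def unify_variants(text: str, table: dict[str, str]) -> str:
    """Replace every known variant in `text` with its representative form."""
    # Representative forms are added as candidates that map to themselves.
    # Without this, "パーソナルコンピュータ" would match inside an already
    # normalized "パーソナルコンピューター" and produce a doubled "ー".
    candidates = {**table, **{rep: rep for rep in table.values()}}
    ordered = sorted(candidates, key=len, reverse=True)  # longest match first

    result = []
    i = 0
    while i < len(text):
        for candidate in ordered:
            if text.startswith(candidate, i):
                result.append(candidates[candidate])
                i += len(candidate)
                break
        else:
            # No candidate starts here: copy one character unchanged.
            result.append(text[i])
            i += 1
    return "".join(result)
```

Plain string matching like this has no notion of word boundaries, which is one reason a real normalizer tokenizes the text first.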

Web-based Demo

The web-based demo is no longer available.

Installation

pip install yurenizer

Download Synonym Dictionary

curl -L -o synonyms.txt https://raw.githubusercontent.com/WorksApplications/SudachiDict/refs/heads/develop/src/main/text/synonyms.txt

Usage

Quick Start

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
text = "「パソコン」は「パーソナルコンピュータ」の「synonym」で、「パーソナル・コンピュータ」と表記することもあります。"
print(normalizer.normalize(text))
# Output: 「パーソナルコンピューター」は「パーソナルコンピューター」の「シノニム」で、「パーソナルコンピューター」と表記することもあります。

Customizing Settings

You can control normalization by passing a NormalizerConfig as an argument to the normalize method.

Example with Custom Settings

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="synonyms.txt")
text = "「東日本旅客鉄道」は「JR東」や「JR-East」とも呼ばれます"
config = NormalizerConfig(
    taigen=True,
    yougen=False,
    expansion="from_another",
    unify_level="lexeme",
    other_language=False,
    alias=False,
    old_name=False,
    misuse=False,
    alphabetic_abbreviation=True,  # Normalize only alphabetic abbreviations
    non_alphabetic_abbreviation=False,
    alphabet=False,
    orthographic_variation=False,
    misspelling=False,
)
print(f"Output: {normalizer.normalize(text, config)}")
# Output: 「東日本旅客鉄道」は「JR東」や「東日本旅客鉄道」とも呼ばれます

Configuration Details

The settings in yurenizer are organized hierarchically, allowing you to control the scope and target of normalization.

1. taigen / yougen (Target Selection)

Use the taigen and yougen flags to control which parts of speech are included in the normalization.

Setting	Default Value	Description
`taigen`	`True`	Includes nouns and other substantives in the normalization. Set to `False` to exclude them.
`yougen`	`False`	Includes verbs and other predicates in the normalization. Set to `True` to include them (normalized to their lemma).

2. expansion (Expansion Flag)

The expansion flag determines how synonyms are expanded based on the synonym dictionary's internal control flags.

Value	Description
`from_another`	Expands only the synonyms with a control flag value of `0` in the synonym dictionary.
`any`	Expands all synonyms regardless of their control flag value.
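The two expansion modes amount to a filter over dictionary entries. The sketch below assumes a simplified, hypothetical `SynonymEntry` record; the real dictionary rows carry more columns, and the flag values in the sample data are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynonymEntry:
    """Hypothetical, simplified view of one synonym dictionary row."""
    headword: str
    expansion_flag: int  # control flag value read from the dictionary row


def select_expandable(entries, expansion: str):
    """Return the entries that the given expansion mode would expand."""
    if expansion == "any":
        # All synonyms, regardless of control flag.
        return list(entries)
    if expansion == "from_another":
        # Only synonyms whose control flag is 0.
        return [e for e in entries if e.expansion_flag == 0]
    raise ValueError(f"unknown expansion mode: {expansion!r}")


# Sample rows; the flag values here are made up, not taken from the dictionary.
SAMPLE_ENTRIES = [
    SynonymEntry("東日本旅客鉄道", 0),
    SynonymEntry("JR東", 2),
]
```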

3. unify_level (Normalization Level)

Specify the level of normalization with the unify_level parameter.

Value	Description
`lexeme`	Performs the most comprehensive normalization, targeting all groups (a, b, c) described below.
`word_form`	Normalizes by word form, targeting groups b and c.
`abbreviation`	Normalizes by abbreviation, targeting group c only.
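The table above can be read as a lookup from level to target groups. A minimal sketch of that lookup follows; the names `TARGET_GROUPS` and `target_groups` are hypothetical and not part of yurenizer's API.

```python
# Which setting groups each unify_level targets, as listed in the table above.
TARGET_GROUPS = {
    "lexeme": ("a", "b", "c"),
    "word_form": ("b", "c"),
    "abbreviation": ("c",),
}


def target_groups(unify_level: str) -> tuple[str, ...]:
    """Return the groups targeted by `unify_level`."""
    try:
        return TARGET_GROUPS[unify_level]
    except KeyError:
        raise ValueError(f"unknown unify_level: {unify_level!r}") from None
```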

4. Detailed Normalization Settings (a, b, c Groups)

a Group: Comprehensive Lexical Normalization

Controls normalization based on vocabulary and semantics using the following settings:

Setting	Default Value	Description
`other_language`	`True`	Normalizes non-Japanese terms (e.g., English) to Japanese. Set to `False` to disable this feature.
`alias`	`True`	Normalizes aliases. Set to `False` to disable this feature.
`old_name`	`True`	Normalizes old names. Set to `False` to disable this feature.
`misuse`	`True`	Normalizes misused terms. Set to `False` to disable this feature.

b Group: Abbreviation Normalization

Controls normalization of abbreviations using the following settings:

Setting	Default Value	Description
`alphabetic_abbreviation`	`True`	Normalizes abbreviations written in alphabetic characters. Set to `False` to disable this feature.
`non_alphabetic_abbreviation`	`True`	Normalizes abbreviations written in non-alphabetic characters (e.g., Japanese). Set to `False` to disable this feature.

c Group: Orthographic Normalization

Controls normalization of orthographic variations and errors using the following settings:

Setting	Default Value	Description
`alphabet`	`True`	Normalizes alphabetic variations. Set to `False` to disable this feature.
`orthographic_variation`	`True`	Normalizes orthographic variations. Set to `False` to disable this feature.
`misspelling`	`True`	Normalizes misspellings. Set to `False` to disable this feature.

5. custom_synonym (Custom Dictionary)

If you want to use a custom dictionary, control its behavior with the following setting:

Setting	Default Value	Description
`custom_synonym`	`True`	Enables the use of a custom dictionary. Set to `False` to disable it.

This hierarchical configuration allows for flexible normalization by defining the scope and target in detail.
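As a quick reference, the boolean settings and their documented defaults can be collected in one place. The sketch below is a stand-in dataclass, not the library's NormalizerConfig; `ConfigSketch`, `FLAG_GROUPS`, and `enabled_by_group` are hypothetical names. The expansion and unify_level settings are left out because their defaults are not stated above.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Stand-in for the documented boolean settings and their defaults."""
    # Target selection
    taigen: bool = True
    yougen: bool = False
    # a group: lexical normalization
    other_language: bool = True
    alias: bool = True
    old_name: bool = True
    misuse: bool = True
    # b group: abbreviation normalization
    alphabetic_abbreviation: bool = True
    non_alphabetic_abbreviation: bool = True
    # c group: orthographic normalization
    alphabet: bool = True
    orthographic_variation: bool = True
    misspelling: bool = True
    # Custom dictionary
    custom_synonym: bool = True


# Settings belonging to each group, in the order the tables list them.
FLAG_GROUPS = {
    "a": ("other_language", "alias", "old_name", "misuse"),
    "b": ("alphabetic_abbreviation", "non_alphabetic_abbreviation"),
    "c": ("alphabet", "orthographic_variation", "misspelling"),
}


def enabled_by_group(config: ConfigSketch) -> dict[str, list[str]]:
    """List the enabled settings of each group."""
    return {
        group: [name for name in names if getattr(config, name)]
        for group, names in FLAG_GROUPS.items()
    }
```

With every a, b, and c setting turned off except `alphabetic_abbreviation`, as in the custom settings example earlier, `enabled_by_group` reports a single enabled entry under group b.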

Custom Dictionary Specification

You can specify your own custom dictionary.
If the same word exists in both the custom dictionary and Sudachi synonym dictionary, the custom dictionary takes precedence.
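The precedence rule can be pictured as a dictionary merge in which custom entries overwrite entries from the Sudachi synonym dictionary. This is only a sketch of the rule, not the library's internals; the function name and the sample mappings are hypothetical.

```python
def merge_with_precedence(sudachi_map: dict[str, str], custom_map: dict[str, str]) -> dict[str, str]:
    """Merge two {variant: representative} maps; custom entries win on conflict."""
    merged = dict(sudachi_map)  # copy so the inputs stay untouched
    merged.update(custom_map)   # later update overrides shared keys
    return merged


# Hypothetical sample data: both sources define the variant "JR東".
SUDACHI_SAMPLE = {"JR東": "東日本旅客鉄道", "JR-East": "東日本旅客鉄道"}
CUSTOM_SAMPLE = {"JR東": "JR東日本"}
```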

Custom Dictionary Format

The custom dictionary file should be in JSON, CSV, or TSV format.

JSON file

{
  "Representative word 1": ["Synonym 1_1", "Synonym 1_2", ...],
  "Representative word 2": ["Synonym 2_1", "Synonym 2_2", ...]
}

CSV file

Representative word 1,Synonym 1_1,Synonym 1_2,...
Representative word 2,Synonym 2_1,Synonym 2_2,...

TSV file

Representative word 1	Synonym 1_1	Synonym 1_2	...
Representative word 2	Synonym 2_1	Synonym 2_2	...
...

Example

If you create a file like the one below, "幽白", "ゆうはく", and "幽☆遊☆白書" will be normalized to "幽遊白書".

JSON file

{
  "幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]
}

CSV file

幽遊白書,幽白,ゆうはく,幽☆遊☆白書

TSV file

幽遊白書	幽白	ゆうはく	幽☆遊☆白書
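To make the three formats concrete, here is a standard-library sketch that reads any of them into a single variant-to-representative mapping. It is not yurenizer's loader: the function name `load_custom_synonyms` is hypothetical, and choosing the parser by file extension is an assumption of this sketch.

```python
import csv
import json
from pathlib import Path


def load_custom_synonyms(path: str) -> dict[str, str]:
    """Return a {variant: representative} mapping from a JSON, CSV, or TSV file."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()  # assumption: the extension identifies the format

    if suffix == ".json":
        # JSON layout: {representative: [synonym, ...]}
        with file_path.open(encoding="utf-8") as f:
            groups = json.load(f)
    elif suffix in (".csv", ".tsv"):
        # CSV/TSV layout: first column is the representative, the rest are synonyms.
        delimiter = "," if suffix == ".csv" else "\t"
        with file_path.open(encoding="utf-8", newline="") as f:
            groups = {
                row[0]: row[1:]
                for row in csv.reader(f, delimiter=delimiter)
                if row  # skip blank lines
            }
    else:
        raise ValueError(f"unsupported extension: {suffix!r}")

    # Invert the grouping so each synonym points at its representative word.
    return {
        synonym: representative
        for representative, synonyms in groups.items()
        for synonym in synonyms
    }
```

All three example files above would yield the same mapping, with "幽白", "ゆうはく", and "幽☆遊☆白書" each pointing to "幽遊白書".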

How to Specify

normalizer = SynonymNormalizer(custom_synonyms_file="path/to/custom_dict_file")

Normalization Using a CSV File

You can also normalize the entries of a CSV file in bulk.

Example

JR東日本
JR東
JR-East

Normalize using CsvSynonymNormalizer as shown below.

from yurenizer import CsvSynonymNormalizer
input_file_path = "input.csv"
output_file_path = "output.csv"
csv_normalizer = CsvSynonymNormalizer(synonym_file_path="synonyms.txt")
csv_normalizer.normalize_csv(input_file_path, output_file_path)

The output.csv file will contain the following:

raw,normalized
JR東日本,東日本旅客鉄道
JR東,東日本旅客鉄道
JR-East,東日本旅客鉄道
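The input and output shapes above can be reproduced with the standard csv module. The sketch below is not the implementation of CsvSynonymNormalizer; `normalize_csv_sketch` is a hypothetical helper, and the `normalize` callable it takes stands in for whatever normalizer you use. It assumes the input has one term per row and no header.

```python
import csv
from typing import Callable


def normalize_csv_sketch(input_path: str, output_path: str, normalize: Callable[[str], str]) -> None:
    """Read one term per row and write `raw,normalized` pairs with a header row."""
    with open(input_path, encoding="utf-8", newline="") as src, \
         open(output_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["raw", "normalized"])
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            raw = row[0]
            writer.writerow([raw, normalize(raw)])
```

Passing the normalizer as a callable keeps the file handling separate from the normalization logic, so the same loop works with a lookup table in tests and with a real normalizer in practice.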

Specifying SudachiDict

The granularity of text segmentation varies depending on the type of SudachiDict. The default is "full", but you can specify "small" or "core".
To use "small" or "core", install the corresponding package and specify it in the SynonymNormalizer() arguments:

pip install sudachidict_small
# or
pip install sudachidict_core

normalizer = SynonymNormalizer(sudachi_dict="small")
# or
normalizer = SynonymNormalizer(sudachi_dict="core")

Note: Please refer to the SudachiDict documentation for details.

License

This project is licensed under the Apache License 2.0.

Open Source Software Used

Sudachi Synonym Dictionary: Apache License 2.0
SudachiPy: Apache License 2.0
SudachiDict: Apache License 2.0

This library uses SudachiPy and its dictionary SudachiDict for morphological analysis. These are also distributed under the Apache License 2.0.

For detailed license information, please check the LICENSE files of each project:

Sudachi Synonym Dictionary LICENSE (provided under the same license as the Sudachi dictionary)
SudachiPy LICENSE
SudachiDict LICENSE
