cocodrips/negima

License

MIT license

Negima

Negima is a Python package that extracts phrases from Japanese text using part-of-speech rules that you define.

Installing

Install and update using pip:

$ pip install -U negima

Install using setup.py:

$ python setup.py install

Dependencies

mecab: http://taku910.github.io/mecab/

A Simple Example

sample.py

    from negima import MorphemeMerger

    mm = MorphemeMerger()

    # csv
    mm.set_rule_from_csv('rules/1_noun.csv')

    # tsv
    # mm.set_rule_from_csv('rules/1_noun.tsv', sep='\t')

    # excel
    # mm.set_rule_from_excel('rules/rules.xlsx', sheet_name='1_noun')

    # get_rule_pattern returns two values; the first is the list of extracted phrases
    words, _ = mm.get_rule_pattern('今日はいい天気')
    print(words)

    $ python sample.py
    ['今日', '天気']

Rule

You can define rules in CSV, TSV, or Excel format.
A rule file requires the following 9 columns.
Each row defines the part of speech of one morpheme.

id
- A rule starts at a row with a non-empty id column.
- Each id has to be unique.
- Rules are applied in ascending order of id, with ids sorted as strings.
  ex: id:000_XXX has priority over id:999_ZZZ
min
- Minimum repeat number. 0 means that the morpheme is optional.
- default=1
max
- Maximum repeat number.
- default=1
pos0, pos1, pos2, pos3, pos4, pos5
- Parts of speech and conjugation names of the morphemes parsed by mecab.
  - pos0: part of speech (ex: 名詞)
  - pos1: part-of-speech subdivision 1 (ex: 副詞可能)
  - pos2: part-of-speech subdivision 2
  - pos3: part-of-speech subdivision 3
  - pos4: conjugation 1
  - pos5: conjugation 2
- To represent an OR condition, concatenate parts of speech with | as a separator.
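Because ids are ordered as strings, numeric prefixes of different widths may not sort the way you expect. The sketch below uses only the Python standard library and does not call negima; the rule ids in it are hypothetical. It shows the pitfall and one way around it (zero-padding the numeric prefix).

```python
# Hypothetical rule ids; only 000_XXX and 999_ZZZ appear in the text above.
rule_ids = ["2_nouns", "10_verbs", "000_XXX", "999_ZZZ"]

# String comparison goes character by character, so "10_verbs" sorts before "2_nouns".
as_strings = sorted(rule_ids)
print(as_strings)
# ['000_XXX', '10_verbs', '2_nouns', '999_ZZZ']

# Zero-padding the numeric prefix makes string order agree with numeric order.
padded = sorted(
    f"{int(number):03d}_{name}"
    for number, name in (rule_id.split("_", 1) for rule_id in rule_ids)
)
print(padded)
# ['000_XXX', '002_nouns', '010_verbs', '999_ZZZ']
```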

You can add arbitrary columns to your rule file; any other columns are ignored. An example is available at rule/3_independent_phrase.csv, which has an example column that gives a sample phrase matching each rule.
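As a rough illustration of the file layout, the following sketch writes a rule file with the nine required columns plus an extra example column, then checks that no required column is missing from the header. It uses only the standard library; the file name my_rule.csv, the id 000_noun, and the single rule row are hypothetical, and UTF-8 encoding is an assumption.

```python
import csv
import os
import tempfile

REQUIRED = ["id", "min", "max", "pos0", "pos1", "pos2", "pos3", "pos4", "pos5"]
EXTRA = ["example"]  # free-form column; extra columns are described as ignored

# One hypothetical rule made of a single row: exactly one noun.
rows = [
    {"id": "000_noun", "min": 1, "max": 1, "pos0": "名詞", "example": "場所"},
]

rule_path = os.path.join(tempfile.mkdtemp(), "my_rule.csv")
with open(rule_path, "w", newline="", encoding="utf-8") as f:
    # restval="" leaves the unspecified pos columns empty.
    writer = csv.DictWriter(f, fieldnames=REQUIRED + EXTRA, restval="")
    writer.writeheader()
    writer.writerows(rows)

# Read the header back and confirm all required columns are present.
with open(rule_path, newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))
missing = [column for column in REQUIRED if column not in header]
print(missing)
# []
```

A file written this way could then be passed to set_rule_from_csv as in sample.py.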

Simple rule (csv)

A rule like the following extracts compound nouns.

id	min	max	pos0	pos1
1	0	2	接頭詞	
	1	4	名詞	一般|サ変接続|数
	0	2	名詞	接尾
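To make the min, max, and | semantics concrete, here is a minimal sketch of a greedy matcher applied to the compound-noun rule above. It is not negima's implementation: the token tags are hand-written for illustration rather than real mecab output, an empty cell is assumed to match any value, and the names RULE, TOKENS, cell_matches, match_at, and extract are hypothetical.

```python
from typing import List, Optional, Tuple

# (min, max, pos0, pos1) for each row of the compound-noun rule.
# Assumption of this sketch: an empty cell matches any value.
RULE = [
    (0, 2, "接頭詞", ""),
    (1, 4, "名詞", "一般|サ変接続|数"),
    (0, 2, "名詞", "接尾"),
]

# (surface, pos0, pos1) triples, tagged by hand for illustration only.
TOKENS: List[Tuple[str, str, str]] = [
    ("約", "接頭詞", "数接続"),
    ("5000", "名詞", "数"),
    ("人", "名詞", "接尾"),
    ("が", "助詞", "格助詞"),
    ("国立", "名詞", "一般"),
    ("競技", "名詞", "一般"),
    ("場", "名詞", "接尾"),
    ("に", "助詞", "格助詞"),
]


def cell_matches(cell: str, value: str) -> bool:
    """True if the cell is empty or value is one of its '|'-separated options."""
    return cell == "" or value in cell.split("|")


def match_at(tokens: List[Tuple[str, str, str]], start: int) -> Optional[int]:
    """Greedily match RULE from tokens[start]; return the end index or None."""
    pos = start
    for lo, hi, pos0, pos1 in RULE:
        count = 0
        while (count < hi and pos < len(tokens)
               and cell_matches(pos0, tokens[pos][1])
               and cell_matches(pos1, tokens[pos][2])):
            pos += 1
            count += 1
        if count < lo:  # this row needs at least `lo` repetitions
            return None
    return pos if pos > start else None


def extract(tokens: List[Tuple[str, str, str]]) -> List[str]:
    """Scan left to right and join the surfaces of every matched span."""
    phrases, i = [], 0
    while i < len(tokens):
        end = match_at(tokens, i)
        if end is None:
            i += 1
        else:
            phrases.append("".join(surface for surface, _, _ in tokens[i:end]))
            i = end
    return phrases


print(extract(TOKENS))
# ['約5000人', '国立競技場']
```

This greedy scan has no backtracking, which is enough here because the second and third rows cannot match the same token.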

Caution: Don't insert an empty row between rules.
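One way to guard against this is to scan a rule file for blank rows before loading it. The helper below is a hypothetical sketch using only the standard library, not part of negima; it treats a row as blank when it has no cells or when every cell is empty, and the sample text is made up for the demonstration.

```python
import csv
import io
from typing import List


def find_blank_rows(csv_text: str) -> List[int]:
    """Return 1-based line numbers of rows in which every cell is empty."""
    reader = csv.reader(io.StringIO(csv_text))
    blank = []
    for row in reader:
        if not any(cell.strip() for cell in row):
            blank.append(reader.line_num)
    return blank


# Made-up rule text with a truly empty line (3) and a comma-only line (5).
sample = "id,min,max,pos0\n1,1,1,名詞\n\n2,1,1,動詞\n,,,\n"
blank_lines = find_blank_rows(sample)
print(blank_lines)
# [3, 5]
```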

Rule samples

rule/1_noun.csv

Extract nouns.

約5000人が国立競技場に駆けつけた -> 5000人, 国立競技場
場所がわかりにくいのでたどり着けなかった -> 場所

rule/2_nouns.csv

Extract compound nouns.

約5000人が国立競技場に駆けつけた -> 約5000人, 国立競技場
場所がわかりにくいのでたどり着けなかった -> 場所

rule/3_independent_phrase.csv

Extract slightly more complex phrases, including adjectives and the negative ない.

新人研修のレベルは高い -> 新人研修, レベルは高い
あのサイトはホテルの比較がしやすくないので好きではない -> サイト, ホテル, 比較がしやすくない, 好きではない


Releases

0.1.3 (latest), Aug 19, 2018
