wwwcojp/ja_sentence_segmenter


Japanese sentence segmentation library for Python

wwwcojp.github.io/ja_sentence_segmenter/ja_sentence_segmenter.html

ja_sentence_segmenter

Performs rule-based sentence segmentation on Japanese text.

Getting Started

Prerequisites

Python 3.6+

Installing

pip install ja_sentence_segmenter

Usage

import functools

from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.normalize.neologd_normalizer import normalize
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation

split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
concat_tail_no = functools.partial(
    concatenate_matching,
    former_matching_rule=r"^(?P<result>.+)(の)$",
    remove_former_matched=False,
)
segmenter = make_pipeline(normalize, split_newline, concat_tail_no, split_punc2)

# Golden Rule: Simple period to end sentence #001 (from https://github.com/diasks2/pragmatic_segmenter/blob/master/spec/pragmatic_segmenter/languages/japanese_spec.rb#L6)
text1 = "これはペンです。それはマーカーです。"
print(list(segmenter(text1)))

> ["これはペンです。", "それはマーカーです。"]
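The usage example chains several steps into a single segmenter. As a rough illustration of that idea only, the stdlib-only sketch below composes functions that each take an iterable of strings and yield strings. The names `compose_steps`, `split_lines`, and `split_on_period` are hypothetical and are not part of this library; the real `make_pipeline` may handle its input and arguments differently.

```python
from typing import Callable, Iterable, Iterator

# A step consumes a stream of text chunks and yields new chunks.
Step = Callable[[Iterable[str]], Iterator[str]]


def compose_steps(*steps: Step) -> Callable[[str], Iterator[str]]:
    """Chain steps left to right; each consumes the previous step's output."""

    def run(text: str) -> Iterator[str]:
        stream: Iterable[str] = [text]
        for step in steps:
            stream = step(stream)
        return iter(stream)

    return run


def split_lines(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the non-empty lines of each chunk."""
    for chunk in chunks:
        for line in chunk.splitlines():
            if line:
                yield line


def split_on_period(chunks: Iterable[str], period: str = "。") -> Iterator[str]:
    """Yield sentences, keeping the trailing period on each one."""
    for chunk in chunks:
        start = 0
        for i, ch in enumerate(chunk):
            if ch == period:
                yield chunk[start : i + 1]
                start = i + 1
        if start < len(chunk):
            # Text after the last period is emitted as its own chunk.
            yield chunk[start:]


toy_segmenter = compose_steps(split_lines, split_on_period)
print(list(toy_segmenter("これはペンです。それはマーカーです。")))
# ['これはペンです。', 'それはマーカーです。']
```

Because every step is a generator, the chain is lazy: nothing is processed until the result is consumed, for example with `list(...)`.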

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Contributing

TODO

License

MIT

Acknowledgments

Text normalization

The text normalization code is based on the following mecab-ipadic-NEologd wiki page, with some modifications.

Thanks to hideaki-t and overlast, who provided the sample code.

https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
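To give a feel for what a normalization step does before segmentation, here is a much simplified sketch. It assumes that Unicode NFKC folding plus whitespace cleanup is an acceptable stand-in for illustration; it is not this library's `normalize` and not the code from the wiki page, both of which apply more selective rules. The function name `simple_normalize` is hypothetical.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Fold width variants with NFKC and tidy whitespace (illustrative only)."""
    # NFKC maps full-width ASCII to half-width and half-width katakana to full-width.
    folded = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs into one space, then trim both ends.
    return re.sub(r"[ \t]+", " ", folded).strip()


print(simple_normalize("ＡＢＣ　１２３  ﾃｽﾄ"))
# ABC 123 テスト
```

In this sketch, NFKC also folds the full-width "！" and "？" to ASCII "!" and "?", while "。" is left unchanged.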

Sentence segmentation rules

The sentence segmentation rules are based on the Japanese rules of Pragmatic Segmenter.

https://github.com/diasks2/pragmatic_segmenter#golden-rules-japanese
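As an illustration of what rule-based segmentation can look like, the sketch below applies two simple rules using only the standard library: a line ending in の is joined to the following line, and the text is then split after runs of 。, ! or ?. It assumes that this is roughly the intent of `concat_tail_no` and `split_punc2` in the usage example; the names `join_lines_ending_with_no` and `split_sentences` are hypothetical, and neither the library nor Pragmatic Segmenter is claimed to work this way.

```python
import re
from typing import Iterable, Iterator, List

# A sentence is text up to and including a run of terminators,
# or whatever unterminated text remains at the end of the line.
SENTENCE_END = re.compile(r"[^。!?]*[。!?]+|[^。!?]+$")


def join_lines_ending_with_no(lines: Iterable[str]) -> Iterator[str]:
    """Merge a line that ends with の into the following line."""
    pending = ""
    for line in lines:
        merged = pending + line
        if merged.endswith("の"):
            pending = merged  # hold it until the next line arrives
        else:
            pending = ""
            yield merged
    if pending:
        yield pending


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping terminal punctuation."""
    sentences: List[str] = []
    for line in join_lines_ending_with_no(text.splitlines()):
        sentences.extend(m.group(0) for m in SENTENCE_END.finditer(line))
    return sentences


print(split_sentences("これは父の\n家です。それは何ですか?"))
# ['これは父の家です。', 'それは何ですか?']
```

Joining before splitting matters here: if the split ran first, the stray newline after の would already have produced two fragments.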

In addition, the test data used in the following test code is reused in this project's tests.

https://github.com/diasks2/pragmatic_segmenter/blob/master/spec/pragmatic_segmenter/languages/japanese_spec.rb

Thanks to the author, Kevin S. Dias, and the contributors.
