TinySegmenter -- a super-compact Japanese tokenizer -- was originally created (c) 2008 by Taku Kudo for JavaScript, under the terms of the new BSD license. For details, see here.

tinysegmenter for Python 2.x was written by Masato Hagiwara. For information about him, see here.

This tinysegmenter was modified to support both Python 3.x and Python 2.x for distribution by Tatsuro Yasukawa. Additionally, it was made faster, thanks to @chezou, @cocoatomo and @methane.

See info about tinysegmenter.
Install from PyPI:

```
pip install tinysegmenter3
```
```python
import tinysegmenter

statement = '私はpython大好きStanding Engineerです.'
tokenized_statement = tinysegmenter.tokenize(statement)
print(tokenized_statement)
# ['私', 'は', 'python', '大好き', 'Standing', ' Engineer', 'です', '.']
```
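Since `tokenize` returns a plain list of surface strings, its output composes directly with the standard library. As a minimal sketch building only on the `tinysegmenter.tokenize` call shown above (the sample text here is made up for illustration), token frequencies can be counted like this:

```python
from collections import Counter

import tinysegmenter

# Hypothetical sample text for illustration; any Japanese string works.
text = '私はpythonが好きです。pythonは楽しいです。'

# tokenize() returns a list of token strings, so standard-library
# tools such as collections.Counter apply directly.
tokens = tinysegmenter.tokenize(text)
frequencies = Counter(tokens)

# Print the five most frequent tokens with their counts.
for token, count in frequencies.most_common(5):
    print(token, count)
```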
The test text (in the `tests` directory) is The Time Machine by H. G. Wells, translated into Japanese by Hiroo Yamagata under the CC BY-SA 2.0 license.
Install the requirements from `requirements.txt`:

```
pip install -r requirements.txt
```

then run the test script:

```
./runtests.sh
```
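The repository's own suite runs through `runtests.sh`. As a rough illustration of what a segmentation check can look like (this is a standalone sketch, not taken from the project's actual tests; the expected tokens come from the usage example above), a `unittest` case might be:

```python
import unittest

import tinysegmenter


class TokenizeTest(unittest.TestCase):
    # A hypothetical check: tokenize() should split a mixed
    # Japanese/ASCII sentence into the expected surface tokens,
    # matching the example output in this README.
    def test_mixed_sentence(self):
        tokens = tinysegmenter.tokenize('私はpython大好きStanding Engineerです.')
        self.assertEqual(
            tokens,
            ['私', 'は', 'python', '大好き', 'Standing', ' Engineer', 'です', '.'],
        )


if __name__ == '__main__':
    unittest.main()
```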