notmecab-rs is a very basic mecab clone, designed only to do parsing, not training.
This is meant to be used as a library by other tools such as frequency analyzers, not directly by people. It also only works with UTF-8 dictionaries. (Stop using encodings other than UTF-8 for infrastructural software.)
Licensed under the Apache License, Version 2.0.
Get unidic's sys.dic, matrix.bin, unk.dic, and char.bin and put them in data/. Then invoke tests from the repository root. Unidic 2.3.0 (spoken language or written language variant, not kobun etc.) is assumed; otherwise some tests will fail.
notmecab performs marginally worse than mecab, but there are many cases where mecab fails to find the lowest-cost string of tokens, so I'm pretty sure that mecab is just cutting corners somewhere performance sensitive when searching for an ideal parse.
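To make "lowest-cost string of tokens" concrete, here is a minimal, self-contained sketch of a lowest-cost search over a toy lattice. It is not notmecab's actual implementation: the `Candidate` struct, the `cheapest_path` function, the use of context id 0 for both beginning and end of text, and all costs are assumptions made up for illustration. The total cost of a parse is taken to be the sum of each token's own cost plus a connection cost between neighbouring tokens.

```rust
// Hypothetical toy types; none of these names come from notmecab.
const BOS_EOS: usize = 0; // assumed context id for beginning/end of text

struct Candidate {
    start: usize,    // position where the token begins
    end: usize,      // position just past the token (must be > start)
    left_id: usize,  // context id seen by the previous token
    right_id: usize, // context id seen by the next token
    cost: i64,       // the token's own cost
}

fn cand(start: usize, end: usize, id: usize, cost: i64) -> Candidate {
    Candidate { start, end, left_id: id, right_id: id, cost }
}

/// Returns the total cost and the candidate indices of the cheapest
/// sequence of tokens covering positions 0..len, or None if none exists.
fn cheapest_path(cands: &[Candidate], matrix: &[Vec<i64>], len: usize) -> Option<(i64, Vec<usize>)> {
    // Visit candidates in order of start position, so every possible
    // predecessor has been scored before the tokens that follow it.
    let mut order: Vec<usize> = (0..cands.len()).collect();
    order.sort_by_key(|&i| cands[i].start);

    // best[i] = (cheapest cost of any path ending in candidate i, predecessor)
    let mut best: Vec<Option<(i64, Option<usize>)>> = vec![None; cands.len()];
    for &i in &order {
        let c = &cands[i];
        if c.start == 0 {
            best[i] = Some((matrix[BOS_EOS][c.left_id] + c.cost, None));
            continue;
        }
        for &j in &order {
            let p = &cands[j];
            if p.end != c.start {
                continue;
            }
            if let Some((prev_cost, _)) = best[j] {
                let total = prev_cost + matrix[p.right_id][c.left_id] + c.cost;
                if best[i].map_or(true, |(b, _)| total < b) {
                    best[i] = Some((total, Some(j)));
                }
            }
        }
    }

    // Pick the cheapest candidate that reaches the end of the text.
    let mut best_end: Option<(i64, usize)> = None;
    for (i, c) in cands.iter().enumerate() {
        if c.end != len {
            continue;
        }
        if let Some((cost, _)) = best[i] {
            let total = cost + matrix[c.right_id][BOS_EOS];
            if best_end.map_or(true, |(b, _)| total < b) {
                best_end = Some((total, i));
            }
        }
    }

    // Walk the predecessor links backwards to recover the path.
    let (total, mut cur) = best_end?;
    let mut path = vec![cur];
    while let Some((_, Some(prev))) = best[cur] {
        path.push(prev);
        cur = prev;
    }
    path.reverse();
    Some((total, path))
}

fn toy_candidates() -> Vec<Candidate> {
    vec![
        cand(0, 1, 1, 2),  // 0
        cand(1, 3, 2, 2),  // 1
        cand(0, 2, 2, 3),  // 2
        cand(2, 3, 1, 3),  // 3
        cand(0, 3, 1, 20), // 4
    ]
}

fn toy_matrix() -> Vec<Vec<i64>> {
    // matrix[previous right_id][next left_id]
    vec![vec![0, 0, 0], vec![0, 1, 9], vec![0, 1, 1]]
}

fn main() {
    let (cost, path) = cheapest_path(&toy_candidates(), &toy_matrix(), 3).unwrap();
    println!("cost {} via candidates {:?}", cost, path);
}
```

In this toy lattice, candidates 0 and 1 have the lowest word costs, but the connection cost between them is high, so the cheapest overall parse is candidates 2 and 3. A search that prunes too early could miss a path like that.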
There are a couple of difficult-to-use caching features designed to improve performance. You can load a matrix of connections between the most common connection edge types with prepare_fast_matrix_cache, which is for extremely large dictionaries like modern versions of unidic, or you can load the entire matrix connection cache into memory with prepare_full_matrix_cache, which is for small dictionaries like ipadic. Note that prepare_full_matrix_cache is actually slower than prepare_fast_matrix_cache for modern versions of unidic after long periods of pumping text through notmecab, though obviously prepare_full_matrix_cache is the best option for small dictionaries.
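The general idea behind a "fast" cache of this kind can be sketched as a small dense table covering only a chosen set of common edge ids, with a fallback to the slower full lookup for everything else. The sketch below is only an illustration of that idea, not notmecab's code: `FastMatrixCache`, `connection_cost`, and `toy_slow_lookup` are hypothetical names, and the real signatures of prepare_fast_matrix_cache and prepare_full_matrix_cache are not shown here.

```rust
use std::collections::HashMap;

// Hypothetical illustration only; not a type from notmecab.
struct FastMatrixCache {
    left_slots: HashMap<u16, usize>,  // left edge id -> row in the small table
    right_slots: HashMap<u16, usize>, // right edge id -> column in the small table
    right_count: usize,
    costs: Vec<i16>,                  // row-major, left_ids.len() * right_ids.len()
}

impl FastMatrixCache {
    /// Copies the costs for every (left, right) pair of the chosen ids
    /// out of the slow lookup into a small dense table.
    fn build<F: Fn(u16, u16) -> i16>(left_ids: &[u16], right_ids: &[u16], slow_lookup: F) -> Self {
        let left_slots: HashMap<u16, usize> =
            left_ids.iter().enumerate().map(|(slot, &id)| (id, slot)).collect();
        let right_slots: HashMap<u16, usize> =
            right_ids.iter().enumerate().map(|(slot, &id)| (id, slot)).collect();
        let mut costs = Vec::with_capacity(left_ids.len() * right_ids.len());
        for &left in left_ids {
            for &right in right_ids {
                costs.push(slow_lookup(left, right));
            }
        }
        FastMatrixCache { left_slots, right_slots, right_count: right_ids.len(), costs }
    }

    /// Returns the cached cost, or None if either id is not in the cache.
    fn get(&self, left: u16, right: u16) -> Option<i16> {
        let l = *self.left_slots.get(&left)?;
        let r = *self.right_slots.get(&right)?;
        Some(self.costs[l * self.right_count + r])
    }
}

/// Tries the small cache first and falls back to the slow lookup on a miss.
fn connection_cost<F: Fn(u16, u16) -> i16>(cache: &FastMatrixCache, slow_lookup: F, left: u16, right: u16) -> i16 {
    cache.get(left, right).unwrap_or_else(|| slow_lookup(left, right))
}

// Stand-in for reading a cost out of a large on-disk connection matrix.
fn toy_slow_lookup(left: u16, right: u16) -> i16 {
    (left as i16) * 10 + right as i16
}

fn main() {
    let cache = FastMatrixCache::build(&[1, 2], &[3, 4], toy_slow_lookup);
    println!("hit:  {}", connection_cost(&cache, toy_slow_lookup, 2, 3));
    println!("miss: {}", connection_cost(&cache, toy_slow_lookup, 5, 3));
}
```

A table like this stays small no matter how large the full matrix is, which is one plausible reason a partial cache can beat a full in-memory copy on very large dictionaries.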
代名詞,*,*,*,*,*,コレ,此れ,これ,コレ,これ,コレ,和,*,*,*,*,*,*,体,コレ,コレ,コレ,コレ,0,*,*,3599534815060480,13095
助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*,*,*,格助,ヲ,ヲ,ヲ,ヲ,*,"動詞%F2@0,名詞%F1,形容詞%F2@-1",*,11381878116459008,41407
動詞,一般,*,*,五段-タ行,連用形-促音便,モツ,持つ,持っ,モッ,持つ,モツ,和,*,*,*,*,*,*,用,モッ,モツ,モッ,モツ,1,C1,*,10391493084848772,37804
助詞,接続助詞,*,*,*,*,テ,て,て,テ,て,テ,和,*,*,*,*,*,*,接助,テ,テ,テ,テ,*,"動詞%F1,形容詞%F2@-1",*,6837321680953856,24874
動詞,非自立可能,*,*,五段-カ行,命令形,イク,行く,いけ,イケ,いく,イク,和,*,*,*,*,*,*,用,イケ,イク,イケ,イク,0,C2,*,470874478224161,1713
これ|を|持っ|て|いけ
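The feature strings in the output above are comma-separated, but some fields are double-quoted and contain commas themselves, so a plain split on "," would break them apart. A tool consuming these strings might split them along the lines of the sketch below. `split_features` is a hypothetical helper, not part of notmecab, and it assumes quotes only ever wrap whole fields and are never escaped.

```rust
/// Splits a feature string on commas, keeping double-quoted fields
/// (which may contain commas) together. The quotes themselves are dropped.
fn split_features(feature: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    for c in feature.chars() {
        match c {
            // Toggle quoted mode; the quote character is not kept.
            '"' => in_quotes = !in_quotes,
            // A comma outside quotes ends the current field.
            ',' if !in_quotes => fields.push(std::mem::take(&mut current)),
            // Everything else, including commas inside quotes, is field content.
            _ => current.push(c),
        }
    }
    fields.push(current);
    fields
}

fn main() {
    let feature = "助詞,接続助詞,*,*,*,*,テ,て,て,テ,て,テ,和,*,*,*,*,*,*,接助,テ,テ,テ,テ,*,\"動詞%F1,形容詞%F2@-1\",*,6837321680953856,24874";
    let fields = split_features(feature);
    println!("{} fields, first field: {}", fields.len(), fields[0]);
}
```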
You can also call parse_to_lexertoken, which does less string allocation, but you don't get the feature string as a string.