Uploaded by eiennohito (PPTX, PDF)

Juman++ v2: A Practical and Modern Morphological Analyzer

Juman++ v2 is a fast and accurate morphological analyzer for Japanese. It combines a linear model with a recurrent-neural-network language model to segment text into words and assign part-of-speech tags. Version 2 improves speed by optimizing the dictionary representation and by reducing the search space with "global beam" pruning, and it achieves state-of-the-art accuracy on standard datasets with a smaller model size than prior versions. Future work will focus on improving performance on informal language and on integration with other Japanese language resources.

Juman++ v2: A Practical and Modern Morphological Analyzer
Arseny Tolmachev
Kyoto University, Kurohashi-Kawahara lab
2018-03-14
#jumanpp
github.com/ku-nlp/jumanpp
What is Juman++?

Example: 外国人参政権 [Morita+ EMNLP2015]
• Implausible segmentation: 外国 (foreign) | 人参 (carrot) | 政権 (regime), with p(外国, 人参, 政権) = 0.252 × 10^-8
• Correct segmentation: 外国 (foreign) | 人 (person) | 参政 (vote) | 権 (right), with p(外国, 人, 参政, 権) = 0.761 × 10^-7

Main idea: use a Recurrent Neural Network Language Model to consider semantic plausibility in addition to the usual model score.
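The idea above can be sketched in a few lines. The per-token probabilities below are illustrative stand-ins (the real system scores paths with an RNNLM, not independent per-token probabilities), chosen only to show how a language model separates the two readings:

```python
import math

# Toy sketch: score two candidate segmentations of 外国人参政権 with a
# language model. Probabilities are made-up illustrative values; the point
# is that the product of per-token probabilities (sum of logs) penalizes
# the implausible "foreign | carrot | regime" reading.
def path_log_prob(token_probs):
    # log p(path) = sum of per-token log-probabilities
    return sum(math.log(p) for p in token_probs)

wrong   = [0.4, 0.00003, 0.002]       # 外国 | 人参 | 政権 (foreign | carrot | regime)
correct = [0.4, 0.9, 0.002, 0.1]      # 外国 | 人 | 参政 | 権 (foreign | person | vote | right)
```

Both segmentations are valid sequences of dictionary words, so a purely lexical score cannot reliably separate them; the language-model term can.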
Why Juman++? Accuracy!

Setting:
• Dataset for training the RNNLM: 10 million raw sentences crawled from the web
• Dataset for training the base model and for evaluation: Kyoto University Text Corpus (NEWS), Kyoto University Web Document Leads Corpus (WEB)

Example: JUMAN segments 感想|やご|要望, reading やご as "a larva of a dragonfly"; Juman++ correctly produces 感想 | や | ご | 要望 (impression | and | honorific prefix | request).

[Figure: word segmentation & POS tagging F1 (roughly 97.4 to 98.6) for JUMAN, MeCab, and Juman++; a second chart shows Juman++ makes far fewer fatal errors per 1,000 sentences than JUMAN.]
Why not Juman++ V1? Speed!

Analyzer      Analysis speed (sentences/second)
MeCab         53,000
KyTea         2,000
JUMAN         8,800
Juman++ V1    16
Juman++ V2    4,800

V2 is roughly 250x faster than V1.
Why Juman++ V2? Accuracy!!!

[Figure: F1 (95 to 100) on Kyoto (Seg + POS) and Leads (Seg + POS), comparing JUMAN, KyTea, MeCab*, J++ V1, and J++ V2*.]

* = hyper-parameters optimized via 10-fold cross-validation.
All systems use the same Jumandic plus the concatenation of the Kyoto/KWDLC corpora for training.
What is Juman++ v2 (at its core)?

A dictionary-independent, thread-safe library (not just a binary program) for morphological analysis using lattice-based segmentation, optimized for n-best output.
Reasons for the speedup

Algorithmic:
• Dictionary representation
• No unnecessary computations
• Reduced search space

(Micro)architectural:
• Cache-friendly data structures (lattice => struct of arrays)
• Code generation
• Mostly branch-free feature extraction
• Weight prefetching
• RNN-specific: vectorization, batching
Juman++ Model

For an input sentence: build a lattice, then assign a score to each path through the lattice.

s = Σ_i w_i φ_i + α(s_RNN + β)

The linear model contributes the weighted feature sum Σ_i w_i φ_i; the RNN language-model score s_RNN is shifted by β and scaled by α.
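The scoring formula translates directly into code. The feature names, weights, and parameter values below are hypothetical, purely for illustration:

```python
# s = sum_i w_i * phi_i + alpha * (s_RNN + beta)
# For binary indicator features (phi_i in {0, 1}), the linear part is just
# the sum of the weights of the features that fired on this path.
def path_score(fired_features, weights, rnn_score, alpha, beta):
    linear = sum(weights.get(f, 0.0) for f in fired_features)
    return linear + alpha * (rnn_score + beta)

# Hypothetical weights and RNN score:
weights = {"BIGRAM:N,N": 0.5, "TRIGRAM:N,N,V": -0.2}
s = path_score(["BIGRAM:N,N", "TRIGRAM:N,N,V"], weights,
               rnn_score=-3.0, alpha=0.1, beta=1.0)
```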
Linear Model Features

Created from n-gram feature templates (1-, 2-, and 3-grams) using dictionary information, surface characters, and character type. There are 67 n-gram feature templates for Jumandic, for example (品詞 = POS, 細分類 = sub-POS):
• BIGRAM (POS)(POS)
• BIGRAM (POS-subPOS)(POS-subPOS)
• TRIGRAM (POS)(POS)(POS)
...
Dictionary as a column database

Field data is deduplicated and each field is stored separately; dictionary field values become (pointer, length) references into the deduplicated data.
Dictionary: Details
• The dictionary is mmap-able: loading from disk is almost free, and the OS can cache it.
• What the dictionary gives us:
  • Handle strings as integers (data pointers)
  • Use data pointers as primitive features
  • Compute n-gram features by hashing components together
• Dictionary + smart hashing implementation = 8x speedup
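Computing n-gram features by hashing components together might look like the sketch below. The FNV-1a-style mixing constants are a stand-in; this is not Juman++'s actual hash function:

```python
# Each dictionary field value is an integer data pointer, so an n-gram
# feature can be reduced to a single 64-bit key by hashing the feature
# template id together with the component pointers.
# Constants are FNV-1a; illustrative, not Juman++'s real hash.
FNV_PRIME = 0x100000001B3
FNV_OFFSET = 0xCBF29CE484222325
MASK64 = (1 << 64) - 1

def hash_feature(template_id, *component_ptrs):
    h = FNV_OFFSET
    for x in (template_id, *component_ptrs):
        h = ((h ^ x) * FNV_PRIME) & MASK64
    return h
```

A bigram POS feature then becomes one integer, e.g. `hash_feature(template_id, pos_ptr_left, pos_ptr_right)`, usable directly as an index into the weight array instead of a tuple of strings.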
Dictionary/model size on Jumandic

            KyTea    MeCab    Juman++ V1   Juman++ V2
Dictionary  -        311 MB   445 MB       158 MB
Model       200 MB   7.7 MB   135 MB       16 MB

The raw dictionary (CSV with expanded inflections) is 256 MB. KyTea does not store all dictionary information (it uses only the surface, POS, and sub-POS fields). Note: 44% of the V2 dictionary is the DARTS trie.
Quiz: Where is the bottleneck?
1. Dictionary lookup / lattice construction: trie lookup, dictionary access
2. Feature computation: hashing, many conditionals
3. Score computation: score += weights[feature & (length - 1)];
4. Output: formatting, string concatenation
Answer: 3, score computation. The array access (not the sum) was taking ~80% of the total time. Reason: L2 cache and dTLB misses.
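The hot line can be reproduced in a few lines of sketch code. The power-of-two table size is what makes the `& (length - 1)` trick equivalent to a modulo:

```python
# Score computation sketch: each hashed feature indexes a power-of-two-sized
# weight array via a bit-mask. Every weights[f & mask] lookup is effectively
# a random memory access, which is why this line dominated the profile with
# L2 cache and dTLB misses.
def score_features(feature_hashes, weights):
    length = len(weights)
    assert length & (length - 1) == 0, "weights length must be a power of two"
    mask = length - 1
    return sum(weights[f & mask] for f in feature_hashes)

weights = [0.0, 0.5, -0.25, 0.0, 1.0, 0.0, 0.0, 0.125]  # length 8 = 2**3
```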
Latency numbers every programmer should know

Action                  Time     Comments
L1 cache reference      0.5 ns
Branch mispredict       5 ns
L2 cache reference      7 ns     14x L1 cache
Main memory reference   100 ns   20x L2 cache, 200x L1 cache

Cache misses are really slow, and a random memory access is almost guaranteed to be a cache miss. Direction: reduce the number of memory accesses -> reduce the number of score computations.
Lattice and terms

[Figure: a lattice boundary.] Right nodes begin on the boundary (also called t0); left nodes end on the boundary (also called t1); the nodes before those form the context (t2).
Beams in Juman++

Each t1 node keeps only a small number of top-scoring paths from t2. Using beams is crucial for the correct usage of high-order features.
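The per-node beam can be sketched as keeping the k best-scoring incoming paths; the names here are illustrative, not the actual implementation:

```python
import heapq

# Each t1 node keeps only the top-scoring paths arriving from its t2
# predecessors, so high-order (e.g. trigram) features still see the relevant
# history without enumerating every path through the lattice.
def local_beam(incoming, beam_size):
    """incoming: list of (score, path) pairs; keep the beam_size best by score."""
    return heapq.nlargest(beam_size, incoming, key=lambda sp: sp[0])
```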
New in V2: Search space trimming

There can be a huge number of paths through the lattice, and most of the candidate paths cannot be even remotely correct.
Global Beam: Left

Instead of considering all paths ending on a boundary, consider only the top k paths ending on the current boundary.
Global Beam: Right

Use the top m left paths to compute the scores of right nodes. Compute the connections to the remaining (k - m) left paths only for the top l right nodes (m = 1, l = 1 in the figure).

Idea: don't consider nodes which don't make sense in context.
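The pruning pattern can be sketched as follows. This is an illustrative simplification with l = 1 and a pre-ranked list of right nodes; the real implementation works on the lattice structures directly:

```python
# Global beam sketch: keep only the top-k left paths on the boundary; the
# top-m of them connect to every right node, and the remaining (k - m) left
# paths connect only to the single best right node (l = 1 for brevity).
def global_beam_pairs(left_scores, right_nodes, k, m):
    order = sorted(range(len(left_scores)),
                   key=left_scores.__getitem__, reverse=True)
    top_k = order[:k]
    pairs = [(i, r) for i in top_k[:m] for r in right_nodes]
    # right_nodes is assumed pre-ranked, so right_nodes[0] is the best one
    pairs += [(i, right_nodes[0]) for i in top_k[m:]]
    return pairs
```

With 4 left paths and 3 right nodes, full expansion would evaluate 12 connections; with k = 3, m = 1 the sketch evaluates only 5.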
Effects of global beams
• Almost every sentence has situations with 20-30 left/right nodes.
• Juman++ v2 default settings: local beam size = 5, left beam size = 6, right beam size = 1/5.
• Considering far fewer candidates yields a ~4x total speedup.
• Accuracy considerations:
  • Use surface characters after the current token as features: the ranking procedure should (at least) keep sensible paths in the right beam.
  • During training, use smaller global beam sizes.
Seg+POS F1 under different global beam parameters

[Figure: train the model with one set of global beam parameters, test with another. ρ is the size of both the left and the right beam; one configuration uses no global beam.]
Juman++ V2 RNNLM
• The V1 RNNLM implementation was non-batched and non-vectorizable.
• The V2 RNNLM:
  • uses Eigen for vectorization,
  • batches all computations,
  • deduplicates paths with identical states.
• Finally, we evaluate only the paths which have remained in the EOS beam: the RNN is essentially used to reorder paths.
• Opinion: evaluating the whole lattice is a waste of energy.
• Help wanted for an LSTM language model implementation (currently Mikolov's RNNLM).
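The rescoring step can be sketched as reordering the surviving EOS-beam paths by the combined score. Here `rnn_log_prob` stands in for the actual RNNLM, and the parameter values are hypothetical:

```python
# V2 runs the RNN only over the n-best paths left in the EOS beam, then uses
# the combined score s = linear + alpha * (s_RNN + beta) to reorder them.
def rerank(eos_beam, rnn_log_prob, alpha, beta):
    """eos_beam: list of (linear_score, path) pairs; returns paths, best first."""
    scored = [(linear + alpha * (rnn_log_prob(path) + beta), path)
              for linear, path in eos_beam]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [path for _, path in scored]
```

Because the RNN sees only a handful of candidates instead of the whole lattice, its cost stays roughly proportional to the beam size rather than to the lattice size.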
Future work
• Improve robustness on informal Japanese (web text)
• Create Jumanpp-Unidic
• Use it to bootstrap correct readings in our annotated corpora:
  • Kyoto Corpus
  • Kyoto Web Document Leads Corpus
Conclusion
• A fast morphological analyzer based on trigram features and an RNNLM
• Usable as a library in multithreaded environments
• Not hardwired to Jumandic! Can be used with different dictionaries/standards
• SOTA F1 on Jumandic
• Smallest model size (when used without the RNN)
• Can use surface features (character / character type)
• Can train with partially annotated data

github.com/ku-nlp/jumanpp
#jumanpp


