Uploaded by eiennohito (PPTX, PDF)

Juman++ v2: A Practical and Modern Morphological Analyzer

Juman++ v2 is a fast and accurate morphological analyzer for Japanese. It combines a linear model with a recurrent-neural-network language model to segment text into words and assign part-of-speech tags. Version 2 improves speed by optimizing the dictionary representation and by reducing the search space with "global beam" pruning, and it achieves state-of-the-art accuracy on standard datasets with a smaller model size than prior versions. Future work will focus on improving performance on informal language and on integration with other Japanese language resources.

Juman++ v2: A Practical and Modern Morphological Analyzer
Arseny Tolmachev
Kyoto University, Kurohashi-Kawahara lab
2018-03-14
#jumanpp
github.com/ku-nlp/jumanpp
What is Juman++?

Example: 外国人参政権 [Morita+ EMNLP2015]
• Implausible segmentation: 外国 (foreign) | 人参 (carrot) | 政権 (regime), with p(外国, 人参, 政権) = 0.252 × 10^-8
• Correct segmentation: 外国 (foreign) | 人 (person) | 参政 (vote) | 権 (right), with p(外国, 人, 参政, 権) = 0.761 × 10^-7

Main idea: use a Recurrent Neural Network Language Model to consider semantic plausibility in addition to the usual model score.
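The idea above can be sketched in a few lines. The per-token probabilities below are illustrative stand-ins (the real system scores paths with an RNNLM, not independent per-token probabilities), chosen only to show how a language model separates the two readings:

```python
import math

# Toy sketch: score two candidate segmentations of 外国人参政権 with a
# language model. Probabilities are made-up illustrative values; the point
# is that the product of per-token probabilities (sum of logs) penalizes
# the implausible "foreign | carrot | regime" reading.
def path_log_prob(token_probs):
    # log p(path) = sum of per-token log-probabilities
    return sum(math.log(p) for p in token_probs)

wrong   = [0.4, 0.00003, 0.002]       # 外国 | 人参 | 政権 (foreign | carrot | regime)
correct = [0.4, 0.9, 0.002, 0.1]      # 外国 | 人 | 参政 | 権 (foreign | person | vote | right)
```

Both segmentations are valid sequences of dictionary words, so a purely lexical score cannot reliably separate them; the language-model term can.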
Why Juman++? Accuracy!

Setting:
• Dataset for training the RNNLM: 10 million raw sentences crawled from the web
• Dataset for training the base model and for evaluation: Kyoto University Text Corpus (NEWS), Kyoto University Web Document Leads Corpus (WEB)

Example: JUMAN segments 感想|やご|要望, reading やご as "a larva of a dragonfly"; Juman++ correctly produces 感想 | や | ご | 要望 (impression | and | honorific prefix | request).

[Figure: word segmentation & POS tagging F1 (roughly 97.4 to 98.6) for JUMAN, MeCab, and Juman++; a second chart shows Juman++ makes far fewer fatal errors per 1,000 sentences than JUMAN.]
Why not Juman++ V1? Speed!

Analyzer      Analysis speed (sentences/second)
MeCab         53,000
KyTea         2,000
JUMAN         8,800
Juman++ V1    16
Juman++ V2    4,800

V2 is roughly 250x faster than V1.
Why Juman++ V2? Accuracy!!!

[Figure: F1 (95 to 100) on Kyoto (Seg + POS) and Leads (Seg + POS), comparing JUMAN, KyTea, MeCab*, J++ V1, and J++ V2*.]

* = hyper-parameters optimized via 10-fold cross-validation.
All systems use the same Jumandic plus the concatenation of the Kyoto/KWDLC corpora for training.
What is Juman++ v2 (at its core)?

A dictionary-independent, thread-safe library (not just a binary program) for morphological analysis using lattice-based segmentation, optimized for n-best output.
Reasons for the speedup

Algorithmic:
• Dictionary representation
• No unnecessary computations
• Reduced search space

(Micro)architectural:
• Cache-friendly data structures (lattice => struct of arrays)
• Code generation
• Mostly branch-free feature extraction
• Weight prefetching
• RNN-specific: vectorization, batching
Juman++ Model

For an input sentence: build a lattice, then assign a score to each path through the lattice.

s = Σ_i w_i φ_i + α(s_RNN + β)

The linear model contributes the weighted feature sum Σ_i w_i φ_i; the RNN language-model score s_RNN is shifted by β and scaled by α.
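The scoring formula translates directly into code. The feature names, weights, and parameter values below are hypothetical, purely for illustration:

```python
# s = sum_i w_i * phi_i + alpha * (s_RNN + beta)
# For binary indicator features (phi_i in {0, 1}), the linear part is just
# the sum of the weights of the features that fired on this path.
def path_score(fired_features, weights, rnn_score, alpha, beta):
    linear = sum(weights.get(f, 0.0) for f in fired_features)
    return linear + alpha * (rnn_score + beta)

# Hypothetical weights and RNN score:
weights = {"BIGRAM:N,N": 0.5, "TRIGRAM:N,N,V": -0.2}
s = path_score(["BIGRAM:N,N", "TRIGRAM:N,N,V"], weights,
               rnn_score=-3.0, alpha=0.1, beta=1.0)
```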
Linear Model Features

Created from n-gram feature templates (1-, 2-, and 3-grams) using dictionary information, surface characters, and character type. There are 67 n-gram feature templates for Jumandic, for example (品詞 = POS, 細分類 = sub-POS):
• BIGRAM (POS)(POS)
• BIGRAM (POS-subPOS)(POS-subPOS)
• TRIGRAM (POS)(POS)(POS)
...
Dictionary as a column database

Field data is deduplicated and each field is stored separately; dictionary field values become (pointer, length) references into the deduplicated data.
Dictionary: Details
• The dictionary is mmap-able: loading from disk is almost free, and the OS can cache it.
• What the dictionary gives us:
  • Handle strings as integers (data pointers)
  • Use data pointers as primitive features
  • Compute n-gram features by hashing components together
• Dictionary + smart hashing implementation = 8x speedup
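Computing n-gram features by hashing components together might look like the sketch below. The FNV-1a-style mixing constants are a stand-in; this is not Juman++'s actual hash function:

```python
# Each dictionary field value is an integer data pointer, so an n-gram
# feature can be reduced to a single 64-bit key by hashing the feature
# template id together with the component pointers.
# Constants are FNV-1a; illustrative, not Juman++'s real hash.
FNV_PRIME = 0x100000001B3
FNV_OFFSET = 0xCBF29CE484222325
MASK64 = (1 << 64) - 1

def hash_feature(template_id, *component_ptrs):
    h = FNV_OFFSET
    for x in (template_id, *component_ptrs):
        h = ((h ^ x) * FNV_PRIME) & MASK64
    return h
```

A bigram POS feature then becomes one integer, e.g. `hash_feature(template_id, pos_ptr_left, pos_ptr_right)`, usable directly as an index into the weight array instead of a tuple of strings.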
Dictionary/model size on Jumandic

            KyTea    MeCab    Juman++ V1   Juman++ V2
Dictionary  -        311 MB   445 MB       158 MB
Model       200 MB   7.7 MB   135 MB       16 MB

The raw dictionary (CSV with expanded inflections) is 256 MB. KyTea does not store all dictionary information (it uses only the surface, POS, and sub-POS fields). Note: 44% of the V2 dictionary is the DARTS trie.
Quiz: Where is the bottleneck?
1. Dictionary lookup / lattice construction: trie lookup, dictionary access
2. Feature computation: hashing, many conditionals
3. Score computation: score += weights[feature & (length - 1)];
4. Output: formatting, string concatenation
Answer: 3, score computation. The array access (not the sum) was taking ~80% of the total time. Reason: L2 cache and dTLB misses.
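The hot line can be reproduced in a few lines of sketch code. The power-of-two table size is what makes the `& (length - 1)` trick equivalent to a modulo:

```python
# Score computation sketch: each hashed feature indexes a power-of-two-sized
# weight array via a bit-mask. Every weights[f & mask] lookup is effectively
# a random memory access, which is why this line dominated the profile with
# L2 cache and dTLB misses.
def score_features(feature_hashes, weights):
    length = len(weights)
    assert length & (length - 1) == 0, "weights length must be a power of two"
    mask = length - 1
    return sum(weights[f & mask] for f in feature_hashes)

weights = [0.0, 0.5, -0.25, 0.0, 1.0, 0.0, 0.0, 0.125]  # length 8 = 2**3
```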
Latency numbers every programmer should know

Action                  Time     Comments
L1 cache reference      0.5 ns
Branch mispredict       5 ns
L2 cache reference      7 ns     14x L1 cache
Main memory reference   100 ns   20x L2 cache, 200x L1 cache

Cache misses are really slow, and a random memory access is almost guaranteed to be a cache miss. Direction: reduce the number of memory accesses -> reduce the number of score computations.
Lattice and terms

[Figure: a lattice boundary.] Right nodes begin on the boundary (also called t0); left nodes end on the boundary (also called t1); the nodes before those form the context (t2).
Beams in Juman++

Each t1 node keeps only a small number of top-scoring paths from t2. Using beams is crucial for the correct usage of high-order features.
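The per-node beam can be sketched as keeping the k best-scoring incoming paths; the names here are illustrative, not the actual implementation:

```python
import heapq

# Each t1 node keeps only the top-scoring paths arriving from its t2
# predecessors, so high-order (e.g. trigram) features still see the relevant
# history without enumerating every path through the lattice.
def local_beam(incoming, beam_size):
    """incoming: list of (score, path) pairs; keep the beam_size best by score."""
    return heapq.nlargest(beam_size, incoming, key=lambda sp: sp[0])
```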
New in V2: Search space trimming

There can be a huge number of paths through the lattice, and most of the candidate paths cannot be even remotely correct.
Global Beam: Left

Instead of considering all paths ending on a boundary, consider only the top k paths ending on the current boundary.
Global Beam: Right

Use the top m left paths to compute the scores of right nodes. Compute the connections to the remaining (k - m) left paths only for the top l right nodes (m = 1, l = 1 in the figure).

Idea: don't consider nodes which don't make sense in context.
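The pruning pattern can be sketched as follows. This is an illustrative simplification with l = 1 and a pre-ranked list of right nodes; the real implementation works on the lattice structures directly:

```python
# Global beam sketch: keep only the top-k left paths on the boundary; the
# top-m of them connect to every right node, and the remaining (k - m) left
# paths connect only to the single best right node (l = 1 for brevity).
def global_beam_pairs(left_scores, right_nodes, k, m):
    order = sorted(range(len(left_scores)),
                   key=left_scores.__getitem__, reverse=True)
    top_k = order[:k]
    pairs = [(i, r) for i in top_k[:m] for r in right_nodes]
    # right_nodes is assumed pre-ranked, so right_nodes[0] is the best one
    pairs += [(i, right_nodes[0]) for i in top_k[m:]]
    return pairs
```

With 4 left paths and 3 right nodes, full expansion would evaluate 12 connections; with k = 3, m = 1 the sketch evaluates only 5.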
Effects of global beams
• Almost every sentence has situations with 20-30 left/right nodes.
• Juman++ v2 default settings: local beam size = 5, left beam size = 6, right beam size = 1/5.
• Considering far fewer candidates yields a ~4x total speedup.
• Accuracy considerations:
  • Use surface characters after the current token as features: the ranking procedure should (at least) keep sensible paths in the right beam.
  • During training, use smaller global beam sizes.
Seg+POS F1 under different global beam parameters

[Figure: train the model with one set of global beam parameters, test with another. ρ is the size of both the left and the right beam; one configuration uses no global beam.]
Juman++ V2 RNNLM
• The V1 RNNLM implementation was non-batched and non-vectorizable.
• The V2 RNNLM:
  • uses Eigen for vectorization,
  • batches all computations,
  • deduplicates paths with identical states.
• Finally, we evaluate only the paths which have remained in the EOS beam: the RNN is essentially used to reorder paths.
• Opinion: evaluating the whole lattice is a waste of energy.
• Help wanted for an LSTM language model implementation (currently Mikolov's RNNLM).
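The rescoring step can be sketched as reordering the surviving EOS-beam paths by the combined score. Here `rnn_log_prob` stands in for the actual RNNLM, and the parameter values are hypothetical:

```python
# V2 runs the RNN only over the n-best paths left in the EOS beam, then uses
# the combined score s = linear + alpha * (s_RNN + beta) to reorder them.
def rerank(eos_beam, rnn_log_prob, alpha, beta):
    """eos_beam: list of (linear_score, path) pairs; returns paths, best first."""
    scored = [(linear + alpha * (rnn_log_prob(path) + beta), path)
              for linear, path in eos_beam]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [path for _, path in scored]
```

Because the RNN sees only a handful of candidates instead of the whole lattice, its cost stays roughly proportional to the beam size rather than to the lattice size.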
Future work
• Improve robustness on informal Japanese (web text)
• Create Jumanpp-Unidic
• Use it to bootstrap correct readings in our annotated corpora:
  • Kyoto Corpus
  • Kyoto Web Document Leads Corpus
Conclusion
• A fast morphological analyzer based on trigram features and an RNNLM
• Usable as a library in multithreaded environments
• Not hardwired to Jumandic! Can be used with different dictionaries/standards
• SOTA F1 on Jumandic
• Smallest model size (when used without the RNN)
• Can use surface features (character / character type)
• Can train with partially annotated data

github.com/ku-nlp/jumanpp
#jumanpp


