A Japanese NLP Library using spaCy as framework based on Universal Dependencies


megagonlabs/ginza


GiNZA NLP Library


An Open Source Japanese NLP Library, based on Universal Dependencies

Please read the Important changes before you upgrade GiNZA.

A Japanese version of this page is also available.

License

GiNZA NLP Library and GiNZA Japanese Universal Dependencies Models are distributed under the MIT License. You must agree to and follow the MIT License to use GiNZA NLP Library and GiNZA Japanese Universal Dependencies Models.

Explosion / spaCy

spaCy is the key framework of GiNZA.

spaCy LICENSE PAGE

Works Applications Enterprise / Sudachi/SudachiPy - SudachiDict - chiVe

SudachiPy provides highly accurate tokenization and POS tagging.

Sudachi LICENSE PAGE, SudachiPy LICENSE PAGE, SudachiDict LEGAL PAGE, chiVe LICENSE PAGE

Hugging Face / transformers

The GiNZA v5 Transformers model (ja_ginza_electra) is trained by using Hugging Face Transformers as a framework for pretrained models.

transformers LICENSE PAGE

Training Datasets

UD Japanese BCCWJ r2.8

The parsing model of GiNZA v5 is trained on a part of UD Japanese BCCWJ r2.8 (Omura and Asahara: 2018). This model is developed by the National Institute for Japanese Language and Linguistics and Megagon Labs.

GSK2014-A (2019) BCCWJ edition

The named entity recognition model of GiNZA v5 is trained on a part of GSK2014-A (2019) BCCWJ edition (Hashimoto, Inui, and Murakami: 2008). We use two named entity label systems: Sekine's Extended Named Entity Hierarchy and extended OntoNotes5. This model is developed by the National Institute for Japanese Language and Linguistics and Megagon Labs.

mC4

The GiNZA v5 Transformers model (ja_ginza_electra) is trained using transformers-ud-japanese-electra-base-discriminator, which is pretrained on more than 200 million Japanese sentences extracted from mC4.

Contains information from mC4 which is made available under the ODC Attribution License.

@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}

Runtime Environment

This project is developed with Python>=3.8 and its pip. We do not recommend using an Anaconda environment because the pip install step may not work properly.

Please also see the Development Environment section below.

Runtime set up

1. Install GiNZA NLP Library with Transformer-based Model

Uninstall any previous versions of the ginza and ja_ginza_electra packages:

$ pip uninstall ginza ja_ginza_electra

Then, install the latest versions of ginza and ja_ginza_electra:

$ pip install -U ginza ja_ginza_electra

The ja_ginza_electra package does not include pytorch_model.bin due to PyPI's archive size restrictions. This large model file is downloaded automatically on the first run, and the locally cached file is used for subsequent runs.

If you need to install ja_ginza_electra together with pytorch_model.bin at install time, you can specify the direct link to the GitHub release archive as follows:

$ pip install -U ginza https://github.com/megagonlabs/ginza/releases/download/latest/ja_ginza_electra-latest-with-model.tar.gz

If you want to accelerate the transformers-based models using GPUs with CUDA support, you can install spacy by specifying the CUDA version as follows:

$ pip install -U "spacy[cuda117]"

You also need to install a version of PyTorch that is consistent with the CUDA version.

2. Install GiNZA NLP Library with Standard Model

Uninstall any previous versions:

$ pip uninstall ginza ja_ginza

Then, install the latest versions of ginza and ja_ginza:

$ pip install -U ginza ja_ginza

When using Apple Silicon such as M1 or M2, you can accelerate the analysis process by installing thinc-apple-ops:

$ pip install torch thinc-apple-ops

Execute ginza command

Run the ginza command from the console, then input some Japanese text. After pressing the Enter key, you will get the parsed results in CoNLL-U Syntactic Annotation format.

$ ginza
銀座でランチをご一緒しましょう。
# text = 銀座でランチをご一緒しましょう。
1       銀座    銀座    PROPN   名詞-固有名詞-地名-一般 _       6       nmod    _       SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City|ClauseHead=6
2       で      で      ADP     助詞-格助詞     _       1       case    _       SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=デ|ClauseHead=6
3       ランチ  ランチ  NOUN    名詞-普通名詞-一般      _       6       obj     _       SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ランチ|ClauseHead=6
4       を      を      ADP     助詞-格助詞     _       3       case    _       SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ|ClauseHead=6
5       ご      ご      NOUN    接頭辞  _       6       compound        _       SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|NP_B|Reading=ゴ|ClauseHead=6
6       一緒    一緒    NOUN    名詞-普通名詞-サ変可能  _       0       root    _       SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|NP_I|Reading=イッショ|ClauseHead=6
7       し      する    AUX     動詞-非自立可能 _       6       aux     _       SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ|ClauseHead=6
8       ましょう        ます    AUX     助動詞  _       6       aux     _       SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ|ClauseHead=6
9       。      。      PUNCT   補助記号-句点   _       6       punct   _       SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。|ClauseHead=6
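If you want to post-process this output in Python, the following is a minimal sketch of a reader for the token lines. It assumes the ten columns are tab-separated, as the CoNLL-U format specifies, and that every ID and HEAD is a plain integer as in the sample above. The names parse_misc and parse_conllu are hypothetical helpers, not part of GiNZA.

```python
# Names of the ten CoNLL-U columns, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

# First two token lines of the sample output, written with real tab separators.
SAMPLE = (
    "# text = 銀座でランチをご一緒しましょう。\n"
    "1\t銀座\t銀座\tPROPN\t名詞-固有名詞-地名-一般\t_\t6\tnmod\t_\t"
    "SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|"
    "Reading=ギンザ|NE=B-GPE|ENE=B-City|ClauseHead=6\n"
    "2\tで\tで\tADP\t助詞-格助詞\t_\t1\tcase\t_\t"
    "SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|"
    "Reading=デ|ClauseHead=6\n"
)


def parse_misc(misc):
    """Split the MISC column into a dict; bare flags such as NP_B map to True."""
    result = {}
    if misc == "_":
        return result
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_conllu(text):
    """Return one dict per token line, skipping comment and blank lines."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        token = dict(zip(CONLLU_FIELDS, line.split("\t")))
        token["id"] = int(token["id"])      # assumes no multi-word token ranges
        token["head"] = int(token["head"])
        token["misc"] = parse_misc(token["misc"])
        tokens.append(token)
    return tokens


tokens = parse_conllu(SAMPLE)
for token in tokens:
    print(token["id"], token["form"], token["deprel"], token["head"],
          token["misc"].get("Reading"))
```

The MISC column carries the GiNZA-specific fields (Reading, NE, ENE, bunsetu labels), so turning it into a dict is usually the most useful step.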

The ginzame command provides a tokenization function like MeCab. The output format of ginzame is almost the same as that of mecab, but the last pronunciation field is always '*'.

$ ginzame
銀座でランチをご一緒しましょう。
銀座	名詞,固有名詞,地名,一般,*,*,銀座,ギンザ,*
で	助詞,格助詞,*,*,*,*,で,デ,*
ランチ	名詞,普通名詞,一般,*,*,*,ランチ,ランチ,*
を	助詞,格助詞,*,*,*,*,を,ヲ,*
ご	接頭辞,*,*,*,*,*,御,ゴ,*
一緒	名詞,普通名詞,サ変可能,*,*,*,一緒,イッショ,*
し	動詞,非自立可能,*,*,サ行変格,連用形-一般,為る,シ,*
ましょう	助動詞,*,*,*,助動詞-マス,意志推量形,ます,マショウ,*
。	補助記号,句点,*,*,*,*,。,。,*
EOS
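The mecab-like output can be read back with a few lines of Python. This is only a sketch: it assumes the surface form and the comma-separated feature list are separated by a tab, as in MeCab's default format, and the Morpheme type and parse_mecab_like function are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple


class Morpheme(NamedTuple):
    surface: str
    features: List[str]


# Two morphemes from the sample output, with an assumed tab after the surface.
SAMPLE = (
    "銀座\t名詞,固有名詞,地名,一般,*,*,銀座,ギンザ,*\n"
    "で\t助詞,格助詞,*,*,*,*,で,デ,*\n"
    "EOS\n"
)


def parse_mecab_like(text):
    """Group morpheme lines into sentences, using 'EOS' as the delimiter."""
    sentences, current = [], []
    for line in text.splitlines():
        if line == "EOS":
            sentences.append(current)
            current = []
        elif line:
            surface, _, feature_csv = line.partition("\t")
            # Naive split: a feature value that itself contains a comma
            # (for example the comma token) would need special handling.
            current.append(Morpheme(surface, feature_csv.split(",")))
    return sentences


sentences = parse_mecab_like(SAMPLE)
for morpheme in sentences[0]:
    # In the sample, the second-to-last feature is the reading
    # and the last (pronunciation) feature is always '*'.
    print(morpheme.surface, morpheme.features[0], morpheme.features[-2])
```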

spaCy's JSON format is available by specifying -f 3 or -f json for the ginza command.

$ ginza -f json
銀座でランチをご一緒しましょう。
[
 {
  "paragraphs": [
   {
    "raw": "銀座でランチをご一緒しましょう。",
    "sentences": [
     {
      "tokens": [
       {"id": 1, "orth": "銀座", "tag": "名詞-固有名詞-地名-一般", "pos": "PROPN", "lemma": "銀座", "head": 5, "dep": "obl", "ner": "B-City"},
       {"id": 2, "orth": "で", "tag": "助詞-格助詞", "pos": "ADP", "lemma": "で", "head": -1, "dep": "case", "ner": "O"},
       {"id": 3, "orth": "ランチ", "tag": "名詞-普通名詞-一般", "pos": "NOUN", "lemma": "ランチ", "head": 3, "dep": "obj", "ner": "O"},
       {"id": 4, "orth": "を", "tag": "助詞-格助詞", "pos": "ADP", "lemma": "を", "head": -1, "dep": "case", "ner": "O"},
       {"id": 5, "orth": "ご", "tag": "接頭辞", "pos": "NOUN", "lemma": "ご", "head": 1, "dep": "compound", "ner": "O"},
       {"id": 6, "orth": "一緒", "tag": "名詞-普通名詞-サ変可能", "pos": "VERB", "lemma": "一緒", "head": 0, "dep": "ROOT", "ner": "O"},
       {"id": 7, "orth": "し", "tag": "動詞-非自立可能", "pos": "AUX", "lemma": "する", "head": -1, "dep": "advcl", "ner": "O"},
       {"id": 8, "orth": "ましょう", "tag": "助動詞", "pos": "AUX", "lemma": "ます", "head": -2, "dep": "aux", "ner": "O"},
       {"id": 9, "orth": "。", "tag": "補助記号-句点", "pos": "PUNCT", "lemma": "。", "head": -3, "dep": "punct", "ner": "O"}
      ]
     }
    ]
   }
  ]
 }
]
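Unlike the CoNLL-U output, the "head" values in this JSON sample appear to be offsets relative to each token's own "id", with 0 marking the root (for example, token 1 with head 5 points to token 6). The sketch below resolves those offsets into head tokens. It relies on that reading of the sample, and resolve_heads is a hypothetical helper name. Only the fields needed for the illustration are kept in the embedded data.

```python
import json

# Reduced copy of the sample output: id, orth, head and dep only.
SAMPLE_JSON = """
[{"paragraphs": [{"raw": "銀座でランチをご一緒しましょう。", "sentences": [{"tokens": [
  {"id": 1, "orth": "銀座", "head": 5, "dep": "obl"},
  {"id": 2, "orth": "で", "head": -1, "dep": "case"},
  {"id": 3, "orth": "ランチ", "head": 3, "dep": "obj"},
  {"id": 4, "orth": "を", "head": -1, "dep": "case"},
  {"id": 5, "orth": "ご", "head": 1, "dep": "compound"},
  {"id": 6, "orth": "一緒", "head": 0, "dep": "ROOT"},
  {"id": 7, "orth": "し", "head": -1, "dep": "advcl"},
  {"id": 8, "orth": "ましょう", "head": -2, "dep": "aux"},
  {"id": 9, "orth": "。", "head": -3, "dep": "punct"}
]}]}]}]
"""


def resolve_heads(documents):
    """Return (orth, dep, head_orth) triples, treating 'head' as a relative offset."""
    arcs = []
    for document in documents:
        for paragraph in document["paragraphs"]:
            for sentence in paragraph["sentences"]:
                by_id = {t["id"]: t for t in sentence["tokens"]}
                for token in sentence["tokens"]:
                    # An offset of 0 makes the root point at itself.
                    head = by_id[token["id"] + token["head"]]
                    arcs.append((token["orth"], token["dep"], head["orth"]))
    return arcs


arcs = resolve_heads(json.loads(SAMPLE_JSON))
for orth, dep, head_orth in arcs:
    print(f"{orth} --{dep}--> {head_orth}")
```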

If you want output similar to cabocha -f1 (lattice style), add the -f 1 or -f cabocha option to the ginza command. The format is almost the same as cabocha -f1, but the func_index field (after the slash) is slightly different. Our func_index field indicates the boundary where the 自立語 (content words) end in each 文節 (bunsetsu), which is also where the 機能語 (function words) may start. The functional token filter also differs slightly between cabocha -f1 and ginza -f cabocha.

$ ginza -f cabocha
銀座でランチをご一緒しましょう。
* 0 2D 0/1 0.000000
銀座	名詞,固有名詞,地名,一般,,銀座,ギンザ,*	B-City
で	助詞,格助詞,*,*,,で,デ,*	O
* 1 2D 0/1 0.000000
ランチ	名詞,普通名詞,一般,*,,ランチ,ランチ,*	O
を	助詞,格助詞,*,*,,を,ヲ,*	O
* 2 -1D 0/2 0.000000
ご	接頭辞,*,*,*,,ご,ゴ,*	O
一緒	名詞,普通名詞,サ変可能,*,,一緒,イッショ,*	O
し	動詞,非自立可能,*,*,サ行変格,連用形-一般,する,シ,*	O
ましょう	助動詞,*,*,*,助動詞-マス,意志推量形,ます,マショウ,*	O
。	補助記号,句点,*,*,,。,。,*	O
EOS
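To make the role of func_index concrete, here is a small sketch that reads the lattice output and splits each bunsetsu into its content-word part and the rest, following the description above. It assumes token lines are tab-separated (surface, features, NE label), as in CaboCha's lattice format. Chunk and parse_cabocha are hypothetical names.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Header line of a chunk: "* <index> <link>D <head>/<func> <score>"
CHUNK_HEADER = re.compile(r"^\* (\d+) (-?\d+)D (\d+)/(\d+) (\S+)$")

SAMPLE = (
    "* 0 2D 0/1 0.000000\n"
    "銀座\t名詞,固有名詞,地名,一般,,銀座,ギンザ,*\tB-City\n"
    "で\t助詞,格助詞,*,*,,で,デ,*\tO\n"
    "* 1 2D 0/1 0.000000\n"
    "ランチ\t名詞,普通名詞,一般,*,,ランチ,ランチ,*\tO\n"
    "を\t助詞,格助詞,*,*,,を,ヲ,*\tO\n"
    "* 2 -1D 0/2 0.000000\n"
    "ご\t接頭辞,*,*,*,,ご,ゴ,*\tO\n"
    "一緒\t名詞,普通名詞,サ変可能,*,,一緒,イッショ,*\tO\n"
    "し\t動詞,非自立可能,*,*,サ行変格,連用形-一般,する,シ,*\tO\n"
    "ましょう\t助動詞,*,*,*,助動詞-マス,意志推量形,ます,マショウ,*\tO\n"
    "。\t補助記号,句点,*,*,,。,。,*\tO\n"
    "EOS\n"
)


@dataclass
class Chunk:
    index: int        # position of this bunsetsu in the sentence
    link: int         # index of the bunsetsu it depends on, -1 for the root
    head_index: int   # value before the slash
    func_index: int   # value after the slash: where the content words end
    surfaces: List[str] = field(default_factory=list)


def parse_cabocha(text):
    """Collect token surfaces under the chunk header that precedes them."""
    chunks = []
    for line in text.splitlines():
        match = CHUNK_HEADER.match(line)
        if match:
            chunks.append(Chunk(*(int(match[i]) for i in range(1, 5))))
        elif line and line != "EOS" and chunks:
            chunks[-1].surfaces.append(line.split("\t")[0])
    return chunks


chunks = parse_cabocha(SAMPLE)
for chunk in chunks:
    content = "".join(chunk.surfaces[:chunk.func_index])
    rest = "".join(chunk.surfaces[chunk.func_index:])
    print(chunk.index, "->", chunk.link, content, "|", rest)
```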

Multi-processing (Experimental)

The -p NUM_PROCESS option is available from GiNZA v3.0. Set NUM_PROCESS to the number of analysis processes. To use all CPU cores, run ginza -p 0. The memory requirement is about 130MB per process (to be improved).
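When driving the command from a script, the options described in this README (-f, -p, -m) can be assembled programmatically. The sketch below is one possible wrapper around the standard subprocess module. build_ginza_command and run_ginza are hypothetical helpers, and the wrapper assumes that ginza reads its input text from standard input, as in the console examples above.

```python
import shutil
import subprocess


def build_ginza_command(output_format=None, processes=None, model=None):
    """Assemble the argument list for the ginza command line tool."""
    command = ["ginza"]
    if output_format is not None:
        command += ["-f", output_format]   # e.g. "json", "cabocha" or "mecab"
    if processes is not None:
        command += ["-p", str(processes)]  # 0 uses all CPU cores
    if model is not None:
        command += ["-m", model]           # e.g. "ja_ginza_electra"
    return command


def run_ginza(text, **options):
    """Send text to ginza on stdin and return its stdout as a string."""
    if shutil.which("ginza") is None:
        raise RuntimeError("the ginza command was not found on PATH")
    result = subprocess.run(
        build_ginza_command(**options),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


print(build_ginza_command(output_format="json", processes=0))
```

For example, run_ginza("銀座でランチをご一緒しましょう。\n", output_format="json", processes=0) would analyze one line using all cores, provided ginza and a model are installed.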

Coding example

The following code prints the dependency parsing results, with 'EOS' marking each sentence boundary.

import spacy

# Load the transformer-based model (use 'ja_ginza' for the standard model).
nlp = spacy.load('ja_ginza_electra')
doc = nlp('銀座でランチをご一緒しましょう。')
for sent in doc.sents:
    for token in sent:
        print(
            token.i,
            token.orth_,
            token.lemma_,
            token.norm_,
            token.morph.get("Reading"),
            token.pos_,
            token.morph.get("Inflection"),
            token.tag_,
            token.dep_,
            token.head.i,
        )
    # Mark the end of each sentence.
    print('EOS')

User Dictionary

The user dictionary files should be set in the userDict field of sudachi.json in the installed package directory of the ja_ginza_dict package.

Please read the official documents to compile user dictionaries with the sudachipy command:

  • SudachiPy - User defined Dictionary
  • Sudachi User Dictionary Construction (Japanese Only)
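Editing the userDict field can also be scripted. The following is a minimal sketch that appends compiled dictionary paths to the userDict list of a sudachi.json file. It assumes userDict is a JSON array of paths, as in Sudachi's settings format. add_user_dicts and the file names in the usage note are hypothetical, and you need to locate the actual sudachi.json in your own installation.

```python
import json
from pathlib import Path


def add_user_dicts(sudachi_json_path, dict_paths):
    """Append dictionary paths to the userDict list, skipping duplicates."""
    path = Path(sudachi_json_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    user_dicts = list(config.get("userDict", []))
    for dict_path in dict_paths:
        if dict_path not in user_dicts:
            user_dicts.append(dict_path)
    config["userDict"] = user_dicts
    # Keep non-ASCII characters readable in the rewritten settings file.
    path.write_text(
        json.dumps(config, ensure_ascii=False, indent=4),
        encoding="utf-8",
    )
    return config
```

A call such as add_user_dicts("path/to/sudachi.json", ["path/to/user.dic"]) leaves all other settings untouched. It is worth backing up the original file first, since the settings file lives inside an installed package.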

Releases

version 5.x

ginza-5.2.0

  • 2024-03-31
  • Require python>=3.8
  • Migrate to spaCy v3.7
  • New functionality
    • add Japanese clause recognition API (experimental)

ginza-5.1.3

  • 2023-09-25
  • Migrate to spaCy v3.6
  • Beta release of ja_ginza_bert_large

ginza-5.1.2

  • 2022-03-12
  • Migrate to spaCy v3.4

ginza-5.1.1

  • 2022-03-12
  • Improvements
    • auto deploy for pypi by @nimiusrd in #184
    • modify github actions: trigger by tagging, stop uploading test pypi by @r-terada in #233

ginza-5.1.0

  • 2021-12-10, Euclase
  • Important changes
    • Upgrade: spaCy v3.2 and Sudachi.rs(SudachiPy v0.6.2)
    • Change token information fields #208 #209
      • doc.user_data["reading_forms"][token.i] -> token.morph.get("Reading")
      • doc.user_data["inflections"][token.i] -> token.morph.get("Inflection")
      • force_using_normalized_form_as_lemma(True) -> token.norm_
    • All spaCy models, including non-Japanese, are now available with the ginza command #217
      • Download and analyze the model at once by specifying the model name in the following form #219
      • ginza -m en_core_web_md
    • Change ginza --require_gpu and ginza -g to take a gpu_id argument
      • The default gpu_id value is -1, which uses only CPUs
    • The ginza -f json option always analyzes lines that start with #, regardless of the value of the -c option. #215
  • Improvements
    • Batch analysis processing speeds up by 50-60% in GPU environment and 10-40% in CPU environment
    • Improved processing efficiency of parallel execution options (ginza -p {n_process} and ginzame) of ginza command #204
    • add tests #198 #210 #214
    • add benchmark #207 #220

ginza-5.0.3

  • 2021-10-15
  • Bug fix
    • Bunsetu span should not cross the sentence boundary #195

ginza-5.0.2

  • 2021-09-06
  • Bug fix
    • Command Line -s option and set_split_mode() not working in v5.0.x #185

ginza-5.0.1

  • 2021-08-26
  • Bug fix
    • ginzame not woriking in ginza ver. 5 #179
    • Command Line -d option not working in v5.0.0 #178
  • Improvement
    • accept ja-ginza and ja-ginza-electra for the -m option of the ginza command

ginza-5.0.0

  • 2021-08-26, Demantoid
  • Important changes
    • Upgrade spaCy to v3
      • Release transformer-based ja-ginza-electra model
      • Improve UPOS accuracy of the standard ja-ginza model by adding morphologizer to the tail of spaCy pipeline
    • Need to install analysis model along with ginza package
      • High accuracy model (>=16GB memory needed)
        • pip install -U ginza ja-ginza-electra
      • Speed oriented model
        • pip install -U ginza ja-ginza
    • Change component names of CompoundSplitter and BunsetuRecognizer to compound_splitter and bunsetu_recognizer respectively
    • Also see spaCy v3 Backwards Incompatibilities
  • Improvements
    • Add command line options
      • -n
        • Force using SudachiPy's normalized_form as Token.lemma_
      • -m (ja_ginza|ja_ginza_electra)
        • Select model package
    • Revise ENE category name
      • Degital_Game to Digital_Game

version 4.x

ginza-4.0.6

  • 2021-06-01
  • Bug fix
    • Issue #160: IndexError: list assignment index out of range for empty string

ginza-4.0.5

  • 2020-10-01
  • Improvements
    • Add -d option, which disables spaCy's sentence separator, to the ginza command line tool

ginza-4.0.4

  • 2020-09-11
  • Improvements
    • ginza command line tool works correctly without BunsetuRecognizer in the pipeline

ginza-4.0.3

  • 2020-09-10
  • Improve bunsetu head identification accuracy over inconsistent deps in ent spans

ginza-4.0.2

  • 2020-09-04
  • Improvements
    • Serialization of CompoundSplitter for nlp.to_disk()
    • Bunsetu span detection accuracy

ginza-4.0.1

  • 2020-08-30
  • Debug
    • Add type arguments for singledispatch register annotations (for Python 3.6)

ginza-4.0.0

  • 2020-08-16, Chrysoberyl
  • Important changes
    • Replace Japanese model with spacy.lang.ja of spaCy v2.3
      • Replace values of Token.lemma_ with the output of SudachiPy's Morpheme.dictionary_form()
    • Replace ja_ginza_dict with official SudachiDict-core package
      • You can delete ja_ginza_dict package safely
    • Change options and misc field contents of output of command line tool
      • delete use_sentence_separator(-s)
      • NE(OntoNotes) BI labels as B-GPE
      • Add subfields: Reading, Inf(inflection) and ENE(Extended NE)
    • Obsolete Token._.* and add some entries for Doc.user_data[] and accessors
      • inflections (ginza.inflection(Token))
      • reading_forms (ginza.reading_form(Token))
      • bunsetu_bi_labels (ginza.bunsetu_bi_label(Token))
      • bunsetu_position_types (ginza.bunsetu_position_type(Token))
      • bunsetu_heads (ginza.is_bunsetu_head(Token))
    • Change pipeline architecture
      • JapaneseCorrector was obsoleted
      • Add CompoundSplitter and BunsetuRecognizer
    • Upgrade UD_JAPANESE-BCCWJ to v2.6
    • Change word2vec to chiVe mc90
  • API Changes
    • Add bunsetu-unit APIs (from ginza import *)
      • bunsetu(Token)
      • phrase(Token)
      • sub_phrases(Token)
      • phrases(Span)
      • bunsetu_spans(Span)
      • bunsetu_phrase_spans(Span)
      • bunsetu_head_list(Span)
      • bunsetu_head_tokens(Span)
      • bunsetu_bi_labels(Span)
      • bunsetu_position_types(Span)

version 3.x

ginza-3.1.2

  • 2020-02-12
  • Debug
    • Fix: degrade of cabocha mode

ginza-3.1.1

  • 2020-01-19
  • API Changes
    • Extension fields
      • The values of the Token._.sudachi field are set only after calling SudachipyTokenizer.set_enable_ex_sudachi(True), to avoid serialization errors

import spacy
import pickle

nlp = spacy.load('ja_ginza')
doc1 = nlp('This example will be serialized correctly.')
doc1.to_bytes()
with open('sample1.pickle', 'wb') as f:
    pickle.dump(doc1, f)

# Enabling the extension field stores SudachiPy morphemes on each token.
nlp.tokenizer.set_enable_ex_sudachi(True)
doc2 = nlp('This example will cause a serialization error.')
doc2.to_bytes()
with open('sample2.pickle', 'wb') as f:
    pickle.dump(doc2, f)

ginza-3.1.0

  • 2020-01-16
  • Important changes
    • Distribute ja_ginza_dict from PyPI
  • API Changes
    • commands
      • ginza and ginzame
        • add -i option to initialize the files of ja_ginza_dict

ginza-3.0.0

  • 2020-01-15, Benitoite
  • Important changes
    • Distribute ginza and ja_ginza from PyPI
      • Simple installation; pip install ginza, and run ginza
      • The model package, ja_ginza, is also available from PyPI.
    • Model improvements
      • Change NER training data-set to GSK2014-A (2019) BCCWJ edition
        • Improved accuracy of NER
        • token.ent_type_ value is changed to Sekine's Extended Named Entity Hierarchy
          • Add ENE7 attribute to the last field of the output of ginza
        • Move OntoNotes5-based label to token._.ne
          • We extended the OntoNotes5 named entity labels with PHONE, EMAIL, URL, and PET_NAME
      • Overall accuracy is improved by executing spacy pretrain over 100 epochs
        • Multi-task learning of spacy train effectively working on UD Japanese BCCWJ
      • The newest SudachiDict_core-20191224
    • ginzame
      • Execute sudachipy by multiprocessing.Pool and output results with mecab like format
      • Now sudachipy command requires additional SudachiDict package installation
  • Breaking API Changes
    • commands
      • ginza (ginza.command_line.main_ginza)
        • change option mode to sudachipy_mode
        • drop options: disable_pipes and recreate_corrector
        • add options: hash_comment, parallel, files
        • add mecab to the choices for the argument of -f option
        • add parallel NUM_PROCESS option (EXPERIMENTAL)
        • add ENE7 attribute to conllu miscellaneous field
          • ginza.ent_type_mapping.ENE_NE_MAPPING is used to convert ENE7 label to NE
      • add ginzame (ginza.command_line.main_ginzame)
        • a multi-process tokenizer providing mecab like output format
    • spaCy field extensions
      • add token._.ne for ner label
    • ginza/sudachipy_tokenizer.py
      • change SudachiTokenizer to SudachipyTokenizer
      • use SUDACHI_DEFAULT_SPLIT_MODE instead of SUDACHI_DEFAULT_SPLITMODE or SUDACHI_DEFAULT_MODE
  • Dependencies
    • upgrade spacy to v2.2.3
    • upgrade sudachipy to v0.4.2

version 2.x

ginza-2.2.1

  • 2019-10-28
  • Improvements
    • JapaneseCorrector can merge the as_* type dependencies completely
  • Bug fixes
    • The command line tool failed in specific situations

ginza-2.2.0

  • 2019-10-04, Ametrine
  • Important changes
    • split_mode has been set incorrectly to sudachipy.tokenizer from v2.0.0 (#43)
      • This bug caused split_mode incompatibility between the training phase and the ginza command.
      • split_mode was set to 'B' for the training phase and Python APIs, but 'C' for the ginza command.
      • We fixed this bug by setting the default split_mode to 'C' entirely.
      • This fix may cause word segmentation incompatibilities when upgrading GiNZA from v2.0.0 to v2.2.0.
  • New features
    • Add -f and --output-format option to ginza command:
    • Add custom token fields:
      • bunsetu_index : bunsetu index starting from 0
      • reading: reading of token (not a pronunciation)
      • sudachi: SudachiPy's morpheme instance (or its list when the tokens are gathered by JapaneseCorrector)
  • Performance improvements
    • Tokenizer
      • Use latest SudachiDict (SudachiDict_core-20190927.tar.gz)
      • Use Cythonized SudachiPy (v0.4.0)
    • Dependency parser
      • Apply spacy pretrain command to capture the language model from UD-Japanese BCCWJ, UD_Japanese-PUD and KWDLC.
      • Apply multitask objectives by using -pt 'tag,dep' option of spacy train
    • New model file
      • ja_ginza-2.2.0.tar.gz

ginza-2.0.0

  • 2019-07-08
  • Add ginza command
    • run ginza from the console
  • Change package structure
    • module package as ginza
    • language model package as ja_ginza
    • spacy.lang.ja is overridden by ginza
  • Remove sudachipy related directories
    • SudachiPy and its dictionary are installed via pip during ginza installation
  • User dictionary available
  • Token extension fields
    • Added
      • token._.bunsetu_bi_label, token._.bunsetu_position_type
    • Remained
      • token._.inf
    • Removed
      • pos_detail (same value is set to token.tag_)

version 1.x

ja_ginza_nopn-1.0.2

  • 2019-04-07
  • Set the head token index of root to 0 to conform to the CoNLL-U format definition

ja_ginza_nopn-1.0.1

  • 2019-04-02
  • Add new Japanese era 'reiwa' to system_core.dic.

ja_ginza_nopn-1.0.0

  • 2019-04-01
  • First release version

Development Environment

Development set up

1. Clone from github

$ git clone 'https://github.com/megagonlabs/ginza.git'

2. Run python setup.py

For normal environment:

$ python setup.py develop

3. Set up system.dic

Copy system.dic from the installed package directory of ja_ginza_dict to ./ja_ginza_dict/sudachidict/.

Training models

The analysis model of GiNZA is trained with the spacy train command.

$ python -m spacy train ja ja_ginza-4.0.0 corpus/ja_ginza-ud-train.json corpus/ja_ginza-ud-dev.json -b ja_vectors_chive_mc90_35k/ -ovl 0.3 -n 100 -m meta.json.ginza -V 4.0.0

Run tests

GiNZA uses the pytest framework for testing, and you can run the tests via setup.py without installing the test requirements explicitly. Some tests depend on the default GiNZA models (ja-ginza, ja-ginza-electra), so install them before running the tests.

$ pip install ja-ginza ja-ginza-electra
$ pip install -e .
# full test
$ python setup.py test
# test single file
$ python setup.py test --addopts ginza/tests/test_analyzer.py
