WorksApplications/SudachiTra

Japanese tokenizer for Transformers

License

Apache-2.0 license

Sudachi Transformers (chiTra)

chiTra provides pre-trained language models and a Japanese tokenizer for Transformers.

chiTra stands for Sudachi Transformers.

Pretrained Model

The data is generously hosted by AWS through its Open Data Sponsorship Program.

Version	Normalized	SudachiTra	Sudachi	SudachiDict	Text	Pretrained Model
v1.0	normalized_and_surface	v0.1.7	0.6.2	20211220-core	NWJC (109GB)	395 MB (tar.gz)
v1.1	normalized_nouns	v0.1.8	0.6.6	20220729-core	NWJC with additional cleaning (79GB)	396 MB (tar.gz)

Features

Training on large texts
- Models are trained on the NINJAL Web Japanese Corpus (NWJC) to support a wide variety of expressions and domains.
Using Sudachi
- Tokenizing with the morphological analyzer Sudachi reduces the negative effects of spelling variation (表記ゆれ).
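
To make the point about spelling variation concrete, here is a toy sketch of why mapping variants to one normalized form shrinks the vocabulary a model has to learn. It does not call Sudachi: the lookup table is a hypothetical stand-in for its dictionary, and only the pair すだち → 酢橘 is taken from the Quick Tour output in this README.

```python
# Toy illustration only: this table stands in for Sudachi's dictionary lookup.
# Only the pair すだち -> 酢橘 comes from the Quick Tour output; the katakana
# entry is a hypothetical example.
TOY_NORMALIZED_FORMS = {
    "すだち": "酢橘",
    "スダチ": "酢橘",  # hypothetical variant
}


def normalize(word: str) -> str:
    """Return the normalized form of a word, or the word itself if unknown."""
    return TOY_NORMALIZED_FORMS.get(word, word)


def vocabulary(words: list[str], use_normalization: bool) -> set[str]:
    """Collect the distinct vocabulary entries needed to cover `words`."""
    return {normalize(w) if use_normalization else w for w in words}


words = ["すだち", "スダチ", "酢橘"]
print(len(vocabulary(words, use_normalization=False)))  # 3
print(len(vocabulary(words, use_normalization=True)))   # 1
```

Without normalization the three spellings occupy three separate entries. With it they share one, so the training signal for the word is not split across variants.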

How to use chiTra

Quick Tour

Requirements

$ pip install sudachitra
$ wget https://sudachi.s3.ap-northeast-1.amazonaws.com/chitra/chiTra-1.1.tar.gz
$ tar -zxvf chiTra-1.1.tar.gz
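
If `wget` and `tar` are not available (for example on Windows), the download and extraction steps can be approximated with the Python standard library. This is a minimal sketch, not part of chiTra: the helper names `archive_name`, `download`, and `extract` are made up for illustration, and only the URL is taken from the commands above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the commands above.
MODEL_URL = "https://sudachi.s3.ap-northeast-1.amazonaws.com/chitra/chiTra-1.1.tar.gz"


def archive_name(url: str) -> str:
    """Return the file name part of a URL, e.g. 'chiTra-1.1.tar.gz'."""
    return url.rsplit("/", 1)[-1]


def download(url: str, dest_dir: Path) -> Path:
    """Download `url` into `dest_dir` unless the file is already there."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / archive_name(url)
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def extract(archive: Path, dest_dir: Path) -> list[str]:
    """Extract a .tar.gz archive into `dest_dir` and return the member names."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

Calling `extract(download(MODEL_URL, Path(".")), Path("."))` mirrors the `wget` and `tar` commands. The archive is listed as 396 MB, so the download may take a while.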

Load the model

>>> from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer
>>> from transformers import BertModel
>>> tokenizer = BertSudachipyTokenizer.from_pretrained('chiTra-1.1')
>>> tokenizer.tokenize("選挙管理委員会とすだち")
['選挙', '##管理', '##委員会', 'と', '酢', '##橘']
>>> model = BertModel.from_pretrained('chiTra-1.1')
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state
tensor([[[ 0.8583, -1.1752, -0.7987,  ..., -1.1691, -0.8355,  3.4678],
         [ 0.0220,  1.1702, -2.3334,  ...,  0.6673, -2.0774,  2.7731],
         [ 0.0894, -1.3009,  3.4650,  ..., -0.1140,  0.1767,  1.9859],
         ...,
         [-0.4429, -1.6267, -2.1493,  ..., -1.7801, -1.8009,  2.5343],
         [ 1.7204, -1.0540, -0.4362,  ..., -0.0228,  0.5622,  2.5800],
         [ 1.1125, -0.3986,  1.8532,  ..., -0.8021, -1.5888,  2.9520]]],
       grad_fn=<NativeLayerNormBackward0>)
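
In the tokenizer output above, pieces that start with `##` look like continuation pieces in the usual BERT subword convention. That reading is an assumption, since this README does not spell it out. Under that assumption, the following sketch merges the pieces back into words. `merge_subwords` is a hypothetical helper, not part of sudachitra.

```python
def merge_subwords(tokens: list[str], prefix: str = "##") -> list[str]:
    """Join continuation pieces (those starting with `prefix`) onto the previous token."""
    words: list[str] = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: strip the marker and append to the last word.
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return words


# Token list copied from the tokenizer output above.
tokens = ['選挙', '##管理', '##委員会', 'と', '酢', '##橘']
print(merge_subwords(tokens))  # ['選挙管理委員会', 'と', '酢橘']
```

The merged result contains 酢橘 where the input had すだち, so the tokenizer returns a normalized spelling rather than the surface form.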

Installation

$ pip install sudachitra

The default Sudachi dictionary is SudachiDict-core.

You can also use other dictionaries, such as SudachiDict-small and SudachiDict-full.
In that case, install the dictionary you want to use, as shown below.
If you use a pre-trained model, note that it was trained with SudachiDict-core.

$ pip install sudachidict_small sudachidict_full
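
To check which dictionaries are present in the current environment before loading a tokenizer, a small standard-library check is enough. This is a sketch rather than a chiTra feature: `installed_dictionaries` is a hypothetical helper, and it assumes the importable module names match the package names used in the `pip` command (`sudachidict_core`, `sudachidict_small`, `sudachidict_full`).

```python
import importlib.util

# Assumed module names of the three SudachiDict packages.
DICT_MODULES = {
    "core": "sudachidict_core",
    "small": "sudachidict_small",
    "full": "sudachidict_full",
}


def installed_dictionaries() -> dict[str, bool]:
    """Report, for each dictionary type, whether its package can be found."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in DICT_MODULES.items()
    }


for name, available in installed_dictionaries().items():
    print(f"SudachiDict-{name}: {'installed' if available else 'missing'}")
```

The check only reports whether each package is installed. It does not change which dictionary the tokenizer uses.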

Pretraining

For details on the pretraining procedure, see pretraining/bert/README.md.

For Developers

TBD

License

"chiTra" is distributed by the National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. under the Apache License, Version 2.0.

Contact us

Open an issue, or come to our Slack workspace for questions and discussion.

We have a Slack workspace for developers and users to ask questions and discuss: https://sudachi-dev.slack.com/ (an invitation is required to join)

Citing chiTra

We have published the following paper about chiTra:

勝田哲弘, 林政義, 山村崇, Tolmachev Arseny, 高岡一馬, 内田佳孝, 浅原正幸, 単語正規化による表記ゆれに頑健な BERT モデルの構築. 言語処理学会第28回年次大会, 2022.

When citing chiTra in papers, books, or services, please use the following BibTeX entry:

@INPROCEEDINGS{katsuta2022chitra,
    author    = {勝田哲弘, 林政義, 山村崇, Tolmachev Arseny, 高岡一馬, 内田佳孝, 浅原正幸},
    title     = {単語正規化による表記ゆれに頑健な BERT モデルの構築},
    booktitle = "言語処理学会第28回年次大会(NLP2022)",
    year      = "2022",
    pages     = "",
    publisher = "言語処理学会",
}

Model used for experiment

The models used in the experiments of "単語正規化による表記ゆれに頑健なBERTモデルの構築" are available below.

Normalized	Text	Pretrained Model
surface	Wiki-40B	tar.gz
normalized_and_surface	Wiki-40B	tar.gz
normalized_conjugation	Wiki-40B	tar.gz
normalized	Wiki-40B	tar.gz

Enjoy chiTra!

Releases: 10 (latest: v0.1.9, Dec 15, 2023)
