Python port of Moses tokenizer, truecaser and normalizer


alvations/sacremoses

 
 


License

MIT License.

Install

pip install -U sacremoses

NOTE: Sacremoses supports only Python 3 as of sacremoses>=0.0.41. If you are using Python 2, the last compatible version is sacremoses==0.0.40.

Usage (Python)

Tokenizer and Detokenizer

    >>> from sacremoses import MosesTokenizer, MosesDetokenizer
    >>> mt = MosesTokenizer(lang='en')
    >>> text = 'This, is a sentence with weird\xbb symbols\u2026 appearing everywhere\xbf'
    >>> expected_tokenized = 'This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf'
    >>> tokenized_text = mt.tokenize(text, return_str=True)
    >>> tokenized_text == expected_tokenized
    True

    >>> mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')
    >>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
    >>> expected_tokens = ['This', 'ain', '&apos;t', 'funny', '.', 'It', '&apos;s', 'actually', 'hillarious', ',', 'yet', 'double', 'Ls', '.', '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', '&#91;', '&#93;', '&amp;', 'You', '&apos;re', 'gonna', 'shake', 'it', 'off', '?', 'Don', '&apos;t', '?']
    >>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
    >>> tokens = mt.tokenize(sent)
    >>> tokens == expected_tokens
    True
    >>> md.detokenize(tokens) == expected_detokens
    True
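The tokens in the example above carry XML-style entities such as `&apos;` and `&#124;` in place of the literal characters. As a minimal sketch of what that escaping means, the snippet below reverses it using only the entities visible in `expected_tokens`. The names `MOSES_ESCAPES` and `unescape_token` are hypothetical helpers written for this illustration and are not part of sacremoses; the library may escape more characters than the ones listed here.

```python
# Entity table copied from the expected_tokens example above.
# Assumption: this list is illustrative, not the library's full table.
MOSES_ESCAPES = {
    "&": "&amp;",
    "|": "&#124;",
    "<": "&lt;",
    ">": "&gt;",
    "'": "&apos;",
    "[": "&#91;",
    "]": "&#93;",
}
# Reverse lookup: entity -> literal character.
MOSES_UNESCAPES = {entity: char for char, entity in MOSES_ESCAPES.items()}


def unescape_token(token):
    """Replace each known entity in a token with its literal character."""
    # Handle "&amp;" last so that "&amp;lt;" becomes "&lt;" rather than "<".
    for entity, char in MOSES_UNESCAPES.items():
        if entity != "&amp;":
            token = token.replace(entity, char)
    return token.replace("&amp;", "&")


tokens = ['ain', '&apos;t', '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', '&amp;']
print([unescape_token(t) for t in tokens])
```

In practice the detokenizer already undoes the escaping, as the `expected_detokens` comparison shows. The `tokenize` method is also understood to accept an `escape=False` argument that skips escaping altogether; check this against the installed version.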

Truecaser

    >>> from sacremoses import MosesTruecaser, MosesTokenizer

    # Train a new truecaser from a 'big.txt' file.
    >>> mtr = MosesTruecaser()
    >>> mtok = MosesTokenizer(lang='en')

    # Save the truecase model to 'big.truecasemodel' using `save_to`.
    >>> tokenized_docs = [mtok.tokenize(line) for line in open('big.txt')]
    >>> mtr.train(tokenized_docs, save_to='big.truecasemodel')

    # Save the truecase model to 'big.truecasemodel' after training
    # (just in case you forgot to use `save_to`).
    >>> mtr = MosesTruecaser()
    >>> mtr.train('big.txt')
    >>> mtr.save_model('big.truecasemodel')

    # Truecase a string after training a model.
    >>> mtr = MosesTruecaser()
    >>> mtr.train('big.txt')
    >>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
    ['the', 'adventures', 'of', 'Sherlock', 'Holmes']

    # Load a model and truecase a string using the trained model.
    >>> mtr = MosesTruecaser('big.truecasemodel')
    >>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
    ['the', 'adventures', 'of', 'Sherlock', 'Holmes']
    >>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True)
    'the ADVENTURES OF SHERLOCK HOLMES'
    >>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True, use_known=True)
    'the adventures of Sherlock Holmes'
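For readers new to truecasing, the idea behind the calls above can be sketched with a toy frequency model: record which surface form of each word is most common in the training text, then restore that form at prediction time. This is a simplified illustration only and not the algorithm sacremoses implements. The functions `train_casing` and `truecase` are hypothetical names, and lowercasing unknown words is a choice made for this sketch.

```python
from collections import Counter, defaultdict


def train_casing(tokenized_docs):
    """Count how often each surface form occurs for every lowercased word."""
    counts = defaultdict(Counter)
    for tokens in tokenized_docs:
        # Skip the first token: its capital letter usually marks the
        # sentence start rather than the word's natural casing.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return counts


def truecase(tokens, counts):
    """Use the most frequent form of each token; lowercase unknown words."""
    result = []
    for token in tokens:
        forms = counts.get(token.lower())
        result.append(forms.most_common(1)[0][0] if forms else token.lower())
    return result


docs = [
    "The adventures of Sherlock Holmes".split(),
    "I met Sherlock Holmes and heard of his adventures".split(),
]
model = train_casing(docs)
print(truecase("THE ADVENTURES OF SHERLOCK HOLMES".split(), model))
# prints ['the', 'adventures', 'of', 'Sherlock', 'Holmes']
```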

Normalizer

    >>> from sacremoses import MosesPunctNormalizer
    >>> mpn = MosesPunctNormalizer()
    >>> mpn.normalize('THIS EBOOK IS OTHERWISE PROVIDED TO YOU “AS-IS.”')
    'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'
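Punctuation normalization of this kind replaces typographic characters with plain ASCII equivalents. The sketch below shows the idea for quotation marks only, using the standard library. `QUOTE_TABLE` and `normalize_quotes` are hypothetical names for this illustration, and the real normalizer applies a far larger rule set.

```python
# Hypothetical, deliberately small mapping; the library applies many
# more substitutions than these four.
QUOTE_TABLE = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(text):
    """Map typographic quotation marks to their ASCII equivalents."""
    return text.translate(QUOTE_TABLE)


print(normalize_quotes("PROVIDED TO YOU \u201cAS-IS.\u201d"))
```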

Usage (CLI)

Version 0.0.42 introduced a pipeline feature for the CLI. As a result, there are global options that must be set before calling the commands:

  • language
  • processes
  • encoding
  • quiet

    $ pip install -U "sacremoses>=0.0.42"
    $ sacremoses --help
    Usage: sacremoses [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

    Options:
      -l, --language TEXT      Use language specific rules when tokenizing
      -j, --processes INTEGER  No. of processes.
      -e, --encoding TEXT      Specify encoding of file.
      -q, --quiet              Disable progress bar.
      --version                Show the version and exit.
      -h, --help               Show this message and exit.

    Commands:
      detokenize
      detruecase
      normalize
      tokenize
      train-truecase
      truecase

Pipeline

The following example chains these commands:

  • normalize with the -c option to remove control characters.
  • tokenize with the -a option for aggressive dash split rules.
  • truecase with the -a option to indicate that the model is for ASR:
    • if big.truemodel exists, load the model with the -m option,
    • otherwise train a model and save it with the -m option to the big.truemodel file.
  • save the output to the big.txt.norm.tok.true file.

    cat big.txt | sacremoses -l en -j 4 \
        normalize -c tokenize -a truecase -a -m big.truemodel \
        > big.txt.norm.tok.true
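When the pipeline has to be launched from a Python script instead of a shell, the same command can be assembled as an argument list and passed to `subprocess`. The following is a minimal sketch that uses only the flags shown above. `build_pipeline` and `run_pipeline` are hypothetical helper names, and `run_pipeline` assumes the `sacremoses` executable is on the `PATH`.

```python
import subprocess


def build_pipeline(lang="en", processes=4, model="big.truemodel"):
    """Return argv for: normalize -c, tokenize -a, truecase -a -m MODEL."""
    return [
        "sacremoses", "-l", lang, "-j", str(processes),
        "normalize", "-c",
        "tokenize", "-a",
        "truecase", "-a", "-m", model,
    ]


def run_pipeline(src_path, dst_path, **kwargs):
    """Stream src_path through the pipeline and write the result to dst_path."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        subprocess.run(build_pipeline(**kwargs), stdin=src, stdout=dst, check=True)


# Only builds and prints the command; nothing is executed here.
print(" ".join(build_pipeline()))
# prints sacremoses -l en -j 4 normalize -c tokenize -a truecase -a -m big.truemodel
```

Calling `run_pipeline("big.txt", "big.txt.norm.tok.true")` would then correspond to the shell command above. Passing a list instead of a shell string avoids quoting problems with file names.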

Tokenizer

    $ sacremoses tokenize --help
    Usage: sacremoses tokenize [OPTIONS]

    Options:
      -a, --aggressive-dash-splits   Triggers dash split rules.
      -x, --xml-escape               Escape special characters for XML.
      -p, --protected-patterns TEXT  Specify file with patters to be protected in
                                     tokenisation.
      -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,
                                     add prefixes to the default ones from the
                                     specified language.
      -h, --help                     Show this message and exit.

    $ sacremoses -l en -j 4 tokenize < big.txt > big.txt.tok
    100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 24363.39it/s

    $ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns
    $ sacremoses -l en -j 4 tokenize -p basic-protected-patterns < big.txt > big.txt.tok
    100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s

Detokenizer

    $ sacremoses detokenize --help
    Usage: sacremoses detokenize [OPTIONS]

    Options:
      -x, --xml-unescape  Unescape special characters for XML.
      -h, --help          Show this message and exit.

    $ sacremoses -l en -j 4 detokenize < big.txt.tok > big.txt.tok.detok
    100%|██████████████████████████████████| 128457/128457 [00:16<00:00, 7931.26it/s]

Truecase

    $ sacremoses truecase --help
    Usage: sacremoses truecase [OPTIONS]

    Options:
      -m, --modelfile TEXT            Filename to save/load the modelfile.
                                      [required]
      -a, --is-asr                    A flag to indicate that model is for ASR.
      -p, --possibly-use-first-token  Use the first token as part of truecase
                                      training.
      -h, --help                      Show this message and exit.

    $ sacremoses -j 4 truecase -m big.model < big.txt.tok > big.txt.tok.true
    100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 14257.27it/s]

Detruecase

    $ sacremoses detruecase --help
    Usage: sacremoses detruecase [OPTIONS]

    Options:
      -j, --processes INTEGER  No. of processes.
      -a, --is-headline        Whether the file are headlines.
      -e, --encoding TEXT      Specify encoding of file.
      -h, --help               Show this message and exit.

    $ sacremoses -j 4 detruecase < big.txt.tok.true > big.txt.tok.true.detrue
    100%|█████████████████████████████████| 128457/128457 [00:04<00:00, 26945.16it/s]

Normalize

    $ sacremoses normalize --help
    Usage: sacremoses normalize [OPTIONS]

    Options:
      -q, --normalize-quote-commas  Normalize quotations and commas.
      -d, --normalize-numbers       Normalize number.
      -p, --replace-unicode-puncts  Replace unicode punctuations BEFORE
                                    normalization.
      -c, --remove-control-chars    Remove control characters AFTER normalization.
      -h, --help                    Show this message and exit.

    $ sacremoses -j 4 normalize < big.txt > big.txt.norm
    100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 13096.23it/s]
