alvations/sacremoses (public repository, forked from hplt-project/sacremoses)


Python port of Moses tokenizer, truecaser and normalizer


Sacremoses

License

MIT License.

Install

pip install -U sacremoses

NOTE: Sacremoses only supports Python 3 now (sacremoses>=0.0.41). If you're using Python 2, the last compatible version is sacremoses==0.0.40.
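The version constraint in the note can also be checked from Python. The following is a minimal sketch using only the standard library; the helper names `requirement_for` and `installed_sacremoses_version` are hypothetical and not part of sacremoses.

```python
import sys
from importlib import metadata


def requirement_for(python_major):
    """Return the pip requirement string that matches the note above."""
    # Python 2 has to stay on the last release that still supported it.
    return "sacremoses==0.0.40" if python_major < 3 else "sacremoses>=0.0.41"


def installed_sacremoses_version():
    """Return the installed sacremoses version, or None if it is not installed."""
    try:
        return metadata.version("sacremoses")
    except metadata.PackageNotFoundError:
        return None


print(requirement_for(sys.version_info.major))
print(installed_sacremoses_version())
```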

Usage (Python)

Tokenizer and Detokenizer

>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mt = MosesTokenizer(lang='en')
>>> text = 'This, is a sentence with weird\xbb symbols\u2026 appearing everywhere\xbf'
>>> expected_tokenized = 'This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf'
>>> tokenized_text = mt.tokenize(text, return_str=True)
>>> tokenized_text == expected_tokenized
True

>>> mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')
>>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
>>> expected_tokens = ['This', 'ain', '&apos;t', 'funny', '.', 'It', '&apos;s', 'actually', 'hillarious', ',', 'yet', 'double', 'Ls', '.', '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', '&#91;', '&#93;', '&amp;', 'You', '&apos;re', 'gonna', 'shake', 'it', 'off', '?', 'Don', '&apos;t', '?']
>>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
>>> tokens = mt.tokenize(sent)
>>> tokens == expected_tokens
True
>>> md.detokenize(tokens) == expected_detokens
True
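The tokens in `expected_tokens` contain character references such as `&apos;` and `&#124;` because special characters are escaped in the tokenized output. As a small illustration, and not sacremoses's own unescaping logic, the standard library's `html.unescape` can turn such tokens back into readable characters for inspection:

```python
from html import unescape

# A few tokens copied from `expected_tokens` above.
escaped_tokens = ['ain', '&apos;t', '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', '&amp;']

# html.unescape resolves both named and numeric character references.
readable_tokens = [unescape(token) for token in escaped_tokens]
print(readable_tokens)
# ['ain', "'t", '|', '[', ']', '<', '>', '&']
```

For real detokenization, `MosesDetokenizer.detokenize` shown above remains the right tool, since it also restores spacing.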

Truecaser

>>> from sacremoses import MosesTruecaser, MosesTokenizer

# Train a new truecaser from a 'big.txt' file.
>>> mtr = MosesTruecaser()
>>> mtok = MosesTokenizer(lang='en')

# Save the truecase model to 'big.truecasemodel' using `save_to`.
>>> tokenized_docs = [mtok.tokenize(line) for line in open('big.txt')]
>>> mtr.train(tokenized_docs, save_to='big.truecasemodel')

# Save the truecase model to 'big.truecasemodel' after training
# (just in case you forgot to use `save_to`).
>>> mtr = MosesTruecaser()
>>> mtr.train('big.txt')
>>> mtr.save_model('big.truecasemodel')

# Truecase a string after training a model.
>>> mtr = MosesTruecaser()
>>> mtr.train('big.txt')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']

# Load a model and truecase a string using the trained model.
>>> mtr = MosesTruecaser('big.truecasemodel')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True)
'the ADVENTURES OF SHERLOCK HOLMES'
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True, use_known=True)
'the adventures of Sherlock Holmes'

Normalizer

>>> from sacremoses import MosesPunctNormalizer
>>> mpn = MosesPunctNormalizer()
>>> mpn.normalize('THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."')
'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'

Usage (CLI)

Version 0.0.42 introduced a pipeline feature for the CLI, so there are global options that should be set before calling the commands:

language
processes
encoding
quiet

$ pip install -U "sacremoses>=0.0.42"
$ sacremoses --help
Usage: sacremoses [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

Options:
  -l, --language TEXT      Use language specific rules when tokenizing
  -j, --processes INTEGER  No. of processes.
  -e, --encoding TEXT      Specify encoding of file.
  -q, --quiet              Disable progress bar.
  --version                Show the version and exit.
  -h, --help               Show this message and exit.

Commands:
  detokenize
  detruecase
  normalize
  tokenize
  train-truecase
  truecase

Pipeline

Example to chain the following commands:

normalize with the -c option to remove control characters.
tokenize with the -a option for aggressive dash split rules.
truecase with the -a option to indicate that the model is for ASR:
- if big.truemodel exists, load the model with the -m option,
- otherwise train a model and save it to the big.truemodel file with the -m option.
redirect the console output to the big.txt.norm.tok.true file.

cat big.txt | sacremoses -l en -j 4 \
    normalize -c tokenize -a truecase -a -m big.truemodel \
    > big.txt.norm.tok.true
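When the pipeline is driven from a script, it can help to assemble the chained command programmatically. The sketch below only builds and prints the argument list; it does not run sacremoses. The helper `build_pipeline` is hypothetical, and it assumes the ordering shown above: global options first, then each command followed by its own options.

```python
import shlex


def build_pipeline(language, processes, steps):
    """Build the argv list for a chained `sacremoses` call.

    `steps` is a list of (command, [args]) pairs, applied left to right.
    """
    # Global options come before the first command.
    argv = ["sacremoses", "-l", language, "-j", str(processes)]
    for command, args in steps:
        argv.append(command)
        argv.extend(args)
    return argv


argv = build_pipeline(
    "en",
    4,
    [
        ("normalize", ["-c"]),
        ("tokenize", ["-a"]),
        ("truecase", ["-a", "-m", "big.truemodel"]),
    ],
)
print(shlex.join(argv))
# sacremoses -l en -j 4 normalize -c tokenize -a truecase -a -m big.truemodel
```

The resulting list could then be passed to `subprocess.run` with `stdin` and `stdout` bound to open file handles for `big.txt` and `big.txt.norm.tok.true`.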

Tokenizer

$ sacremoses tokenize --help
Usage: sacremoses tokenize [OPTIONS]

Options:
  -a, --aggressive-dash-splits   Triggers dash split rules.
  -x, --xml-escape               Escape special characters for XML.
  -p, --protected-patterns TEXT  Specify file with patters to be protected in
                                 tokenisation.
  -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,
                                 add prefixes to the default ones from the
                                 specified language.
  -h, --help                     Show this message and exit.

$ sacremoses -l en -j 4 tokenize < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 24363.39it/s]

$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns
$ sacremoses -l en -j 4 tokenize -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s]
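To give an intuition for what protecting patterns during tokenization means, here is a conceptual sketch using only the standard library. It is not sacremoses's implementation: the stand-in tokenizer, the `PROTECTED` placeholder scheme, and the URL regex are all assumptions made for illustration, and the regex is not necessarily one of the entries in the basic-protected-patterns file.

```python
import re


def tokenize_protected(text, patterns):
    """Naive tokenizer that leaves spans matching `patterns` untouched."""
    saved = {}

    def stash(match):
        # Swap the matched span for an alphanumeric placeholder token.
        key = "PROTECTED{}".format(len(saved))
        saved[key] = match.group(0)
        return " " + key + " "

    for pattern in patterns:
        text = re.sub(pattern, stash, text)

    # Stand-in tokenizer: runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)

    # Put the original spans back in place of their placeholders.
    return [saved.get(token, token) for token in tokens]


print(tokenize_protected("Visit http://example.com/a?b=1 now!", [r"https?://\S+"]))
# ['Visit', 'http://example.com/a?b=1', 'now', '!']
```

Without the pattern, the same stand-in tokenizer would split the URL at every punctuation mark.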

Detokenizer

$ sacremoses detokenize --help
Usage: sacremoses detokenize [OPTIONS]

Options:
  -x, --xml-unescape  Unescape special characters for XML.
  -h, --help          Show this message and exit.

$ sacremoses -l en -j 4 detokenize < big.txt.tok > big.txt.tok.detok
100%|██████████████████████████████████| 128457/128457 [00:16<00:00, 7931.26it/s]

Truecase

$ sacremoses truecase --help
Usage: sacremoses truecase [OPTIONS]

Options:
  -m, --modelfile TEXT            Filename to save/load the modelfile.
                                  [required]
  -a, --is-asr                    A flag to indicate that model is for ASR.
  -p, --possibly-use-first-token  Use the first token as part of truecase
                                  training.
  -h, --help                      Show this message and exit.

$ sacremoses -j 4 truecase -m big.model < big.txt.tok > big.txt.tok.true
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 14257.27it/s]

Detruecase

$ sacremoses detruecase --help
Usage: sacremoses detruecase [OPTIONS]

Options:
  -j, --processes INTEGER  No. of processes.
  -a, --is-headline        Whether the file are headlines.
  -e, --encoding TEXT      Specify encoding of file.
  -h, --help               Show this message and exit.

$ sacremoses -j 4 detruecase < big.txt.tok.true > big.txt.tok.true.detrue
100%|█████████████████████████████████| 128457/128457 [00:04<00:00, 26945.16it/s]

Normalize

$ sacremoses normalize --help
Usage: sacremoses normalize [OPTIONS]

Options:
  -q, --normalize-quote-commas  Normalize quotations and commas.
  -d, --normalize-numbers       Normalize number.
  -p, --replace-unicode-puncts  Replace unicode punctuations BEFORE
                                normalization.
  -c, --remove-control-chars    Remove control characters AFTER normalization.
  -h, --help                    Show this message and exit.

$ sacremoses -j 4 normalize < big.txt > big.txt.norm
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 13096.23it/s]
