NOTE: Sacremoses only supports Python 3 now (sacremoses>=0.0.41). If you're using Python 2, the last compatible version is sacremoses==0.0.40.
Usage (Python)
Tokenizer and Detokenizer
    >>> from sacremoses import MosesTokenizer, MosesDetokenizer
    >>> mt = MosesTokenizer(lang='en')
    >>> text = 'This, is a sentence with weird\xbb symbols\u2026 appearing everywhere\xbf'
    >>> expected_tokenized = 'This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf'
    >>> tokenized_text = mt.tokenize(text, return_str=True)
    >>> tokenized_text == expected_tokenized
    True

    >>> mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')
    >>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
    # Special characters are XML-escaped in the token list.
    >>> expected_tokens = ['This', 'ain', '&apos;t', 'funny', '.', 'It', '&apos;s', 'actually', 'hillarious', ',', 'yet', 'double', 'Ls', '.', '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', '&#91;', '&#93;', '&amp;', 'You', '&apos;re', 'gonna', 'shake', 'it', 'off', '?', 'Don', '&apos;t', '?']
    >>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
    >>> tokens = mt.tokenize(sent)
    >>> tokens == expected_tokens
    True
    >>> md.detokenize(tokens) == expected_detokens
    True
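The token list in the example above is XML-escaped: with default settings the tokenizer is understood to return the apostrophe, pipe, brackets, angle brackets and ampersand as entities. As a minimal sketch that does not need sacremoses installed, the standard library's `html.unescape` can map such tokens back to raw characters for inspection. The `escaped_tokens` sample is a hand-picked subset of the expected tokens, not output captured from the library.

```python
from html import unescape

# A hand-picked subset of escaped tokens, as they appear in the expected
# token list of the tokenizer example (not captured from a live run).
escaped_tokens = ["ain", "&apos;t", "&#124;", "&#91;", "&#93;", "&lt;", "&gt;", "&amp;"]

# html.unescape resolves both named and numeric character references.
raw_tokens = [unescape(token) for token in escaped_tokens]
print(raw_tokens)
```

If unescaped tokens are wanted directly, passing `escape=False` to `tokenize` should skip the escaping step; treat that keyword as an assumption to check against your installed version.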
Truecaser
    >>> from sacremoses import MosesTruecaser, MosesTokenizer

    # Train a new truecaser from a 'big.txt' file.
    >>> mtr = MosesTruecaser()
    >>> mtok = MosesTokenizer(lang='en')

    # Save the truecase model to 'big.truecasemodel' using `save_to`.
    >>> tokenized_docs = [mtok.tokenize(line) for line in open('big.txt')]
    >>> mtr.train(tokenized_docs, save_to='big.truecasemodel')

    # Save the truecase model to 'big.truecasemodel' after training
    # (just in case you forgot to use `save_to`).
    >>> mtr = MosesTruecaser()
    >>> mtr.train('big.txt')
    >>> mtr.save_model('big.truecasemodel')

    # Truecase a string after training a model.
    >>> mtr = MosesTruecaser()
    >>> mtr.train('big.txt')
    >>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
    ['the', 'adventures', 'of', 'Sherlock', 'Holmes']

    # Load a model and truecase a string using the trained model.
    >>> mtr = MosesTruecaser('big.truecasemodel')
    >>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
    ['the', 'adventures', 'of', 'Sherlock', 'Holmes']
    >>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True)
    'the ADVENTURES OF SHERLOCK HOLMES'
    >>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True, use_known=True)
    'the adventures of Sherlock Holmes'
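The train, save and load steps above can be folded into one helper that loads a saved model when it exists and trains one otherwise. This is a rough sketch, not part of sacremoses: `load_or_train_truecaser` and its parameters are hypothetical names, and it uses only the calls shown in the examples above (`MosesTruecaser()`, `MosesTruecaser(path)`, `train(..., save_to=...)` and `MosesTokenizer.tokenize`).

```python
import os


def load_or_train_truecaser(model_path, corpus_path, lang="en"):
    """Return a truecaser, loading model_path if present, else training one.

    Hypothetical helper for illustration; not provided by sacremoses.
    """
    # Imported lazily so that defining the helper does not require sacremoses.
    from sacremoses import MosesTokenizer, MosesTruecaser

    if os.path.exists(model_path):
        # Reuse a previously saved model.
        return MosesTruecaser(model_path)

    # Tokenize the corpus line by line, then train and save in one call.
    mtok = MosesTokenizer(lang=lang)
    with open(corpus_path, encoding="utf-8") as corpus:
        tokenized_docs = [mtok.tokenize(line) for line in corpus]
    mtr = MosesTruecaser()
    mtr.train(tokenized_docs, save_to=model_path)
    return mtr
```

A call such as `load_or_train_truecaser('big.truecasemodel', 'big.txt')` would train on the first run and load the saved file on later runs, assuming training writes the model to `save_to` as the example above suggests.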
Normalizer
    >>> from sacremoses import MosesPunctNormalizer
    >>> mpn = MosesPunctNormalizer()
    >>> mpn.normalize('THIS EBOOK IS OTHERWISE PROVIDED TO YOU “AS-IS.”')
    'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'
Usage (CLI)
Version 0.0.42 introduced a pipeline feature for the CLI, so there are global options that must be set before calling the commands:
language
processes
encoding
quiet
    $ pip install -U "sacremoses>=0.0.42"
    $ sacremoses --help
    Usage: sacremoses [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

    Options:
      -l, --language TEXT      Use language specific rules when tokenizing
      -j, --processes INTEGER  No. of processes.
      -e, --encoding TEXT      Specify encoding of file.
      -q, --quiet              Disable progress bar.
      --version                Show the version and exit.
      -h, --help               Show this message and exit.

    Commands:
      detokenize
      detruecase
      normalize
      tokenize
      train-truecase
      truecase
Pipeline
Example of chaining the following commands:
normalize with the -c option to remove control characters.
tokenize with the -a option for aggressive dash split rules.
truecase with the -a option to indicate that the model is for ASR:
  if big.truemodel exists, load the model with the -m option;
  otherwise, train a model and save it to the big.truemodel file with the -m option.
Redirect the console output to the big.txt.norm.tok.true file.
    cat big.txt | sacremoses -l en -j 4 \
        normalize -c tokenize -a truecase -a -m big.truemodel \
        > big.txt.norm.tok.true
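When the pipeline is driven from a script rather than a shell, the same command line can be assembled as an argument list, with the global options placed before the first command. The following is a minimal sketch using only the standard library; `build_pipeline_argv` and `run_pipeline` are hypothetical helper names, and the flags are the ones shown in the pipeline example.

```python
import shlex
import subprocess


def build_pipeline_argv(language, processes, steps):
    """Assemble the argv for a chained sacremoses call (hypothetical helper)."""
    # Global options come first, before any command.
    argv = ["sacremoses", "-l", language, "-j", str(processes)]
    # Each step is a (command, options) pair, appended in pipeline order.
    for command, options in steps:
        argv.append(command)
        argv.extend(options)
    return argv


def run_pipeline(argv, input_path, output_path):
    """Run the command with stdin and stdout bound to files (not called here)."""
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        subprocess.run(argv, stdin=src, stdout=dst, check=True)


steps = [
    ("normalize", ["-c"]),                        # remove control characters
    ("tokenize", ["-a"]),                         # aggressive dash split rules
    ("truecase", ["-a", "-m", "big.truemodel"]),  # ASR model, load or train
]
argv = build_pipeline_argv("en", 4, steps)
print(shlex.join(argv))
```

Calling `run_pipeline(argv, "big.txt", "big.txt.norm.tok.true")` would then stand in for the shell redirections, assuming the `sacremoses` executable is on the PATH.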
Tokenizer
    $ sacremoses tokenize --help
    Usage: sacremoses tokenize [OPTIONS]

    Options:
      -a, --aggressive-dash-splits   Triggers dash split rules.
      -x, --xml-escape               Escape special characters for XML.
      -p, --protected-patterns TEXT  Specify file with patters to be protected in
                                     tokenisation.
      -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,
                                     add prefixes to the default ones from the
                                     specified language.
      -h, --help                     Show this message and exit.

    $ sacremoses -l en -j 4 tokenize < big.txt > big.txt.tok
    100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 24363.39it/s]

    $ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns
    $ sacremoses -l en -j 4 tokenize -p basic-protected-patterns < big.txt > big.txt.tok
    100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s]
Detokenizer
    $ sacremoses detokenize --help
    Usage: sacremoses detokenize [OPTIONS]

    Options:
      -x, --xml-unescape  Unescape special characters for XML.
      -h, --help          Show this message and exit.

    $ sacremoses -l en -j 4 detokenize < big.txt.tok > big.txt.tok.detok
    100%|██████████████████████████████████| 128457/128457 [00:16<00:00, 7931.26it/s]
Truecase
    $ sacremoses truecase --help
    Usage: sacremoses truecase [OPTIONS]

    Options:
      -m, --modelfile TEXT          Filename to save/load the modelfile.
                                    [required]
      -a, --is-asr                  A flag to indicate that model is for ASR.
      -p, --possibly-use-first-token
                                    Use the first token as part of truecase
                                    training.
      -h, --help                    Show this message and exit.

    $ sacremoses -j 4 truecase -m big.model < big.txt.tok > big.txt.tok.true
    100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 14257.27it/s]
Detruecase
    $ sacremoses detruecase --help
    Usage: sacremoses detruecase [OPTIONS]

    Options:
      -j, --processes INTEGER  No. of processes.
      -a, --is-headline        Whether the file are headlines.
      -e, --encoding TEXT      Specify encoding of file.
      -h, --help               Show this message and exit.

    $ sacremoses -j 4 detruecase < big.txt.tok.true > big.txt.tok.true.detrue
    100%|█████████████████████████████████| 128457/128457 [00:04<00:00, 26945.16it/s]
Normalize
    $ sacremoses normalize --help
    Usage: sacremoses normalize [OPTIONS]

    Options:
      -q, --normalize-quote-commas  Normalize quotations and commas.
      -d, --normalize-numbers       Normalize number.
      -p, --replace-unicode-puncts  Replace unicode punctuations BEFORE
                                    normalization.
      -c, --remove-control-chars    Remove control characters AFTER
                                    normalization.
      -h, --help                    Show this message and exit.

    $ sacremoses -j 4 normalize < big.txt > big.txt.norm
    100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 13096.23it/s]
About
Python port of Moses tokenizer, truecaser and normalizer