PmatchContainer
A class for performing pattern matching.
Probably the easiest way to perform pattern matching is with the functions hfst.compile_pmatch_expression and hfst.compile_pmatch_file.
Initialize a PmatchContainer.
Create a PmatchContainer based on definitions defs.
defs: A tuple of transducers in HFST_OLW_TYPE defining how pmatch is done.
An example:
If we have a file named streets.txt that contains:
```
define CapWord UppercaseAlpha Alpha* ;
define StreetWordFr [{avenue} | {boulevard} | {rue}] ;
define DeFr [ [{de} | {du} | {des} | {de la}] Whitespace ] | [{d'} | {l'}] ;
define StreetFr StreetWordFr (Whitespace DeFr) CapWord+ ;
regex StreetFr EndTag(FrenchStreetName) ;
```

and which has earlier been compiled and stored in the file streets.pmatch.hfst.ol:
```python
defs = hfst.compile_pmatch_file('streets.txt')
ostr = hfst.HfstOutputStream(filename='streets.pmatch.hfst.ol',
                             type=hfst.ImplementationType.HFST_OLW_TYPE)
for tr in defs:
    ostr.write(tr)
ostr.close()
```

we can read the pmatch definitions from the file and perform string matching with:
```python
istr = hfst.HfstInputStream('streets.pmatch.hfst.ol')
defs = []
while not istr.is_eof():
    defs.append(istr.read())
istr.close()
cont = hfst.PmatchContainer(defs)
assert cont.match("Je marche seul dans l'avenue des Ternes.") == \
    "Je marche seul dans l'<FrenchStreetName>avenue des Ternes</FrenchStreetName>."
```

See also: hfst.compile_pmatch_file, hfst.compile_pmatch_expression
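Since match() returns the input with each hit wrapped in an XML-style element named after the EndTag, the annotated string can be post-processed with ordinary string tools. A small pure-Python sketch (no hfst needed) that pulls the tagged substrings back out; the helper name extract_tagged is illustrative, not part of the hfst API:

```python
import re

def extract_tagged(annotated, tag):
    """Return the substrings wrapped in <tag>...</tag> in the annotated string."""
    return re.findall(r'<{0}>(.*?)</{0}>'.format(tag), annotated)

matched = "Je marche seul dans l'<FrenchStreetName>avenue des Ternes</FrenchStreetName>."
print(extract_tagged(matched, 'FrenchStreetName'))  # ['avenue des Ternes']
```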
Match the input string input.
todo
todo
todo
todo
The locations of pmatched strings for string input, where the results are limited as defined by time_cutoff and weight_cutoff.
input: The input string.
time_cutoff: Time cutoff, defaults to zero, i.e. no cutoff.
weight_cutoff: Weight cutoff, defaults to infinity, i.e. no cutoff.
Returns: A tuple of tuples of Location.
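A hypothetical sketch of consuming that nested result, using a pure-Python namedtuple as a stand-in for hfst's Location (the attribute names start, length, input, output, tag and weight are assumptions here, not taken from this page):

```python
from collections import namedtuple

# Illustrative stand-in for hfst's Location -- the real class is returned
# by PmatchContainer.locate(); the field names below are assumed.
Location = namedtuple('Location', 'start length input output tag weight')

def tagged_spans(locations, tag):
    """Collect (start, output) pairs for locations carrying the given tag."""
    return [(loc.start, loc.output)
            for group in locations   # locate() returns a tuple of tuples
            for loc in group
            if loc.tag == tag]

locs = ((Location(22, 17, 'avenue des Ternes', 'avenue des Ternes',
                  'FrenchStreetName', 0.0),),)
print(tagged_spans(locs, 'FrenchStreetName'))  # [(22, 'avenue des Ternes')]
```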
Tokenize input and return a list of tokens, i.e. strings.
input: The string to be tokenized.
Tokenize input and get a string representation of the tokenization (essentially the same as the command line tool hfst-tokenize would give).
input: The input string to be tokenized.
kwargs: Possible parameters are: output_format, max_weight_classes, dedupe, print_weights, print_all, time_cutoff, verbose, beam, tokenize_multichar.
output_format: The format of the output; possible values are tokenize, xerox, cg, finnpos, giellacg, conllu and visl, tokenize being the default.
max_weight_classes: Maximum number of best weight classes to output (where analyses with equal weight constitute a class), defaults to None, i.e. no limit.
dedupe: Whether duplicate analyses are removed, defaults to False.
print_weights: Whether weights are printed, defaults to False.
print_all: Whether nonmatching text is printed, defaults to False.
time_cutoff: Maximum number of seconds used per input after limiting the search.
verbose: Whether input is processed verbosely, defaults to True.
beam: Beam within which analyses must be to get printed.
tokenize_multichar: Whether input is tokenized into multicharacter symbols present in the transducer, defaults to False.