- Notifications
You must be signed in to change notification settings - Fork7
Python package for lexicon; Trie and DAWG implementation.
License
aosingh/lexpy
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
A lexicon is a data-structure which stores a set of words. The difference betweena dictionary and a lexicon is that in a lexicon there areno values associated with the words.
A lexicon is similar to a list or a set of words, but the internal representation is different and optimizedfor faster searches of words, prefixes and wildcard patterns.
Given a word, precisely, the search time is O(W) where W is the length of the word.
2 important lexicon data-structures areTrie andDirected Acyclic Word Graph (DAWG).
lexpy
can be installed via Python Package Index(PyPI)
usingpip
. The only installation requirement is that you need Python 3.7 or higher.
pip install lexpy
Interface Description | Trie | DAWG |
---|---|---|
Add a single word | add('apple', count=2) | add('apple', count=2) |
Add multiple words | add_all(['advantage', 'courage']) | add_all(['advantage', 'courage']) |
Check if exists? | in operator | in operator |
Search using wildcard expression | search('a?b*', with_count=True) | search('a?b*, with_count=True) |
Search for prefix matches | search_with_prefix('bar', with_count=True) | search_with_prefix('bar') |
Search for similar words within given edit distance. Here, the notion of edit distance is same as Levenshtein distance | search_within_distance('apble', dist=1, with_count=True) | search_within_distance('apble', dist=1, with_count=True) |
Get the number of nodes in the automaton | len(trie) | len(dawg) |
fromlexpyimportTrietrie=Trie()input_words= ['ampyx','abuzz','athie','athie','athie','amato','amato','aneto','aneto','aruba','arrow','agony','altai','alisa','acorn','abhor','aurum','albay','arbil','albin','almug','artha','algin','auric','sore','quilt','psychotic','eyes','cap','suit','tank','common','lonely','likeable''language','shock','look','pet','dime','small''dusty','accept','nasty','thrill','foot','steel','steel','steel','steel','abuzz']trie.add_all(input_words)# You can pass any sequence types or a file-like object hereprint(trie.get_word_count())>>>48
In the file, words should be newline separated.
fromlexpyimportTrie# Eithertrie=Trie()trie.add_all('/path/to/file.txt')# Orwithopen('/path/to/file.txt','r')asinfile:trie.add_all(infile)
print('ampyx'intrie)>>>True
print(trie.search_with_prefix('ab'))>>> ['abhor','abuzz']
print(trie.search_with_prefix('ab',with_count=True))>>> [('abuzz',2), ('abhor',1)]
?
= 0 or 1 occurrence of any character*
= 0 or more occurrence of any character
print(trie.search('a*o*'))>>> ['amato','abhor','aneto','arrow','agony','acorn']print(trie.search('a*o*',with_count=True))>>> [('amato',2), ('abhor',1), ('aneto',2), ('arrow',1), ('agony',1), ('acorn',1)]print(trie.search('su?t'))>>> ['suit']print(trie.search('su?t',with_count=True))>>> [('suit',1)]
print(trie.search_within_distance('arie',dist=2))>>> ['athie','arbil','auric']print(trie.search_within_distance('arie',dist=2,with_count=True))>>> [('athie',3), ('arbil',1), ('auric',1)]
- You can either add a new word or increment the counter for an existing word.
trie.add('athie',count=1000)print(trie.search_within_distance('arie',dist=2,with_count=True))>>> [('athie',1003), ('arbil',1), ('auric',1)]
DAWG supports the same set of operations as a Trie. The difference is the number of nodes in a DAWG is alwaysless than or equal to the number of nodes in Trie.
They both are Deterministic Finite State Automata. However, DAWG is a minimized version of the Trie DFA.
In a Trie, prefix redundancy is removed. In a DAWG, both prefix and suffix redundancies are removed.
In the current implementation of DAWG, the insertion order of the words should bealphabetical.
The implementation idea of DAWG is borrowed fromhttp://stevehanov.ca/blog/?id=115
fromlexpyimportTrie,DAWGtrie=Trie()trie.add_all(['advantageous','courageous'])dawg=DAWG()dawg.add_all(['advantageous','courageous'])len(trie)# Number of Nodes in Trie23dawg.reduce()# Perform DFA minimization. Call this every time a chunk of words are uploaded in DAWG.len(dawg)# Number of nodes in DAWG21
The APIs are exactly same as the Trie APIs
fromlexpyimportDAWGdawg=DAWG()input_words= ['ampyx','abuzz','athie','athie','athie','amato','amato','aneto','aneto','aruba','arrow','agony','altai','alisa','acorn','abhor','aurum','albay','arbil','albin','almug','artha','algin','auric','sore','quilt','psychotic','eyes','cap','suit','tank','common','lonely','likeable''language','shock','look','pet','dime','small''dusty','accept','nasty','thrill','foot','steel','steel','steel','steel','abuzz']dawg.add_all(input_words)dawg.reduce()dawg.get_word_count()>>>48
print('ampyx'indawg)>>>True
print(dawg.search_with_prefix('ab'))>>> ['abhor','abuzz']
print(dawg.search_with_prefix('ab',with_count=True))>>> [('abuzz',2), ('abhor',1)]
?
= 0 or 1 occurance of any character
*
= 0 or more occurance of any character
print(dawg.search('a*o*'))>>> ['amato','abhor','aneto','arrow','agony','acorn']print(dawg.search('a*o*',with_count=True))>>> [('amato',2), ('abhor',1), ('aneto',2), ('arrow',1), ('agony',1), ('acorn',1)]print(dawg.search('su?t'))>>> ['suit']print(dawg.search('su?t',with_count=True))>>> [('suit',1)]
print(dawg.search_within_distance('arie',dist=2))>>> ['athie','arbil','auric']print(dawg.search_within_distance('arie',dist=2,with_count=True))>>> [('athie',3), ('arbil',1), ('auric',1)]
If you insert a word which is lexicographically out-of-order,ValueError
will be raised.
dawg.add('athie',count=1000)
ValueError
ValueError: Words should be inserted in Alphabetical order. <Previous word - thrill>, <Current word - athie>
- You can either add an alphabetically greater word with a specific count or increment the count of the previous added word.
dawg.add_all(['thrill']*20000)# or dawg.add('thrill', count=20000)print(dawg.search('thrill',with_count=True))>> [('thrill',20001)]
Special characters, except?
and*
, are matched literally.
fromlexpyimportTriet=Trie()t.add('a©')
t.search('a©')>> ['a©']
t.search('a?')>> ['a©']
t.search('?©')>> ['a©']
These are some ideas which I would love to work on next in that order. Pull requests or discussions are invited.
- Merge trie and DAWG features in one data structure
- Support all functionalities and still be as compressed as possible.
- Serialization / Deserialization
- Pickle is definitely an option.
- Server (TCP or HTTP) to serve queries over the network.
- The 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis is the longest English word that appears in a major dictionary.So for all english words, the search time is bounded by O(45).
- The longest technical word(not in dictionary) is the name of a protein called astitin. It has 189,819letters and it is disputed whether it is a word.
About
Python package for lexicon; Trie and DAWG implementation.