- Notifications
You must be signed in to change notification settings - Fork7
Python package for lexicon; Trie and DAWG implementation.
License
aosingh/lexpy
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
A lexicon is a data-structure which stores a set of words. The difference betweena dictionary and a lexicon is that in a lexicon there areno values associated with the words.
A lexicon is similar to a list or a set of words, but the internal representation is different and optimizedfor faster searches of words, prefixes and wildcard patterns.
Given a word, precisely, the search time is O(W) where W is the length of the word.
2 important lexicon data-structures areTrie andDirected Acyclic Word Graph (DAWG).
lexpy
can be installed via Python Package Index(PyPI)
usingpip
. The only installation requirement is that you need Python 3.7 or higher.
pip install lexpy
Interface Description | Trie | DAWG |
---|---|---|
Add a single word | add('apple', count=2) | add('apple', count=2) |
Add multiple words | add_all(['advantage', 'courage']) | add_all(['advantage', 'courage']) |
Check if exists? | in operator | in operator |
Search using wildcard expression | search('a?b*', with_count=True) | search('a?b*, with_count=True) |
Search for prefix matches | search_with_prefix('bar', with_count=True) | search_with_prefix('bar') |
Search for similar words within given edit distance. Here, the notion of edit distance is same as Levenshtein distance | search_within_distance('apble', dist=1, with_count=True) | search_within_distance('apble', dist=1, with_count=True) |
Get the number of nodes in the automaton | len(trie) | len(dawg) |
fromlexpyimportTrietrie=Trie()input_words= ['ampyx','abuzz','athie','athie','athie','amato','amato','aneto','aneto','aruba','arrow','agony','altai','alisa','acorn','abhor','aurum','albay','arbil','albin','almug','artha','algin','auric','sore','quilt','psychotic','eyes','cap','suit','tank','common','lonely','likeable''language','shock','look','pet','dime','small''dusty','accept','nasty','thrill','foot','steel','steel','steel','steel','abuzz']trie.add_all(input_words)# You can pass any sequence types or a file-like object hereprint(trie.get_word_count())>>>48
In the file, words should be newline separated.
fromlexpyimportTrie# Eithertrie=Trie()trie.add_all('/path/to/file.txt')# Orwithopen('/path/to/file.txt','r')asinfile:trie.add_all(infile)
print('ampyx'intrie)>>>True
print(trie.search_with_prefix('ab'))>>> ['abhor','abuzz']
print(trie.search_with_prefix('ab',with_count=True))>>> [('abuzz',2), ('abhor',1)]
?
= 0 or 1 occurrence of any character*
= 0 or more occurrence of any character
print(trie.search('a*o*'))>>> ['amato','abhor','aneto','arrow','agony','acorn']print(trie.search('a*o*',with_count=True))>>> [('amato',2), ('abhor',1), ('aneto',2), ('arrow',1), ('agony',1), ('acorn',1)]print(trie.search('su?t'))>>> ['suit']print(trie.search('su?t',with_count=True))>>> [('suit',1)]
print(trie.search_within_distance('arie',dist=2))>>> ['athie','arbil','auric']print(trie.search_within_distance('arie',dist=2,with_count=True))>>> [('athie',3), ('arbil',1), ('auric',1)]
- You can either add a new word or increment the counter for an existing word.
trie.add('athie',count=1000)print(trie.search_within_distance('arie',dist=2,with_count=True))>>> [('athie',1003), ('arbil',1), ('auric',1)]
DAWG supports the same set of operations as a Trie. The difference is the number of nodes in a DAWG is alwaysless than or equal to the number of nodes in Trie.
They both are Deterministic Finite State Automata. However, DAWG is a minimized version of the Trie DFA.
In a Trie, prefix redundancy is removed. In a DAWG, both prefix and suffix redundancies are removed.
In the current implementation of DAWG, the insertion order of the words should bealphabetical.
The implementation idea of DAWG is borrowed fromhttp://stevehanov.ca/blog/?id=115
fromlexpyimportTrie,DAWGtrie=Trie()trie.add_all(['advantageous','courageous'])dawg=DAWG()dawg.add_all(['advantageous','courageous'])len(trie)# Number of Nodes in Trie23dawg.reduce()# Perform DFA minimization. Call this every time a chunk of words are uploaded in DAWG.len(dawg)# Number of nodes in DAWG21
The APIs are exactly same as the Trie APIs
fromlexpyimportDAWGdawg=DAWG()input_words= ['ampyx','abuzz','athie','athie','athie','amato','amato','aneto','aneto','aruba','arrow','agony','altai','alisa','acorn','abhor','aurum','albay','arbil','albin','almug','artha','algin','auric','sore','quilt','psychotic','eyes','cap','suit','tank','common','lonely','likeable''language','shock','look','pet','dime','small''dusty','accept','nasty','thrill','foot','steel','steel','steel','steel','abuzz']dawg.add_all(input_words)dawg.reduce()dawg.get_word_count()>>>48
print('ampyx'indawg)>>>True
print(dawg.search_with_prefix('ab'))>>> ['abhor','abuzz']
print(dawg.search_with_prefix('ab',with_count=True))>>> [('abuzz',2), ('abhor',1)]
?
= 0 or 1 occurance of any character
*
= 0 or more occurance of any character
print(dawg.search('a*o*'))>>> ['amato','abhor','aneto','arrow','agony','acorn']print(dawg.search('a*o*',with_count=True))>>> [('amato',2), ('abhor',1), ('aneto',2), ('arrow',1), ('agony',1), ('acorn',1)]print(dawg.search('su?t'))>>> ['suit']print(dawg.search('su?t',with_count=True))>>> [('suit',1)]
print(dawg.search_within_distance('arie',dist=2))>>> ['athie','arbil','auric']print(dawg.search_within_distance('arie',dist=2,with_count=True))>>> [('athie',3), ('arbil',1), ('auric',1)]
If you insert a word which is lexicographically out-of-order,ValueError
will be raised.
dawg.add('athie',count=1000)
ValueError
ValueError: Words should be inserted in Alphabetical order. <Previous word - thrill>, <Current word - athie>
- You can either add an alphabetically greater word with a specific count or increment the count of the previous added word.
dawg.add_all(['thrill']*20000)# or dawg.add('thrill', count=20000)print(dawg.search('thrill',with_count=True))>> [('thrill',20001)]
Special characters, except?
and*
, are matched literally.
fromlexpyimportTriet=Trie()t.add('a©')
t.search('a©')>> ['a©']
t.search('a?')>> ['a©']
t.search('?©')>> ['a©']
These are some ideas which I would love to work on next in that order. Pull requests or discussions are invited.
- Merge trie and DAWG features in one data structure
- Support all functionalities and still be as compressed as possible.
- Serialization / Deserialization
- Pickle is definitely an option.
- Server (TCP or HTTP) to serve queries over the network.
- The 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis is the longest English word that appears in a major dictionary.So for all english words, the search time is bounded by O(45).
- The longest technical word(not in dictionary) is the name of a protein called astitin. It has 189,819letters and it is disputed whether it is a word.
About
Python package for lexicon; Trie and DAWG implementation.
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.
Contributors2
Uh oh!
There was an error while loading.Please reload this page.