aosingh/lexpyPublic

NotificationsYou must be signed in to change notification settings
Fork7
Star55

Python package for lexicon; Trie and DAWG implementation.

License

GPL-3.0 license

55 stars 7 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 104 Commits
.github/workflows		.github/workflows
images		images
lexpy		lexpy
.coveragerc		.coveragerc
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
lexpy_trie_dawg_nodes.png		lexpy_trie_dawg_nodes.png
lexpy_trie_dawg_time.png		lexpy_trie_dawg_time.png
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Repository files navigation

Lexpy

A lexicon is a data-structure which stores a set of words. The difference betweena dictionary and a lexicon is that in a lexicon there areno values associated with the words.
A lexicon is similar to a list or a set of words, but the internal representation is different and optimizedfor faster searches of words, prefixes and wildcard patterns.
Given a word, precisely, the search time is O(W) where W is the length of the word.
2 important lexicon data-structures areTrie andDirected Acyclic Word Graph (DAWG).

Install

lexpy can be installed via Python Package Index(PyPI) usingpip. The only installation requirement is that you need Python 3.7 or higher.

pip install lexpy

Interface

Interface Description	Trie	DAWG
Add a single word	`add('apple', count=2)`	`add('apple', count=2)`
Add multiple words	`add_all(['advantage', 'courage'])`	`add_all(['advantage', 'courage'])`
Check if exists?	`in` operator	`in` operator
Search using wildcard expression	`search('a?b*', with_count=True)`	`search('a?b*, with_count=True)`
Search for prefix matches	`search_with_prefix('bar', with_count=True)`	`search_with_prefix('bar')`
Search for similar words within given edit distance. Here, the notion of edit distance is same as Levenshtein distance	`search_within_distance('apble', dist=1, with_count=True)`	`search_within_distance('apble', dist=1, with_count=True)`
Get the number of nodes in the automaton	`len(trie)`	`len(dawg)`

Examples

Trie

Build from an input list, set, or tuple of words.

fromlexpyimportTrietrie=Trie()input_words= ['ampyx','abuzz','athie','athie','athie','amato','amato','aneto','aneto','aruba','arrow','agony','altai','alisa','acorn','abhor','aurum','albay','arbil','albin','almug','artha','algin','auric','sore','quilt','psychotic','eyes','cap','suit','tank','common','lonely','likeable''language','shock','look','pet','dime','small''dusty','accept','nasty','thrill','foot','steel','steel','steel','steel','abuzz']trie.add_all(input_words)# You can pass any sequence types or a file-like object hereprint(trie.get_word_count())>>>48

Build from a file or file path.

In the file, words should be newline separated.

fromlexpyimportTrie# Eithertrie=Trie()trie.add_all('/path/to/file.txt')# Orwithopen('/path/to/file.txt','r')asinfile:trie.add_all(infile)

Check if exists using the`in` operator

print('ampyx'intrie)>>>True

Prefix search

print(trie.search_with_prefix('ab'))>>> ['abhor','abuzz']

print(trie.search_with_prefix('ab',with_count=True))>>> [('abuzz',2), ('abhor',1)]

Wildcard search using`?` and`*`

? = 0 or 1 occurrence of any character
* = 0 or more occurrence of any character

print(trie.search('a*o*'))>>> ['amato','abhor','aneto','arrow','agony','acorn']print(trie.search('a*o*',with_count=True))>>> [('amato',2), ('abhor',1), ('aneto',2), ('arrow',1), ('agony',1), ('acorn',1)]print(trie.search('su?t'))>>> ['suit']print(trie.search('su?t',with_count=True))>>> [('suit',1)]

Search for similar words using the notion of Levenshtein distance

print(trie.search_within_distance('arie',dist=2))>>> ['athie','arbil','auric']print(trie.search_within_distance('arie',dist=2,with_count=True))>>> [('athie',3), ('arbil',1), ('auric',1)]

Increment word count

You can either add a new word or increment the counter for an existing word.

trie.add('athie',count=1000)print(trie.search_within_distance('arie',dist=2,with_count=True))>>> [('athie',1003), ('arbil',1), ('auric',1)]

Directed Acyclic Word Graph (DAWG)

DAWG supports the same set of operations as a Trie. The difference is the number of nodes in a DAWG is alwaysless than or equal to the number of nodes in Trie.
They both are Deterministic Finite State Automata. However, DAWG is a minimized version of the Trie DFA.
In a Trie, prefix redundancy is removed. In a DAWG, both prefix and suffix redundancies are removed.
In the current implementation of DAWG, the insertion order of the words should bealphabetical.
The implementation idea of DAWG is borrowed fromhttp://stevehanov.ca/blog/?id=115

fromlexpyimportTrie,DAWGtrie=Trie()trie.add_all(['advantageous','courageous'])dawg=DAWG()dawg.add_all(['advantageous','courageous'])len(trie)# Number of Nodes in Trie23dawg.reduce()# Perform DFA minimization. Call this every time a chunk of words are uploaded in DAWG.len(dawg)# Number of nodes in DAWG21

DAWG

The APIs are exactly same as the Trie APIs

Build a DAWG

fromlexpyimportDAWGdawg=DAWG()input_words= ['ampyx','abuzz','athie','athie','athie','amato','amato','aneto','aneto','aruba','arrow','agony','altai','alisa','acorn','abhor','aurum','albay','arbil','albin','almug','artha','algin','auric','sore','quilt','psychotic','eyes','cap','suit','tank','common','lonely','likeable''language','shock','look','pet','dime','small''dusty','accept','nasty','thrill','foot','steel','steel','steel','steel','abuzz']dawg.add_all(input_words)dawg.reduce()dawg.get_word_count()>>>48

Check if exists using the`in` operator

print('ampyx'indawg)>>>True

Prefix search

print(dawg.search_with_prefix('ab'))>>> ['abhor','abuzz']

print(dawg.search_with_prefix('ab',with_count=True))>>> [('abuzz',2), ('abhor',1)]

Wildcard search using`?` and`*`

? = 0 or 1 occurance of any character

* = 0 or more occurance of any character

print(dawg.search('a*o*'))>>> ['amato','abhor','aneto','arrow','agony','acorn']print(dawg.search('a*o*',with_count=True))>>> [('amato',2), ('abhor',1), ('aneto',2), ('arrow',1), ('agony',1), ('acorn',1)]print(dawg.search('su?t'))>>> ['suit']print(dawg.search('su?t',with_count=True))>>> [('suit',1)]

Search for similar words using the notion of Levenshtein distance

print(dawg.search_within_distance('arie',dist=2))>>> ['athie','arbil','auric']print(dawg.search_within_distance('arie',dist=2,with_count=True))>>> [('athie',3), ('arbil',1), ('auric',1)]

Alphabetical order insertion

If you insert a word which is lexicographically out-of-order,ValueError will be raised.

dawg.add('athie',count=1000)

ValueError

ValueError: Words should be inserted in Alphabetical order. <Previous word - thrill>, <Current word - athie>

Increment the word count

You can either add an alphabetically greater word with a specific count or increment the count of the previous added word.

dawg.add_all(['thrill']*20000)# or dawg.add('thrill', count=20000)print(dawg.search('thrill',with_count=True))>> [('thrill',20001)]

Special Characters

Special characters, except? and*, are matched literally.

fromlexpyimportTriet=Trie()t.add('a©')

t.search('a©')>> ['a©']

t.search('a?')>> ['a©']

t.search('?©')>> ['a©']

Trie vs DAWG

Future Work

These are some ideas which I would love to work on next in that order. Pull requests or discussions are invited.

Merge trie and DAWG features in one data structure
- Support all functionalities and still be as compressed as possible.
Serialization / Deserialization
- Pickle is definitely an option.
Server (TCP or HTTP) to serve queries over the network.

Fun Facts

The 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis is the longest English word that appears in a major dictionary.So for all english words, the search time is bounded by O(45).
The longest technical word(not in dictionary) is the name of a protein called astitin. It has 189,819letters and it is disputed whether it is a word.

About

Python package for lexicon; Trie and DAWG implementation.

Releases7

v1.1.0 Latest

Jun 10, 2024

+ 6 releases

Packages

No packages published

Contributors2

Languages

Python100.0%

Movatterモバイル変換

License

aosingh/lexpy

Folders and files

Latest commit

History

Repository files navigation

Lexpy

Install

Interface

Examples

Trie

Build from an input list, set, or tuple of words.

Build from a file or file path.

Check if exists using thein operator

Prefix search

Wildcard search using? and*

Search for similar words using the notion of Levenshtein distance

Increment word count

Directed Acyclic Word Graph (DAWG)

DAWG

Build a DAWG

Check if exists using thein operator

Prefix search

Wildcard search using? and*

Search for similar words using the notion of Levenshtein distance

Alphabetical order insertion

Increment the word count

Special Characters

Trie vs DAWG

Future Work

Fun Facts

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases7

Packages0

Uh oh!

Contributors2

Uh oh!

Languages

Check if exists using the`in` operator

Wildcard search using`?` and`*`

Check if exists using the`in` operator

Wildcard search using`?` and`*`

Packages