bjascob/LemmInflect

License

MIT license

LemmInflect

A python module for English lemmatization and inflection.

About

LemmInflect uses a dictionary approach to lemmatize English words and inflect them into forms specified by a user-supplied Universal Dependencies or Penn Treebank tag. The library works with out-of-vocabulary (OOV) words by applying neural network techniques to classify word forms and choose the appropriate morphing rules.

The system acts as a standalone module or as an extension to the spaCy NLP system.

The dictionary and morphology rules are derived from the NIH's SPECIALIST Lexicon, which contains an extensive set of information on English word forms.

A simpler, inflection-only system is available as pyInflect. LemmInflect was created to address some of the shortcomings of that project and add features, such as...

Independence from the spaCy lemmatizer
Neural nets to disambiguate out-of-vocabulary morphology
Unigrams to disambiguate spellings and multiple word forms

Documentation

For the latest documentation, see ReadTheDocs.

Accuracy of the Lemmatizer

The accuracy of LemmInflect and several other popular NLP utilities was tested using the Automatically Generated Inflection Database (AGID) as a baseline. This is not a "gold" standard dataset, but it has an extensive list of lemmas and their corresponding inflections and appears to be generally a "good" set for testing. Each inflection was lemmatized by the test software and then compared to the original value in the corpus. The test included 119,194 different inflected words.

| Package                      | Verb  |  Noun | ADJ/ADV | Overall |  Speed   |
|-----------------------------------------------------------------------------|
| LemmInflect 0.2.3            | 96.1% | 95.4% |  93.9%  |  95.6%  |  42.0 uS |
| Stanza 1.5.0 + CoreNLP 4.5.4 | 94.0% | 96.4% |  93.1%  |  95.5%  |  30.0 uS |
| spaCy 3.5.0                  | 79.5% | 88.9% |  60.5%  |  84.7%  | 393.0 uS |
| NLTK 3.8.1                   | 53.3% | 52.2% |  53.3%  |  52.6%  |  12.0 uS |
|-----------------------------------------------------------------------------|

Speed is in microseconds per lemma, measured on an i9-7940x CPU. Note that since Stanza makes calls to the Java CoreNLP software, all 120K test cases were grouped into a single call. For spaCy, all pipeline components were disabled except the lemmatizer. The high per-lemma time is probably a reflection of the general overhead of the pipeline architecture.
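The test procedure described above (lemmatize each inflection, compare it to the expected lemma, and time the run) can be sketched roughly as follows. This is only an illustration, not the project's actual test script: the `evaluate` function, the `strip_s` stand-in lemmatizer, and the four test cases are all hypothetical.

```python
import time


def evaluate(lemmatize, test_cases):
    """Return (accuracy, microseconds per lemma) for a lemmatizer callable.

    lemmatize  : hypothetical callable taking (word, upos), returning a lemma string
    test_cases : iterable of (inflected_word, upos, expected_lemma) tuples
    """
    test_cases = list(test_cases)
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    # Time only the lemmatization calls, not the scoring
    start = time.perf_counter()
    predictions = [lemmatize(word, upos) for word, upos, _ in test_cases]
    elapsed = time.perf_counter() - start
    # A prediction counts as correct only on an exact match
    correct = sum(
        pred == expected
        for pred, (_, _, expected) in zip(predictions, test_cases)
    )
    accuracy = correct / len(test_cases)
    usec_per_lemma = elapsed / len(test_cases) * 1e6
    return accuracy, usec_per_lemma


def strip_s(word, upos):
    """Toy stand-in lemmatizer (not LemmInflect): drops a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


# Made-up test cases in the (inflection, tag, lemma) shape described above
cases = [
    ("dogs", "NOUN", "dog"),
    ("watches", "VERB", "watch"),
    ("runs", "VERB", "run"),
    ("mice", "NOUN", "mouse"),
]
accuracy, usec = evaluate(strip_s, cases)
print(f"accuracy={accuracy:.1%}  speed={usec:.1f} uS per lemma")
```

Any of the packages in the table could be wrapped in a callable with the same `(word, upos)` shape and passed to `evaluate` in place of the toy function.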

Requirements and Installation

The only external requirement to run LemmInflect is numpy, which is used for the matrix math that drives the neural nets. These nets are relatively small and don't require significant CPU power to run.

To install, run:

pip3 install lemminflect

The project was built and tested under Python 3 and Ubuntu but should run on any Linux, Windows, or Mac system. It is untested under Python 2 but may function in that environment with minimal or no changes.

The code base also includes library functions and scripts to create the various data files and neural nets. This includes such things as...

Unigram extraction from the Gutenberg and Billion Word Corpora
Python scripts for loading and parsing the SPECIALIST Lexicon
Neural network training based on Keras and Tensorflow

None of these are required for run-time operation. However, if you want to modify the system, see the documentation for more info.

Library Usage

To lemmatize a word, use the method getLemma(). This takes a word and a Universal Dependencies tag and returns the lemmas as a list of possible spellings. The dictionary system is used first, and if no lemma is found, the rules system is employed.

> from lemminflect import getLemma
> getLemma('watches', upos='VERB')
('watch',)
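The "dictionary first, rules second" order can be illustrated with a small standalone sketch. Everything here is a toy assumption: the `TOY_LEMMA_DICT` and `TOY_SUFFIX_RULES` tables and the `toy_get_lemma` function are invented for illustration, and the fixed suffix list stands in for the neural network that the library uses to choose OOV rules.

```python
# Toy lookup table, invented for illustration (the real library derives its
# data from the SPECIALIST Lexicon).
TOY_LEMMA_DICT = {
    ("watches", "VERB"): ("watch",),
    ("went", "VERB"): ("go",),
    ("mice", "NOUN"): ("mouse",),
}

# (suffix, replacement) pairs tried in order. A fixed list like this is a
# simplification of the library's learned rule selection.
TOY_SUFFIX_RULES = {
    "VERB": [("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("es", ""), ("s", "")],
}


def toy_get_lemma(word, upos):
    """Return a tuple of lemmas: dictionary lookup first, then suffix rules."""
    # Step 1: known words come straight from the dictionary
    entry = TOY_LEMMA_DICT.get((word.lower(), upos))
    if entry is not None:
        return entry
    # Step 2: out-of-vocabulary words fall back to the first matching rule
    for suffix, replacement in TOY_SUFFIX_RULES.get(upos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return (word[: len(word) - len(suffix)] + replacement,)
    # Step 3: nothing matched, so return the word unchanged
    return (word,)


print(toy_get_lemma("watches", "VERB"))    # found in the dictionary
print(toy_get_lemma("xxwatched", "VERB"))  # handled by the "ed" rule
```

Irregular forms such as "went" only work here because they are in the dictionary, which is why the lookup runs before the rules.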

To inflect words, use the method getInflection(). This takes a lemma and a Penn Treebank tag and returns a tuple of the specific inflection(s) associated with that tag. As above, the dictionary is used first and then inflection rules are applied if needed.

> from lemminflect import getInflection
> getInflection('watch', tag='VBD')
('watched',)
> getInflection('xxwatch', tag='VBD')
('xxwatched',)

The library provides lower-level functions to access the dictionary and the OOV rules directly. For a detailed description, see Lemmatizer or Inflections.

Usage as a spaCy Extension

To use as an extension, you need spaCy version 2.0 or later. Versions 1.9 and earlier do not support the extension methods used here.

To set up the extension, first import lemminflect. This will create new lemma and inflect methods for each spaCy Token. The methods operate similarly to the methods described above, with the exception that a string is returned, containing the most common spelling, rather than a tuple.

> import spacy
> import lemminflect
> nlp = spacy.load('en_core_web_sm')
> doc = nlp('I am testing this example.')
> doc[2]._.lemma()
test
> doc[4]._.inflect('NNS')
examples
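The extension methods return a single string where the library functions return a tuple of spellings. One way such a choice could be made is to rank the candidates by unigram frequency, which is a plausible reading of the "unigrams to disambiguate spellings" feature listed in the About section. The sketch below is an assumption about the idea, not the library's code: `most_common_spelling` and the counts are made up.

```python
# Made-up unigram counts for illustration only
TOY_UNIGRAM_COUNTS = {
    "arthroplasties": 120,
    "arthroplastys": 2,
}


def most_common_spelling(candidates, counts):
    """Pick the candidate with the highest unigram count.

    Unseen spellings count as 0. On a tie, the earliest candidate wins,
    because max() returns the first maximal item.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda form: counts.get(form, 0))


spellings = ("arthroplastys", "arthroplasties", "arthroplastyes")
print(most_common_spelling(spellings, TOY_UNIGRAM_COUNTS))
```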

Issues

If you find a bug, please report it on the GitHub issues list. However, be aware that when it comes to returning the correct inflection, there are a number of different types of issues that can arise. Some of these are not readily fixable. Issues with inflected forms include...

Multiple spellings for an inflection (e.g. arthroplasties, arthroplastyes or arthroplastys)
Mass form and plural types (e.g. people vs peoples)
Forms that depend on context (e.g. further vs farther)
Inflections that are not fully specified by the tag (e.g. be/VBD can be "was" or "were")

One common issue is that some forms of the verb "be" are not completely specified by the treebank tag. For instance, be/VBD inflects to either "was" or "were" and be/VBP inflects to either "am" or "are". In order to disambiguate these forms, other words in the sentence need to be inspected. At this time, LemmInflect doesn't include this functionality.
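Callers who need this today have to do the disambiguation themselves. A minimal sketch of the idea, assuming the subject is a personal pronoun that the caller has already identified: the `choose_be_form` helper below is hypothetical and not part of LemmInflect, and it does not handle noun subjects.

```python
# Pronouns that take "was" in the past tense
SINGULAR_PAST = {"i", "he", "she", "it"}
# Pronouns that take "were" (past) and "are" (present)
PLURAL_OR_SECOND = {"you", "we", "they"}


def choose_be_form(tag, subject_pronoun):
    """Pick the form of "be" for a Penn Treebank tag and a subject pronoun.

    Only VBD (past) and VBP (non-3rd-person-singular present) are handled,
    since these are the ambiguous tags discussed above.
    """
    pronoun = subject_pronoun.lower()
    if tag == "VBD":
        if pronoun in SINGULAR_PAST:
            return "was"
        if pronoun in PLURAL_OR_SECOND:
            return "were"
    elif tag == "VBP":
        if pronoun == "i":
            return "am"
        if pronoun in PLURAL_OR_SECOND:
            return "are"
    raise ValueError(f"cannot resolve be/{tag} for subject {subject_pronoun!r}")


print(choose_be_form("VBD", "they"))
print(choose_be_form("VBP", "I"))
```

In a spaCy pipeline, the subject could be located through the dependency parse before calling a helper like this one.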

Releases

LemmInflect 0.2.3 (latest): Oct 2, 2022
