Python library for Natural Language Preprocessing (NLPre)

Natural Language Preprocessing (NLPre)

Major version update! NLPre 2.0.0

Backend NLP engine pattern.en has been replaced with spaCy v2.1.0. This is a major fix for some of the problems with pattern.en, including poor lemmatization (e.g. cytokine -> cytocow).
Support for Python 2 has been dropped.
Support for custom dictionaries in replace_from_dictionary.
Option for a suffix to be used instead of a prefix in replace_from_dictionary.
URL replacement can now remove emails.
token_replacement can remove symbols.

NLPre is a text (pre)-processing library that helps smooth some of the inconsistencies found in real-world data. Correcting for issues like random capitalization patterns, strange hyphenations, and abbreviations is an essential part of wrangling textual data, but it is often left to the user.

While this library was developed by the Office of Portfolio Analysis at the National Institutes of Health to correct for historical artifacts in our data, we envision this module addressing a broad spectrum of problems encountered in the preprocessing step of natural language processing.

NLPre is part of the word2vec-pipeline.

Installation

For the latest release, use

pip install nlpre

If installing the Python 3 version on Ubuntu, you may also need to run

sudo apt-get install libmysqlclient-dev

Example

from nlpre import titlecaps, dedash, identify_parenthetical_phrases
from nlpre import replace_acronyms, replace_from_dictionary

text = ("LYMPHOMA SURVIVORS IN KOREA. Describe the correlates of unmet needs "
        "among non-Hodgkin lymphoma (NHL) surv- ivors in Korea and identify "
        "NHL patients with an abnormal white blood cell count.")

# Build the abbreviation counter from the original text
ABBR = identify_parenthetical_phrases()(text)

# Each parser is a callable that takes a string and returns a string
parsers = [dedash(), titlecaps(), replace_acronyms(ABBR),
           replace_from_dictionary(prefix="MeSH_")]

for f in parsers:
    text = f(text)

print(text)

''' lymphoma survivors in korea .
    Describe the correlates of unmet needs among non_Hodgkin_lymphoma
    ( non_Hodgkin_lymphoma ) survivors in Korea and identify non_Hodgkin_lymphoma
    patients with an abnormal MeSH_Leukocyte_Count . '''
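The loop above treats every parser as a callable that takes a string and returns a string, so a pipeline is simply an ordered list of such callables. The following is a minimal sketch of that pattern using only the standard library. The names `collapse_whitespace`, `strip_edges`, and `run_pipeline` are hypothetical stand-ins for illustration and are not part of NLPre.

```python
import re
from functools import reduce
from typing import Callable, Iterable

# A "parser" here is any callable mapping str -> str, mirroring how the
# example applies dedash(), titlecaps(), etc. one after another.
Parser = Callable[[str], str]


def collapse_whitespace(text: str) -> str:
    """Hypothetical stand-in parser: squeeze runs of whitespace to one space."""
    return re.sub(r"\s+", " ", text)


def strip_edges(text: str) -> str:
    """Hypothetical stand-in parser: drop leading and trailing whitespace."""
    return text.strip()


def run_pipeline(text: str, parsers: Iterable[Parser]) -> str:
    """Apply each parser in order, feeding each output into the next."""
    return reduce(lambda doc, parser: parser(doc), parsers, text)


result = run_pipeline("  LYMPHOMA   SURVIVORS \n IN KOREA.  ",
                      [collapse_whitespace, strip_edges])
print(result)  # LYMPHOMA SURVIVORS IN KOREA.
```

Order matters in such a pipeline: in the example above, the abbreviation counter is built from the original text before any parser modifies it.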

A longer example highlighting a "pipeline" of changes can be found in long_example.md.

To see a detailed log of the changes made, set the logging level to logging.INFO or logging.DEBUG:

import logging
import nlpre

nlpre.logger.setLevel(logging.INFO)
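For context, `setLevel` follows the standard `logging` module semantics: a logger emits only records at or above its level. The sketch below shows this with a stand-in logger. The logger name `demo_preprocessor` and the messages are hypothetical and do not reflect what NLPre actually logs.

```python
import io
import logging

# Stand-in logger; NLPre exposes its own as `nlpre.logger`.
logger = logging.getLogger("demo_preprocessor")
logger.propagate = False  # keep the demo output out of the root logger

# Capture output in memory so the effect of each level is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
logger.addHandler(handler)

logger.setLevel(logging.INFO)
logger.info("replaced 3 acronyms")   # emitted: INFO >= INFO
logger.debug("token-level detail")   # suppressed: DEBUG < INFO

logger.setLevel(logging.DEBUG)
logger.debug("token-level detail")   # now emitted

captured = stream.getvalue().splitlines()
print(captured)
```

If no messages appear in your own script, a handler may need to be configured, for example with `logging.basicConfig()`. Whether NLPre installs a handler itself is not stated here.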

What's included?

Function	Description
replace_from_dictionary	Replaces phrases using an input dictionary. Matching is case-insensitive, and punctuation is handled correctly. The MeSH (Medical Subject Headings) dictionary is built in. Example: `(11-Dimethylethyl)-4-methoxyphenol is great` -> `MeSH_Butylated_Hydroxyanisole is great`
replace_acronyms	Replaces acronyms and abbreviations found in a document with their corresponding phrase. If an acronym is explicitly identified with a phrase in a document, then all instances of that acronym in the document are replaced with the given phrase. If the document gives no explicit indication of what the phrase is, the most common phrase associated with the acronym in the given counter is used. Example: `The EPA protects trees` -> `The Environmental_Protection_Agency protects trees`
identify_parenthetical_phrases	Identifies abbreviations of phrases found in parentheses. Returns a counter that can be passed directly into `replace_acronyms`. Example: `Environmental Protection Agency (EPA)` -> `Counter({(('Environmental', 'Protection', 'Agency'), 'EPA'): 1})`
separated_parenthesis	Separates parenthetical content into new sentences. This is useful when creating word embeddings, as associations should only be made within the same sentence. A terminal period is added to parenthetical sentences if necessary. Example: `Hello (it is a beautiful day) world.` -> `Hello world. it is a beautiful day .`
pos_tokenizer	Removes all words of a designated part-of-speech (POS) from a document. For example, when processing medical text, it is useful to remove all words that are not nouns or adjectives. POS detection is provided by the `spaCy` module. Example: `The boy threw the ball into the yard` -> `boy ball yard`
unidecoder	Converts Unicode phrases into their ASCII equivalents. Example: `α-Helix β-sheet` -> `a-Helix b-sheet`
dedash	Hyphenations are sometimes erroneously inserted when text is passed through a word processor. This module attempts to correct the hyphenation pattern by joining the word fragments if the joined result appears in an English word list. Example: `How is the treat- ment going` -> `How is the treatment going`
decaps_text	We presume that case is important, but only when it differs from title case. This class normalizes capitalization patterns. Example: `James and Sally had a fMRI` -> `james and sally had a fMRI`
titlecaps	Documents sometimes have sentences that are entirely in uppercase (commonly found in titles and abstracts of older documents). This parser identifies sentences where every word is uppercase, and returns the document with these sentences converted to lowercase. Example: `ON THE STRUCTURE OF WATER.` -> `On the structure of water .`
token_replacement	Simple token replacement. Example: `Observed > 20%` -> `Observed greater-than 20 percent`
separate_reference	Separates and optionally removes references that have been concatenated onto words. Example: `Key feature of interleukin-1 in Drosophila3-5 and elegans(7).` -> `Key feature of interleukin-1 in Drosophila and elegans .`
url_replacement	Removes or replaces URLs. Example: `The source code is [here](www.github.com/NIHOPA/NLPre/).` -> `The source code is [here](LINK).`
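To make two of the ideas in the table concrete, here is a simplified sketch of hyphenation repair against a word list and of parenthetical abbreviation counting. This is not NLPre's implementation: `dedash_sketch`, `find_abbreviations_sketch`, and the tiny `WORDS` vocabulary are hypothetical, and the abbreviation matcher assumes one phrase word per abbreviation letter.

```python
import re
from collections import Counter

# Tiny illustrative vocabulary; a real word list would be far larger.
WORDS = {"treatment", "survivors", "going"}


def dedash_sketch(text, vocabulary=WORDS):
    """Join 'treat- ment' style fragments when the joined form is a known word."""
    def join(match):
        candidate = match.group(1) + match.group(2)
        # Merge only when the result is in the vocabulary; otherwise keep as-is.
        return candidate if candidate.lower() in vocabulary else match.group(0)

    return re.sub(r"(\w+)-\s+(\w+)", join, text)


def find_abbreviations_sketch(text):
    """Count (phrase_tokens, abbreviation) pairs such as 'Some Long Name (SLN)'.

    Simplification: the phrase is the N words right before the parenthesis,
    where N is the abbreviation length, and the initials must match exactly.
    """
    counts = Counter()
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        preceding = text[:match.start()].split()
        phrase = tuple(preceding[-len(abbr):])
        if len(phrase) == len(abbr) and all(
            word[0].upper() == letter for word, letter in zip(phrase, abbr)
        ):
            counts[(phrase, abbr)] += 1
    return counts


fixed = dedash_sketch("How is the treat- ment going")
found = find_abbreviations_sketch(
    "The Environmental Protection Agency (EPA) protects trees")
print(fixed)  # How is the treatment going
print(found)
```

The one-word-per-letter rule keeps the sketch short but misses cases like "non-Hodgkin lymphoma (NHL)", where a hyphenated word contributes two letters. A production matcher needs a more flexible alignment between letters and words.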

Citations and Acknowledgments

He, Jian and Chaomei Chen. Predictive Effects of Novelty Measured by Temporal Embeddings on the Growth of Scientific Literature. Frontiers in Research Metrics and Analytics, 3, 9. (2018).
He, Jian and Chaomei Chen. Temporal Representations of Citations for Understanding the Changing Roles of Scientific Publications. Front. Res. Metr. Anal. (2018).
Galea, Dieter et al. Sub-word information in pre-trained biomedical word representations: evaluation and hyper-parameter optimization. BioNLP (2018).

License

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

Releases

Latest release: spaCy backend (Mar 19, 2019); 5 releases in total.
