Wayward

Wayward is a Python package that helps to identify characteristic terms from single documents or groups of documents. It can be used for keyword extraction and several related tasks, and can create efficient sparse representations for classifiers. It was originally created to provide term weights for word clouds.

Rather than using simple term frequency to estimate the importance of words and phrases, it weights terms using statistical models known as parsimonious language models. These models are good at picking up the terms that distinguish a text document from other documents in a collection.
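To make the idea concrete, here is a minimal pure-Python sketch of the EM procedure described by Hiemstra et al. (2004). It is not Wayward's implementation: the function name `parsimonious_probs`, its parameters, and the fixed iteration count are assumptions made for illustration. Here `w` is taken to be the weight of the document model in the mixture with the background model.

```python
from collections import Counter


def parsimonious_probs(doc, background, w=0.1, iterations=50):
    """Illustrative EM for a parsimonious document model (not Wayward's code).

    doc: list of tokens for the document of interest
    background: list of tokens for the whole collection (should include doc)
    w: weight of the document model in the mixture (assumed parameter name)
    """
    tf = Counter(doc)
    bg_counts = Counter(background)
    bg_total = sum(bg_counts.values())
    # Background probabilities, only for terms that occur in the document
    p_bg = {t: bg_counts[t] / bg_total for t in tf}

    # Start from the maximum-likelihood estimate P(t|D) = tf(t) / |D|
    doc_len = sum(tf.values())
    p_doc = {t: n / doc_len for t, n in tf.items()}

    for _ in range(iterations):
        # E-step: expected count of each term attributed to the document model
        expected = {
            t: tf[t] * (w * p_doc[t]) / ((1 - w) * p_bg[t] + w * p_doc[t])
            for t in tf
        }
        # M-step: renormalize the expected counts into probabilities
        total = sum(expected.values())
        p_doc = {t: e / total for t, e in expected.items()}
    return p_doc


# Toy data, invented for this sketch
background = "the cat sat on the mat the dog sat on the log".split()
doc = "the cat sat on the mat".split()
probs = parsimonious_probs(doc, background)
```

In this toy example, "the" is the most frequent term in the document, yet it ends up with less probability mass than "cat", because the background model already explains it well.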

For this to work, a preferably large number of documents is needed to serve as a background collection against which the documents of interest are compared. This could be a random sample of newspaper articles, for instance, but for many applications it works better to take a natural collection, such as a periodical publication, and to fit the model for separate parts (e.g. individual issues, or yearly groups of issues).
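A small sketch of such a setup in plain Python: the issues of a periodical are grouped by year, while all issues together form the background collection. The record layout of `(year, tokens)` pairs, the helper name `group_by_year`, and the toy data are hypothetical.

```python
from collections import defaultdict


def group_by_year(records):
    """Group (year, tokens) records into {year: [tokens, tokens, ...]}."""
    groups = defaultdict(list)
    for year, tokens in records:
        groups[year].append(tokens)
    return dict(groups)


# Hypothetical toy data: one (year, tokens) pair per issue
records = [
    (1850, ["fog", "in", "london"]),
    (1850, ["the", "court", "of", "chancery"]),
    (1851, ["the", "great", "exhibition"]),
]

groups = group_by_year(records)
# Every issue contributes to the background collection
background_docs = [tokens for _, tokens in records]
```

Each value in `groups` is a list of token lists, which is the shape of input that a group of documents would have, and `background_docs` is the shape of a background corpus.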

See the References section for more information about parsimonious language models and their applications.

Wayward does not do visualization of word clouds. For that, you can paste its output into a tool like http://wordle.net or the IBM Word-Cloud Generator.
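As a rough sketch, term weights can be turned into plain text that is easy to paste. The `term:weight` line format and the scaling to integers are assumptions here, not something Wayward prescribes, so check what the chosen tool expects. The `top_terms` list mimics the (term, probability) pairs shown in the Usage section, and `to_weight_lines` is a hypothetical helper.

```python
def to_weight_lines(top_terms, scale=1000):
    """Format (term, probability) pairs as 'term:weight' lines (assumed format)."""
    return "\n".join(
        f"{term}:{round(prob * scale)}" for term, prob in top_terms
    )


# Example pairs, rounded from the kind of output shown under Usage
top_terms = [("lover", 0.1538), ("will", 0.1538), ("eyes", 0.0769)]
text = to_weight_lines(top_terms)
print(text)
```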

Installation

Either install the latest release from PyPI:

$ pip install wayward

or clone the git repository, and use Poetry to install the package in editable mode:

$ git clone https://github.com/aolieman/wayward.git
$ cd wayward/
$ poetry install

Usage

>>> import re
>>> quotes = [
...     "Love all, trust a few, Do wrong to none",
...     # (further quotes omitted in this listing)
...     "A lover's eyes will gaze an eagle blind."
...     "A lover's ear will hear the lowest sound.",
... ]
>>> doc_tokens = [
...     re.sub(r"[.,:;!?\"‘’]|'s\b", "", quote).lower().split()
...     for quote in quotes
... ]
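To see what this preprocessing does, here is the same regular expression applied to a single quote. The helper name `tokenize` is only used for illustration.

```python
import re


def tokenize(quote):
    """Strip punctuation and possessive 's, lowercase, and split on whitespace."""
    return re.sub(r"[.,:;!?\"‘’]|'s\b", "", quote).lower().split()


tokens = tokenize("A lover's eyes will gaze an eagle blind.")
print(tokens)
# ['a', 'lover', 'eyes', 'will', 'gaze', 'an', 'eagle', 'blind']
```

Removing the possessive `'s` is why "lover" rather than "lover's" shows up among the top terms.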

The ParsimoniousLM is initialized with all document tokens as a background corpus, and subsequently takes a single document's tokens as input. Its top() method returns the top terms and their probabilities:

>>> from wayward import ParsimoniousLM
>>> plm = ParsimoniousLM(doc_tokens, w=.1)
>>> plm.top(10, doc_tokens[-1])
[('lover', 0.1538461408077277),
 ('will', 0.1538461408077277),
 ('eyes', 0.0769230704038643),
 ('gaze', 0.0769230704038643),
 ('an', 0.0769230704038643),
 ('eagle', 0.0769230704038643),
 ('blind', 0.0769230704038643),
 ('ear', 0.0769230704038643),
 ('hear', 0.0769230704038643),
 ('lowest', 0.0769230704038643)]

The SignificantWordsLM is similarly initialized with a background corpus, but subsequently takes a group of document tokens as input. Its group_top method returns the top terms and their probabilities:

>>> from wayward import SignificantWordsLM
>>> swlm = SignificantWordsLM(doc_tokens, lambdas=(.7, .1, .2))
>>> swlm.group_top(10, doc_tokens[-2:], fix_lambdas=True)
[('much', 0.09077675276900632),
 ('lover', 0.06298706244865138),
 ('will', 0.06298706244865138),
 ('you', 0.04538837638450315),
 ('your', 0.04538837638450315),
 ('rhymes', 0.04538837638450315),
 ('speak', 0.04538837638450315),
 ('neither', 0.04538837638450315),
 ('rhyme', 0.04538837638450315),
 ('nor', 0.04538837638450315)]

See example/dickens.py for a runnable example with more realistic data.

Origin and Relaunch

This package started out as WeighWords, written by Lars Buitinck at the University of Amsterdam. It provides an efficient parsimonious LM implementation, and a very accessible API.

A recent innovation in language modeling, Significant Words Language Models, led to the addition of a two-way parsimonious language model to this package. This new version targets Python 3.x, and after a long slumber the package deserved a fresh name. The name "Wayward" was chosen because it is a near-homophone of WeighWords, and as a nod to parsimonious language modeling: it uncovers which terms "depart" most from the background collection. The parsimonization algorithm discounts terms that are already well explained by the background model, until the most wayward terms come out on top.

See the Changelog for an overview of the most important changes.

References

D. Hiemstra, S. Robertson, and H. Zaragoza (2004). Parsimonious Language Models for Information Retrieval. Proc. SIGIR'04.

R. Kaptein, D. Hiemstra, and J. Kamps (2010). How different are Language Models and word clouds? Proc. ECIR'10.

M. Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx (2016). Luhn Revisited: Significant Words Language Models. Proc. CIKM'16.

About

wayward.readthedocs.io

Latest release: Wayward v0.3.2 (Jun 9, 2019); 3 releases in total.
