soaxelbrooke/phrase


A CLI tool and server for learning significant phrase/term models, and efficiently labeling with them.

Installation

Download and extract the release archive for your OS, and put the phrase binary somewhere on the PATH (like /usr/local/bin). If you're using Linux, the GNU binary currently appears to be 5-10x faster than the musl version, so try that first.

For example, installing the Linux binary:

$ wget https://github.com/soaxelbrooke/phrase/releases/download/0.3.6/phrase-0.3.6-x86_64-unknown-linux-gnu.tar.gz
$ tar -xzvf phrase-0.3.6-x86_64-unknown-linux-gnu.tar.gz
$ sudo mv phrase /usr/local/bin/

Use

In general, using phrase falls into three steps:

  1. Counting n-grams
  2. Exporting scored models
  3. Significant term/phrase extraction/transform or model serving

N-gram counting is done continuously, by providing batches of documents as they come in. Model export reads all n-gram counts so far and calculates mutual information-based collocations; you can then deploy the models by shipping the binary and the data/scores_* files to a server. Labeling (identifying all significant terms and phrases in text) or transforming (eagerly replacing the longest phrases found in text) can be done either via the CLI or the web server. Providing labels for documents is not necessary for learning phrases, but it helps, and it also enables significant term labeling.
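For example, a minimal deployment sketch, assuming phrase serve reads the data/scores_* models relative to its working directory (the host and paths here are placeholders):

$ scp phrase user@myserver:/usr/local/bin/
$ scp data/scores_* user@myserver:/srv/phrase/data/
$ ssh user@myserver 'cd /srv/phrase && phrase serve'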

Training a phrase model

This example uses the assets/reviews.json data in the repo, 10k app reviews:

$ head -1 assets/reviews.json
{"body": "Woww! Moon Invoice is just so amazing. I don\u2019t think any such app exists that works so wonderfully. I am awestruck by the experience.", "category": "Business", "sentiment": "positive"}

First, you need to count n-grams from your data:

$ phrase count --mode json assets/reviews.json --textfield body --labelfield category --labelfield sentiment

(This creates n-gram count files at data/counts_*)
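A label can also be attached to a whole input with --label (see Labels below). A hypothetical example, assuming plaintext is the default mode for count as it is for transform, using an invented weather_reviews.txt with one document per line:

$ phrase count --label Weather weather_reviews.txt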

Then, you need to export scored phrase models:

$ phrase export

(This will create scored phrase models at data/scores_*)
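Export can be tuned with the environment variables described under Environment Variables below; for example, raising the significance thresholds (values here are illustrative):

$ MIN_COUNT=10 MIN_SCORE=0.2 phrase export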

Validating Learned Phrases

You can validate the phrases being learned per-label with the show command:

$ phrase show -n 3
Label=News
hash,ngram,score
3178089391134982486,New Yorker,0.5142287028163096
18070968419002659619,long form,0.5096737783425647
16180697492236521925,sleep timer,0.5047391214969927
Label=Business
hash,ngram,score
4727477585106156155,iTimePunch Plus,0.5574920344484112
483914742025992948,Crew Lounge,0.5479129370086021
11796198430323558093,black and white,0.5385891753319711
...

hash is the hash of the stemmed n-gram, and ngram is the canonical version of the n-gram, used for display purposes. For phrases, score is a combination of NPMI(phrase, tokens) and NPMI(n-gram, label); for single tokens it is just NPMI(n-gram, label).
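Here NPMI is normalized pointwise mutual information (Bouma, see Citations), which rescales PMI to the range [-1, 1]:

NPMI(x, y) = ln( p(x, y) / (p(x) p(y)) ) / -ln p(x, y)

A score near 1 means the parts almost always occur together, 0 means they are independent, and negative values mean they tend not to co-occur.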

Transforming Text

$ echo "The weather channel is great for when I want to check the weather!" | phrase transform --label Weather -
The Weather_Channel is great for when I_want_to check_the_weather!

Modes allow CSV, JSON, and plaintext (the default). CSV and JSON will maintain the rest of the document/row, but replace text in the specified --textfield fields (or in the text field if not specified).
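A hypothetical JSON-mode invocation, assuming transform accepts the same --mode and --textfield flags that count does:

$ phrase transform --mode json --textfield body --label Business assets/reviews.json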

Serving Scored Phrase Models

$ phrase serve

It also accepts --port and --host parameters.
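For example (values illustrative):

$ phrase serve --host 0.0.0.0 --port 8080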

API Routes

GET /labels - lists all available labels for extraction/labeling.

$ curl localhost:6220/labels
{"labels":["Social Networking","Travel","negative","Weather","positive","Business","News","neutral",null]}

POST /analyze - identifies all significant phrases and terms found in the provided documents.

$ curl -XPOST localhost:6220/analyze -d '{"documents": [{"labels": ["Weather", "positive", null], "text": "The weather channel is great for when I want to check the weather!"}]}'
[{"labels":["Weather","positive"],"ngrams":["I want","I want to","I want to check","Weather Channel","channel","check","check the weather","want to","want to check","want to check the weather","weather","when I want","when I want to"],"text":"The weather channel is great for when I want to check the weather!"}]

POST /transform - eagerly replaces the longest phrases found in the provided documents.

$ curl -XPOST localhost:6220/transform -d '{"documents": [{"labels": ["Weather"], "text": "The weather channel is great for when I want to check the weather!"}]}'
[{"label":"Weather","text":"The Weather_Channel is great for when I_want_to check_the_weather!"}]

Labels

Labels are used to learn significant single tokens and to aid in scoring significant phrases. While phrase can be used without providing labels, providing them allows it to learn more nuanced phrases, like those used by a specific community or when describing a specific product. Labels are generally provided in the label field of the input file, specified using the --labelfield argument, or given for a whole input with the --label argument.

Providing labels for your data causes phrase to count n-grams into separate bags per label, and during export allows it to calculate an extra significance score based on the label (instead of just co-occurrence). This means that a phrase unique to a label is much more likely to be picked up than if it were overshadowed in unlabeled data.

An example of a good label would be app category, as apps in each category are related and customer reviews talk about similar subjects. An example of a bad label would be user ID: it would have very high cardinality, cause very poor performance, and likely wouldn't yield useful phrases or terms due to data sparsity per user.

Performance

It's fast.

It takes 0.66 seconds to count 1- to 5-grams for 10,000 reviews, and ~1.2 seconds to export. Performance depends primarily on n-gram size, the number of labels, and vocabulary size. For example, labeling on iOS app category (23 labels) using default parameters on an Intel Core i7-7820HQ (Ubuntu):

Task                          Tokens per Second per Thread
Counting n-grams              779,025
Exporting scored models       206,704
Labeling significant terms    354,395
Phrase transformation         345,957

Note: Exports do not gain much from parallelization.

Environment Variables

A variety of environment variables can be used; a combined example follows this list:

ROCKET_ADDRESS - The address to serve on, defaults to localhost. (Other Rocket configs can also be assigned.)

ROCKET_PORT - The port to serve on, defaults to 6220.

LANG - Determines the stemmer language to use (ISO 639-1). Should be set automatically on Unix systems, but can be overridden.

TOKEN_REGEX - The regular expression used to find tokens when learning and labeling phrases.

CHUNK_SPLIT_REGEX - The regular expression used to detect chunk boundaries, across which phrases aren't learned.

HEAD_IGNORES / TAIL_IGNORES - Comma-separated tokens used to ignore phrases that start or end with them. For instance, TAIL_IGNORES=the would ignore 'I love the'.

PRUNE_AT - The size at which to prune the n-gram count mapping. Useful for limiting memory usage; default is 5000000.

PRUNE_TO - Controls what size n-gram mappings are pruned to during pruning. Also sets the number of n-grams that are saved after counting (sorted by count). Default is 2000000.

BATCH_SIZE - Controls the document batch size. Causes input streams to be batched, allowing larger-than-memory datasets. Default is 1000000.

MAX_NGRAM - The highest n-gram size to count; higher values cause slower counting, but allow for more specific and longer phrases. Default is 5.

MIN_NGRAM - The lowest n-gram size to export, default is 1 (unigrams).

MIN_COUNT - The minimum n-gram count for a phrase or token to be considered significant. Default is 5.

MIN_SCORE - The minimum NPMI score for a term or phrase to be considered significant. Default is 0.1.

MAX_EXPORT - The maximum size of exported models, per label.

NGRAM_DELIM - The delimiter used to join phrases when counting and scoring.
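For example, a hypothetical run combining several of the variables above (values illustrative):

$ MAX_NGRAM=3 MIN_COUNT=10 phrase count --mode json assets/reviews.json --textfield body
$ MIN_SCORE=0.2 phrase export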

Citations

Normalized (Pointwise) Mutual Information in Collocation Extraction - Gerlof Bouma

