soaxelbrooke/phrase


A CLI tool and server for learning significant phrase/term models, and efficiently labeling with them.

Installation

Download and extract the release archive for your OS, and put the phrase binary somewhere on the PATH (like /usr/local/bin). If you're using Linux, the GNU binary currently appears to be 5-10x faster than the musl version, so try that first.

For example, installing the Linux binary:

$ wget https://github.com/soaxelbrooke/phrase/releases/download/0.3.6/phrase-0.3.6-x86_64-unknown-linux-gnu.tar.gz
$ tar -xzvf phrase-0.3.6-x86_64-unknown-linux-gnu.tar.gz
$ sudo mv phrase /usr/local/bin/

Use

In general, using phrase falls into three steps:

  1. Counting n-grams
  2. Exporting scored models
  3. Significant term/phrase extraction/transform or model serving

N-gram counting is done continuously, by providing batches of documents as they come in. Model export reads all n-gram counts so far and calculates mutual information-based collocations; you can then deploy the models by shipping the binary and the data/scores_* files to a server. Labeling (identifying all significant terms and phrases in text) or transforming (eagerly replacing the longest phrases found in text) can be done either via the CLI or the web server. Providing labels for documents is not necessary for learning phrases, but it helps, and it also enables significant term labeling.
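For example, a minimal deployment sketch, assuming phrase serve reads the data/scores_* models relative to its working directory (the host and paths here are placeholders):

$ scp phrase user@myserver:/usr/local/bin/
$ scp data/scores_* user@myserver:/srv/phrase/data/
$ ssh user@myserver 'cd /srv/phrase && phrase serve'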

Training a phrase model

This example uses the assets/reviews.json data in the repo, 10k app reviews:

$ head -1 assets/reviews.json
{"body": "Woww! Moon Invoice is just so amazing. I don\u2019t think any such app exists that works so wonderfully. I am awestruck by the experience.", "category": "Business", "sentiment": "positive"}

First, you need to count n-grams from your data:

$ phrase count --mode json assets/reviews.json --textfield body --labelfield category --labelfield sentiment

(This creates n-gram count files at data/counts_*)
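A label can also be attached to a whole input with --label (see Labels below). A hypothetical example, assuming plaintext is the default mode for count as it is for transform, using an invented weather_reviews.txt with one document per line:

$ phrase count --label Weather weather_reviews.txt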

Then, you need to export scored phrase models:

$ phrase export

(This will create scored phrase models at data/scores_*)
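Export can be tuned with the environment variables described under Environment Variables below; for example, raising the significance thresholds (values here are illustrative):

$ MIN_COUNT=10 MIN_SCORE=0.2 phrase export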

Validating Learned Phrases

You can validate the phrases being learned per-label with the show command:

$ phrase show -n 3
Label=News
hash,ngram,score
3178089391134982486,New Yorker,0.5142287028163096
18070968419002659619,long form,0.5096737783425647
16180697492236521925,sleep timer,0.5047391214969927
Label=Business
hash,ngram,score
4727477585106156155,iTimePunch Plus,0.5574920344484112
483914742025992948,Crew Lounge,0.5479129370086021
11796198430323558093,black and white,0.5385891753319711
...

hash is the hash of the stemmed n-gram, and ngram is the canonical version of the n-gram, used for display purposes. For phrases, score is a combination of NPMI(phrase, tokens) and NPMI(n-gram, label); for single tokens it is just NPMI(n-gram, label).
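Here NPMI is normalized pointwise mutual information (Bouma, see Citations), which rescales PMI to the range [-1, 1]:

NPMI(x, y) = ln( p(x, y) / (p(x) p(y)) ) / -ln p(x, y)

A score near 1 means the parts almost always occur together, 0 means they are independent, and negative values mean they tend not to co-occur.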

Transforming Text

$ echo "The weather channel is great for when I want to check the weather!" | phrase transform --label Weather -
The Weather_Channel is great for when I_want_to check_the_weather!

Modes allow CSV, JSON, and plaintext (the default). CSV and JSON will maintain the rest of the document/row, but replace text in the specified --textfield fields (or in the text field if not specified).
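A hypothetical JSON-mode invocation, assuming transform accepts the same --mode and --textfield flags that count does:

$ phrase transform --mode json --textfield body --label Business assets/reviews.json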

Serving Scored Phrase Models

$ phrase serve

It also accepts --port and --host parameters.
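For example (values illustrative):

$ phrase serve --host 0.0.0.0 --port 8080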

API Routes

GET /labels - lists all available labels for extraction/labeling.

$ curl localhost:6220/labels
{"labels":["Social Networking","Travel","negative","Weather","positive","Business","News","neutral",null]}

POST /analyze - identifies all significant phrases and terms found in the provided documents.

$ curl -XPOST localhost:6220/analyze -d '{"documents": [{"labels": ["Weather", "positive", null], "text": "The weather channel is great for when I want to check the weather!"}]}'
[{"labels":["Weather","positive"],"ngrams":["I want","I want to","I want to check","Weather Channel","channel","check","check the weather","want to","want to check","want to check the weather","weather","when I want","when I want to"],"text":"The weather channel is great for when I want to check the weather!"}]

POST /transform - eagerly replaces the longest phrases found in the provided documents.

$ curl -XPOST localhost:6220/transform -d '{"documents": [{"labels": ["Weather"], "text": "The weather channel is great for when I want to check the weather!"}]}'
[{"label":"Weather","text":"The Weather_Channel is great for when I_want_to check_the_weather!"}]

Labels

Labels are used to learn significant single tokens and to aid in scoring significant phrases. While phrase can be used without providing labels, providing them allows it to learn more nuanced phrases, like those used by a specific community or when describing a specific product. Labels are generally provided in the label field of the input file, specified using the --labelfield argument, or given for a whole input with the --label argument.

Providing labels for your data causes phrase to count n-grams into separate bags per label, and during export allows it to calculate an extra significance score based on the label (instead of just co-occurrence). This means that a phrase unique to a label is much more likely to be picked up than if it were overshadowed in unlabeled data.

An example of a good label would be app category, as apps in each category are related and customer reviews talk about similar subjects. An example of a bad label would be user ID: it would have very high cardinality, cause very poor performance, and likely wouldn't yield useful phrases or terms due to data sparsity per user.

Performance

It's fast.

It takes 0.66 seconds to count 1- to 5-grams for 10,000 reviews, and ~1.2 seconds to export. Performance depends primarily on n-gram size, the number of labels, and vocabulary size. For example, labeling on iOS app category (23 labels) using default parameters on an Intel Core i7-7820HQ (Ubuntu):

Task                          Tokens per Second per Thread
Counting n-grams              779,025
Exporting scored models       206,704
Labeling significant terms    354,395
Phrase transformation         345,957

Note: Exports do not gain much from parallelization.

Environment Variables

A variety of environment variables can be used; a combined example follows this list:

ROCKET_ADDRESS - The address to serve on, defaults to localhost. (Other Rocket configs can also be assigned.)

ROCKET_PORT - The port to serve on, defaults to 6220.

LANG - Determines the stemmer language to use (ISO 639-1). Should be set automatically on Unix systems, but can be overridden.

TOKEN_REGEX - The regular expression used to find tokens when learning and labeling phrases.

CHUNK_SPLIT_REGEX - The regular expression used to detect chunk boundaries, across which phrases aren't learned.

HEAD_IGNORES / TAIL_IGNORES - Comma-separated tokens used to ignore phrases that start or end with them. For instance, TAIL_IGNORES=the would ignore 'I love the'.

PRUNE_AT - The size at which to prune the n-gram count mapping. Useful for limiting memory usage; default is 5000000.

PRUNE_TO - Controls what size n-gram mappings are pruned to during pruning. Also sets the number of n-grams that are saved after counting (sorted by count). Default is 2000000.

BATCH_SIZE - Controls the document batch size. Causes input streams to be batched, allowing larger-than-memory datasets. Default is 1000000.

MAX_NGRAM - The highest n-gram size to count; higher values cause slower counting, but allow for more specific and longer phrases. Default is 5.

MIN_NGRAM - The lowest n-gram size to export, default is 1 (unigrams).

MIN_COUNT - The minimum n-gram count for a phrase or token to be considered significant. Default is 5.

MIN_SCORE - The minimum NPMI score for a term or phrase to be considered significant. Default is 0.1.

MAX_EXPORT - The maximum size of exported models, per label.

NGRAM_DELIM - The delimiter used to join phrases when counting and scoring.
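For example, a hypothetical run combining several of the variables above (values illustrative):

$ MAX_NGRAM=3 MIN_COUNT=10 phrase count --mode json assets/reviews.json --textfield body
$ MIN_SCORE=0.2 phrase export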

Citations

Normalized (Pointwise) Mutual Information in Collocation Extraction - Gerlof Bouma

