AutoPhrase: Automated Phrase Mining from Massive Text Corpora

License

Apache-2.0 license

Publications

Please cite the following two papers if you are using our tools. Thanks!

Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, Jiawei Han, "Automated Phrase Mining from Massive Text Corpora", accepted by IEEE Transactions on Knowledge and Data Engineering, Feb. 2018.
Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han, "Mining Quality Phrases from Massive Text Corpora", Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15), Melbourne, Australia, May 2015. (* equally contributed)

Recent Changes

2020.06.14

Updated the Docker image to match the git master branch.

2018.03.04

Fixed a few bugs in the pre-processing and post-processing, i.e., in Tokenizer.java. Previously, when the corpus contained characters like /, the results could be wrong or errors could occur.
When the phrasal segmentation is serving new text, phrases provided in the knowledge base (wiki_quality.txt) whose tokens were all seen in the training corpus now receive a score of 1.0. Previously, the score was effectively infinite.

2017.10.23

Support extremely large corpora (e.g., 4GB or more). Please uncomment the // define LARGE line at the beginning of src/utils/parameters.h before you run AutoPhrase on such a large corpus.
Quality phrases provided in the knowledge base (where every token is seen in the raw corpus) will be incorporated during the phrasal segmentation, even if their frequencies are smaller than MIN_SUP.
Stopwords will be treated as low-quality single-word phrases.
Model files are saved separately. Please check the variable MODEL in both auto_phrase.sh and phrasal_segmentation.sh.
The end of line is also a separator for sentence splitting.

New Features

(compared to SegPhrase)

Minimized Human Effort. We develop a robust positive-only distant training method to estimate the phrase quality by leveraging existing general knowledge bases.
Support Multiple Languages: English, Spanish, and Chinese. The language in the input will be automatically detected.
High Accuracy. We propose a POS-guided phrasal segmentation model that incorporates POS tags when a POS tagger is available. Meanwhile, the new framework is able to extract single-word quality phrases.
High Efficiency. A better indexing and an almost lock-free parallelization are implemented, which lead to both running time speedup and memory saving.

Related GitHub Repositories

Requirements

Linux or MacOS with g++ and Java installed.

Ubuntu:

g++ 4.8: $ sudo apt-get install g++-4.8
Java 8: $ sudo apt-get install openjdk-8-jdk
curl: $ sudo apt-get install curl

MacOS:

g++ 6: $ brew install gcc6
Java 8: $ brew update; brew tap caskroom/cask; brew install Caskroom/cask/java

Default Run

Phrase Mining Step

$ ./auto_phrase.sh

The default run will download an English corpus from the server of our data mining group and run AutoPhrase to produce 3 ranked lists of phrases as well as 2 segmentation model files under the MODEL directory (i.e., models/DBLP).

AutoPhrase.txt: the unified ranked list for both single-word phrases and multi-word phrases.
AutoPhrase_multi-words.txt: the sub-ranked list for multi-word phrases only.
AutoPhrase_single-word.txt: the sub-ranked list for single-word phrases only.
segmentation.model: AutoPhrase's segmentation model (saved for later use).
token_mapping.txt: the token mapping file for the tokenizer (saved for later use).
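The ranked lists can be filtered with standard command-line tools. The sketch below assumes that each line holds a score and a phrase separated by a tab; that layout is an assumption and is not stated above, so inspect your own output first. The function name top_phrases, the threshold 0.8, and the demo scores are illustrative only.

```shell
# top_phrases FILE THRESHOLD
# Prints the phrases whose score is at least THRESHOLD.
# Assumption: each line of FILE is "<score><TAB><phrase>".
top_phrases() {
    awk -F '\t' -v min="$2" '$1 + 0 >= min + 0 { print $2 }' "$1"
}

# Demo on a small hand-made file (the scores are made up for illustration).
demo=$(mktemp)
printf '0.95\tdata mining\n0.40\tof the\n0.88\tsupport vector machine\n' > "$demo"
top_phrases "$demo" 0.8
rm -f "$demo"
```

Against a real run, the first argument would be a file such as models/DBLP/AutoPhrase.txt.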

You can change RAW_TRAIN to your own corpus, and you may also want to change MODEL to a different name.
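As a rough illustration of how such settings can fall back to defaults, the sketch below uses the shell's ${VAR:-default} expansion. The function resolve_settings is hypothetical and only stands in for the variable block at the top of a script; DEFAULT_CORPUS is a placeholder, not a real path, and whether auto_phrase.sh reads these variables from the environment should be checked in the script itself.

```shell
# Hypothetical stand-in for a script's settings block.
resolve_settings() {
    # ${VAR:-default} keeps an existing non-empty value and falls back otherwise.
    MODEL=${MODEL:-models/DBLP}
    RAW_TRAIN=${RAW_TRAIN:-DEFAULT_CORPUS}   # placeholder, not a real path
    printf 'MODEL=%s RAW_TRAIN=%s\n' "$MODEL" "$RAW_TRAIN"
}

# No overrides: the defaults apply.
( unset MODEL RAW_TRAIN; resolve_settings )

# Values set by the caller take precedence over the defaults.
( MODEL=models/MyModel; RAW_TRAIN=data/input.txt; resolve_settings )
```

Editing the two variables directly in the script, as described above, has the same effect.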

Phrasal Segmentation

We also provide an auxiliary function to highlight the phrases in context based on our phrasal segmentation model. There are two thresholds you can tune at the top of the script. The model can also handle unknown tokens (i.e., tokens that did not occur in the corpus used in the phrase mining step).

First, you need to specify AutoPhrase's segmentation model, i.e., MODEL. The default value is set to be consistent with auto_phrase.sh.

$ ./phrasal_segmentation.sh

The segmentation results will also be put under the MODEL directory (i.e., models/DBLP/segmentation.txt). The highlighted phrases will be enclosed in phrase tags (e.g., <phrase>data mining</phrase>).
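Because the highlighted phrases are wrapped in plain tags, they can be pulled back out of the segmented text with standard tools. The following is a minimal sketch, not part of AutoPhrase itself: the function name extract_phrases and the sample sentences are illustrative, and it assumes phrase tags are never nested.

```shell
# extract_phrases FILE
# Prints "<count><TAB><phrase>" for every tagged phrase, most frequent first.
# Assumption: <phrase>...</phrase> spans are not nested.
extract_phrases() {
    grep -o '<phrase>[^<]*</phrase>' "$1" \
        | sed -e 's/<phrase>//' -e 's/<\/phrase>//' \
        | awk '{ count[$0]++ } END { for (p in count) printf "%d\t%s\n", count[p], p }' \
        | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Demo on two hand-written sentences.
demo=$(mktemp)
printf '%s\n' \
    'We study <phrase>data mining</phrase> and <phrase>machine learning</phrase>.' \
    'Recent <phrase>data mining</phrase> work.' > "$demo"
extract_phrases "$demo"
rm -f "$demo"
```

Against a real run, the argument would be the segmentation.txt file under the MODEL directory.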

Incorporate Domain-Specific Knowledge Bases

If domain-specific knowledge bases are available, such as MeSH terms, there are two ways to incorporate them.

(recommended) Append your known quality phrases to the file data/EN/wiki_quality.txt.
Replace the file data/EN/wiki_quality.txt with your known quality phrases.
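The recommended append can be done without introducing duplicate entries. The sketch below assumes one phrase per line in both files, which is an assumption about the file layout; the function name merge_phrases and the file name mesh_terms.txt are hypothetical.

```shell
# merge_phrases BASE EXTRA
# Prints the lines of BASE, then of EXTRA, skipping blank lines and repeats.
# Assumption: both files hold one phrase per line.
merge_phrases() {
    awk 'NF && !seen[$0]++' "$1" "$2"
}

# Demo with two small hand-made lists.
base=$(mktemp); extra=$(mktemp)
printf 'data mining\nmachine learning\n' > "$base"
printf 'machine learning\nmyocardial infarction\n' > "$extra"
merge_phrases "$base" "$extra"
rm -f "$base" "$extra"

# Possible use against the real file (mesh_terms.txt is a hypothetical name):
#   merge_phrases data/EN/wiki_quality.txt mesh_terms.txt > merged.txt
#   mv merged.txt data/EN/wiki_quality.txt
```

Writing to a temporary file first avoids truncating wiki_quality.txt while it is still being read.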

Handle Other Languages

Tokenizer and POS tagger

In fact, our tokenizer supports many different languages, including Arabic (AR), German (DE), English (EN), Spanish (ES), French (FR), Italian (IT), Japanese (JA), Portuguese (PT), Russian (RU), and Chinese (CN). If the language detection is wrong, you can manually specify the language by modifying the TOKENIZER command in the bash script auto_phrase.sh, using the two-letter code for that language. For example, the following forces the language to be English.

TOKENIZER="-cp .:tools/tokenizer/lib/*:tools/tokenizer/resources/:tools/tokenizer/build/ Tokenizer -l EN"

We also provide a default tokenizer together with a dummy POS tagger in tools/tokenizer. It uses the StandardTokenizer in Lucene and always assigns the tag UNKNOWN to each token. To enable this feature, please add -l OTHER to the TOKENIZER command in the bash script auto_phrase.sh.

TOKENIZER="-cp .:tools/tokenizer/lib/*:tools/tokenizer/resources/:tools/tokenizer/build/ Tokenizer -l OTHER"

If you want to incorporate your own tokenizer and/or POS tagger, please create a new class extending SpecialTagger in tools/tokenizer. You may refer to StandardTagger as an example.

stopwords.txt

You may try to search online or create your own list.

wiki_all.txt and wiki_quality.txt

You also have to add two lists of quality phrases: data/OTHER/wiki_quality.txt and data/OTHER/wiki_all.txt. The phrases in wiki_quality should be high-confidence quality phrases, while wiki_all, as its superset, can be a little noisy. For more details, please refer to tools/wiki_enities.

Use an already tokenized/preprocessed and POS tagged corpus

You can also use AutoPhrase with an already tokenized and tagged corpus. For this, you need to:

Set POS_TAGGING_MODE=${POS_TAGGING_MODE:- 2} in both the auto_phrase.sh and phrasal_segmentation.sh scripts.
Place a pos_tags.txt file inside your data directory (e.g., data/EN/pos_tags.txt).
Ensure that the count of tags in pos_tags.txt is equal to the count of tokens in dataset.txt.
Separate the tokens of your dataset.txt (input file) using one-character delimiters. Set the delimiters in both the auto_phrase.sh and phrasal_segmentation.sh scripts (search for -delimiters).
E.g., if \n, \t, and the space character are used as delimiters, set:

auto_phrase.sh:

time java $TOKENIZER -m train -i $RAW_TRAIN -o $TOKENIZED_TRAIN -t $TOKEN_MAPPING -c N -thread $THREAD -delimiters "\n\t "

phrasal_segmentation.sh:

time java $TOKENIZER -m direct_test -i $TEXT_TO_SEG -o $TOKENIZED_TEXT_TO_SEG -t $TOKEN_MAPPING -c N -thread $THREAD -delimiters "\n\t "

Note also that, by using such custom input, you can lemmatize or stem your tokens beforehand and keep the already computed POS tags unchanged.
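The requirement that the number of tags match the number of tokens can be checked before starting a run. The sketch below assumes the delimiters are newline, tab, and space, as in the example above, and that pos_tags.txt holds one tag per line; the latter is an assumption about the file layout. The function names are illustrative.

```shell
# count_tokens FILE: counts non-empty tokens split on newline, tab, and space.
count_tokens() {
    tr '\t ' '\n\n' < "$1" | grep -c '[^[:space:]]'
}

# count_tags FILE: counts non-blank lines (assumes one tag per line).
count_tags() {
    grep -c '[^[:space:]]' "$1"
}

# check_alignment CORPUS TAGS: reports both counts, returns 1 on a mismatch.
check_alignment() {
    tokens=$(count_tokens "$1")
    tags=$(count_tags "$2")
    if [ "$tokens" -eq "$tags" ]; then
        echo "OK: $tokens tokens, $tags tags"
    else
        echo "MISMATCH: $tokens tokens, $tags tags"
        return 1
    fi
}

# Demo on a tiny hand-made corpus with made-up tags.
corpus=$(mktemp); tagfile=$(mktemp)
printf 'data mining is fun\nwe like\tphrases\n' > "$corpus"
printf 'NN\nNN\nVBZ\nNN\nPRP\nVBP\nNNS\n' > "$tagfile"
check_alignment "$corpus" "$tagfile"
rm -f "$corpus" "$tagfile"
```

If you use different delimiters, adjust the character set passed to tr to match the -delimiters value.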

Docker

Default Run

sudo docker run -v $PWD/models:/autophrase/models -it \
    -e ENABLE_POS_TAGGING=1 \
    -e MIN_SUP=30 -e THREAD=10 \
    remenberl/autophrase

./auto_phrase.sh

The results will be available in the models folder. Note that all of the environment variables above are set to their default values, so leaving the assignments out would produce exactly the same results. (However, in that case, the results of phrasal_segmentation.txt would be saved to the internal default_models directory. This is unavoidable, since the phrasal segmentation app reads from and writes to the same model directory.)

User Specified Input

Assuming the path to input file is ./data/input.txt.

sudo docker run -v $PWD/data:/autophrase/data -v $PWD/models:/autophrase/models -it \
    -e RAW_TRAIN=data/input.txt \
    -e ENABLE_POS_TAGGING=1 \
    -e MIN_SUP=30 -e THREAD=10 \
    -e MODEL=models/MyModel \
    -e TEXT_TO_SEG=data/input.txt \
    remenberl/autophrase

./auto_phrase.sh

"RAW_TRAIN" is the training corpus, and "TEXT_TO_SEG" is a corpus whose phrases are to be highlighted--typically, this is the same corpus, but training and phrasal segmentation use two different scripts. When the user wants to segment a new corpus with an existing model, only the latter script need be used (and setting "RAW_TRAIN" isn't necessary).

Note that, in a Docker deployment, the (default) data and models directories are renamed to default_data and default_models, respectively, to avoid conflicts with mounted external directories with the same names. There is also little point in saving a model to the default models directory, since all new files are erased when the container is exited. (If an external directory is mounted as "models" and no value is specified for "MODEL", the results will be saved in the "models/DBLP" subdirectory.) For the same reason, there is little point in running a container with the "FIRST_RUN" variable set to 0.

Because the original data directory has been renamed, it's perfectly fine for the user to mount an external directory called "data" and read the corpus from there. In most cases, there's no need for a user to change the supplied files stored in the default data directory. If such a change is necessary, though, the environment variable that specifies the directory in question is "DATA_DIR".

In Windows

The sudo command won't work in a Windows bash shell, and in any case isn't needed in an elevated window. Replace it with winpty.

In addition, the PWD variable works a little oddly in MinGW (the Git bash shell), appending ";C" to the end of the path. To prevent this, replace $PWD/models:/autophrase/models with "/${PWD}/models":/autophrase/models, and $PWD/data:/autophrase/data with "/${PWD}/data":/autophrase/data.
