plandes/clj-nlp-parse

Natural Language Parsing and Feature Generation

License

MIT license

Natural Language Parse and Feature Generation

A Clojure library to parse natural language text into features useful for machine learning models.

Features include:

- Wraps several Java natural language parsing libraries.
- Gives access to the data structures rendered by the parsers.
- Provides utility functions to create features.

This framework combines the results of the following frameworks:

Features

- Callable from Java
- Callable from REST
- Callable from REST in a Docker Image
- Completely customizable.
- Easily extendable.
- Combines all annotations as pure Clojure data structures.
- Provides feature creation libraries.
- Stitches multiple frameworks to provide the following features:
  - Tokenizing
  - Grouping Tokens into Sentences
  - Lemmatisation
  - Part of Speech Tagging
  - Stop Words (both word and lemma)
  - Named Entity Recognition
  - Syntactic Parse Tree
  - Fast Shift Reduce Parse Tree
  - Dependency Tree
  - Co-reference Graph
  - Sentiment Analysis
  - Semantic Role Labeler
- Seamless integration with other feature creation libraries:
  - [General NLP feature creation]
  - [Word vector feature creation]

Obtaining

In your project.clj file, add:

Documentation

API Documentation

Annotation Definitions

The utterance parse annotation tree definitions are given here.

Example Parse

An example of a full annotation parse is given here.

Setup

The NER model is included in the Stanford CoreNLP dependencies, but you still have to download the POS model. To download (or create a symbolic link if you've set the ZMODEL environment variable):

```
$ make model
```

If this doesn't work, follow the manual steps. Otherwise you can optionally move the model to a shared location on the file system and skip to configuring the REPL.

Download and Install POS Tagger Model Manually

If the normal setup failed, you'll have to manually download the POS tagger model.

The library can be configured to use any POS model (or NER for that matter), but by default it expects the english-left3words-distsim.tagger model.

Create a directory in which to put the model:
```
$ mkdir -p path-to-model/stanford/pos
```
Download the english-left3words-distsim.tagger model or a similar model.

Install the model file:

```
$ unzip stanford-postagger-2015-12-09.zip
$ mv stanford-postagger-2015-12-09/models/english-left3words-distsim.tagger path-to-model/stanford/pos
```

REPL

If you download the model to any location other than the current start directory (see setup), you will have to tell the REPL where the model is kept on the file system.

Start the REPL and configure:

```clojure
user> (System/setProperty "zensols.model" "path-to-model")
```

Note that system properties can be passed via lein to avoid having to repeat this for each REPL instance.
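One way to do this is Leiningen's `:jvm-opts` key, which passes flags to the JVMs that lein starts. The fragment below is a minimal sketch: the project name and version are hypothetical, and `path-to-model` is the same placeholder used in the setup steps.

```clojure
;; project.clj (fragment); the project name and version are hypothetical
(defproject example/nlp-app "0.1.0-SNAPSHOT"
  ;; -D sets the zensols.model system property in JVMs started by lein;
  ;; path-to-model is a placeholder for your own model directory
  :jvm-opts ["-Dzensols.model=path-to-model"])
```

With this in place, the `System/setProperty` call should not be needed in REPLs started from that project.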

Usage

This package supports:

Usage Example

See the example repo that illustrates how to use this library and contains the code from where these examples originate. It's highly recommended to clone it and follow along as you peruse this README.

Parsing an Utterance

```clojure
user> (require '[zensols.nlparse.parse :refer (parse)])
user> (clojure.pprint/pprint (parse "I am Paul Landes."))
=> {:text "I am Paul Landes.",
    :mentions
    ({:entity-type "PERSON",
      :token-range [2 4],
      :ner-tag "PERSON",
      :sent-index 0,
      :char-range [5 16],
      :text "Paul Landes"}),
    :sents
    ({:text "I am Paul Landes.",
      :sent-index 0,
      :parse-tree
      {:label "ROOT",
       :child
       ({:label "S",
         :child
         ({:label "NP",
           :child ({:label "PRP", :child ({:label "I", :token-index 1})})}
...
      :dependency-parse-tree
      ({:token-index 4,
        :text "Landes",
        :child
        ({:dep "nsubj", :token-index 1, :text "I"}
         {:dep "cop", :token-index 2, :text "am"}
         {:dep "compound", :token-index 3, :text "Paul"}
         {:dep "punct", :token-index 5, :text "."})}),
...
      :tokens
      ({:token-range [0 1],
        :ner-tag "O",
        :pos-tag "PRP",
        :lemma "I",
        :token-index 1,
        :sent-index 0,
        :char-range [0 1],
        :text "I",
        :srl
        {:id 1,
         :propbank nil,
         :head-id 2,
         :dependency-label "root",
         :heads ({:function-tag "PPT", :dependency-label "A1"})}}
...
```
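Because the result is plain Clojure data, core sequence functions are enough to pull values out of it. The sketch below works on a hand-abbreviated copy of the output above rather than on a live parse; the var names are illustrative, vectors stand in for the printed lists, and fields not needed here are left out.

```clojure
;; Hand-abbreviated copy of the parse output shown above; it is typed in
;; by hand and not produced by calling the library.
(def sample-parse
  {:text "I am Paul Landes."
   :mentions [{:entity-type "PERSON" :token-range [2 4]
               :char-range [5 16] :text "Paul Landes"}]
   :sents [{:text "I am Paul Landes."
            :sent-index 0
            :dependency-parse-tree
            [{:token-index 4 :text "Landes"
              :child [{:dep "nsubj" :token-index 1 :text "I"}
                      {:dep "cop" :token-index 2 :text "am"}
                      {:dep "compound" :token-index 3 :text "Paul"}
                      {:dep "punct" :token-index 5 :text "."}]}]}]})

;; Text of every mention in the utterance.
(def mention-texts (map :text (:mentions sample-parse)))

;; Dependency labels of the root's children in the first sentence.
(def root-child-deps
  (->> sample-parse :sents first :dependency-parse-tree first :child
       (map :dep)))

;; In this sample the mention's :char-range indexes into the utterance text.
(def mention-substring
  (let [[start end] (-> sample-parse :mentions first :char-range)]
    (subs (:text sample-parse) start end)))
```

The same threading style applies to a real parse result, which is what the library's utility functions wrap.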

Utility Functions

There are utility functions to help with getting around the parsed data, as it can be pretty large. For example, to find the head of the dependency tree:

```clojure
user> (def panon (parse "I am Paul Landes."))
=> {:text...
user> (->> panon :sents first p/root-dependency :text)
=> "Landes"
```

In this case, the last name is the head of the tree and happens to be a named entity as detected by the Stanford CoreNLP NER system. Named entities are annotated at the token level, but are also included in the top-level mentions with the entire set of concatenated tokens (for cases where a named entity contains more than one token, as in this case). To get the full mention text:

```clojure
user> (->> panon :sents first p/root-dependency
           (p/mention-for-token panon)
           first :text)
=> "Paul Landes"
```
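To make the relationship between tokens and mentions concrete, here is a small stand-alone sketch of such a lookup. It is not the library's `p/mention-for-token`; the helper name `mentions-covering` is hypothetical, and it assumes that `:token-range` is a zero-based, half-open pair while `:token-index` is one-based, which is what the sample output suggests.

```clojure
;; Hypothetical helper, not the library's p/mention-for-token.
;; Assumes :token-range is a zero-based half-open [start end) pair and
;; :token-index is one-based, as the sample output suggests.
(defn mentions-covering
  [mentions token-index]
  (filter (fn [{[start end] :token-range}]
            (and (< start token-index) (<= token-index end)))
          mentions))

;; The single mention from the sample output, abbreviated by hand.
(def sample-mentions
  [{:entity-type "PERSON" :token-range [2 4] :text "Paul Landes"}])

;; "Landes" has :token-index 4 in the sample output.
(def landes-mention (first (mentions-covering sample-mentions 4)))

;; "am" has :token-index 2 and is not part of any mention.
(def am-mention (first (mentions-covering sample-mentions 2)))
```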

Feature Creation

This library was written to generate features for machine learning algorithms. There are some utility functions for doing this.

Other feature libraries that integrate with this library:

- [General NLP feature creation]
- [Word vector feature creation]

Below are examples of feature creation with just this library.

Get the first propbank parsed from the SRL:

```clojure
user> (->> panon f/first-propbank-label)
=> "be.01"
```

Get stats on features:

```clojure
user> (->> panon p/tokens (f/token-features panon))
=> {:utterance-length 17,
    :mention-count 1,
    :sent-count 1,
    :token-count 5,
    :token-average-length 14/5,
    :is-question false}
```
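The numbers above can be checked by hand. The sketch below recomputes three of them from the token texts of the sample utterance; `token-stats` is a hypothetical function written for this illustration, and how the library itself derives these values is an assumption that merely matches the output shown.

```clojure
;; Token texts for "I am Paul Landes." as split in the sample output.
(def token-texts ["I" "am" "Paul" "Landes" "."])

(defn token-stats
  "Hypothetical re-computation of three of the features shown above."
  [utterance tokens]
  {:utterance-length (count utterance)
   :token-count (count tokens)
   ;; `/` on integers yields an exact ratio such as 14/5;
   ;; an empty token list gives 0 instead of dividing by zero
   :token-average-length (if (seq tokens)
                           (/ (reduce + (map count tokens)) (count tokens))
                           0)})

(def stats (token-stats "I am Paul Landes." token-texts))
```

Keeping the average as a ratio rather than a floating point number preserves the exact value until a learner needs it as a double.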

Each function X has an analog function X-feature-keys that describes the features generated and their types, which can be used directly as Weka attributes:

```clojure
user> (clojure.pprint/pprint (f/token-feature-metas))
=> [[:utterance-length numeric]
    [:mention-count numeric]
    [:sent-count numeric]
    [:token-count numeric]
    [:token-average-length numeric]
    [:is-question boolean]]
```

Get in/out-of-vocabulary ratio:

```clojure
user> (->> panon p/tokens f/dictionary-features)
=> {:in-dict-ratio 4/5}
```

Word count features provide distributions over word counts. See the unit test.

Stopword Filtering

Filter stop words out of the parsed tokens:

```clojure
user> (require '[zensols.nlparse.parse :as p])
user> (require '[zensols.nlparse.stopword :as st])
user> (->> (p/parse "This is a test.  This will filter 5 semantically significant words.")
           p/tokens
           st/go-word-forms)
=> ("test" "filter" "semantically" "significant" "words")
```

See the unit test.

Dictionary Utility

See the NLP feature library for more information on dictionary specifics.

Pipeline Configuration

You can configure not only the natural language processing pipeline and which specific components to use, but also define and add your own plugin library. See the config namespace for more information.

Pipeline Usage

For example, if all you need is tokenization and sentence chunking, create a context with those specific components and parse inside the with-context macro using that context:

```clojure
(require '[zensols.nlparse.config :as conf :refer (with-context)]
         '[zensols.nlparse.parse :refer (parse)])

(let [ctx (->> (conf/create-parse-config
                :pipeline [(conf/tokenize)
                           (conf/sentence)])
               conf/create-context)]
  (with-context ctx
    (parse "I love Clojure.  I enjoy it.")))
```

You can also specify the configuration in the form of a string:

```clojure
(let [ctx (conf/create-context "tokenize,sentence,part-of-speech")]
  (with-context ctx
    (parse "I love Clojure.  I enjoy it.")))
```

The configuration string can also take parameters (for example, the en parameter to the tokenizer, specifying English as the natural language):

```clojure
(let [ctx (conf/create-context "tokenize(en),sentence,part-of-speech")]
  (with-context ctx
    (parse "I love Clojure.  I enjoy it.")))
```
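To illustrate the shape of such a configuration string (comma-separated component names, each with an optional parenthesized parameter list), here is a rough stand-alone reader. It is not the library's DSL parser; the function name `read-pipeline-string` and the returned map keys are hypothetical, and it assumes parameters never contain parentheses.

```clojure
(require '[clojure.string :as str])

;; Hypothetical reader for strings shaped like "tokenize(en),sentence".
;; This is NOT the library's DSL parser; it only illustrates the
;; name(param, ...) shape of each comma-separated component.
(defn read-pipeline-string
  [s]
  (for [[_ cname params] (re-seq #"([\w-]+)(?:\(([^)]*)\))?" s)]
    {:name cname
     ;; params is nil when the component has no parenthesized part
     :params (if (seq params)
               (str/split params #"\s*,\s*")
               [])}))

(def components
  (read-pipeline-string "tokenize(en),sentence,part-of-speech"))
```

Here `components` holds one map per pipeline component, in the order given in the string.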

For an example of how to configure the pipeline, see this test case. For more information on the DSL itself, see the DSL parser.

Convenience Namespace

If you use a particular configuration that doesn't change often, consider writing your own utility parse namespace:

```clojure
(ns example.nlp.parse
  (:require [clojure.string]
            [zensols.nlparse.parse :as p]
            [zensols.nlparse.config :as conf :refer (with-context)]))

;; holds the parse context once it has been created
(defonce ^:private parse-context-inst (atom nil))

(defn- create-context []
  (->> ["tokenize"
        "sentence"
        "part-of-speech"
        "morphology"
        "named-entity-recognizer"
        "parse-tree"]
       (clojure.string/join ",")
       conf/create-context))

;; create the context on first use and reuse it afterwards
(defn- context []
  (swap! parse-context-inst #(or % (create-context))))

(defn parse [utterance]
  (with-context (context)
    (p/parse utterance)))
```

Now in your application namespace:

```clojure
(ns example.nlp.core
  (:require [example.nlp.parse :as p]))

(defn somefn []
  (p/parse "an utterance"))
```

Command Line Usage

The command line usage of this project has moved to the NLP server.

Building

To build from source, do the following:

- Install Leiningen (this is just a script)
- Install GNU make
- Install Git
- Download the source: git clone --recurse-submodules https://github.com/plandes/clj-nlp-parse && cd clj-nlp-parse
- Build the software: make jar
- Build the distribution binaries: make dist

Note that you can also build a single jar file with all the dependencies with: make uber

Changelog

An extensive changelog is available here.

Citation

If you use this software in your research, please cite with the following BibTeX:

```bibtex
@misc{plandes-clj-nlp-parse,
  author = {Paul Landes},
  title = {Natural Language Parse and Feature Generation},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/plandes/clj-nlp-parse}}
}
```

References

See the [General NLP feature creation] library for additional references.

```bibtex
@phdthesis{choi2014optimization,
  title = {Optimization of natural language processing components for robustness and scalability},
  author = {Choi, Jinho D},
  year = {2014},
  school = {University of Colorado Boulder}
}

@InProceedings{manning-EtAl:2014:P14-5,
  author = {Manning, Christopher D. and Surdeanu, Mihai and Bauer, John and Finkel, Jenny and Bethard, Steven J. and McClosky, David},
  title = {The {Stanford} {CoreNLP} Natural Language Processing Toolkit},
  booktitle = {Association for Computational Linguistics (ACL) System Demonstrations},
  year = {2014},
  pages = {55--60},
  url = {http://www.aclweb.org/anthology/P/P14/P14-5010}
}
```

License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
