cacfd3a/java-probabilistic-earley-parserPublic

NotificationsYou must be signed in to change notification settings
Fork11
Star36

🎲 Efficient Java implementation of the probabilistic Earley algorithm to parse Stochastic Context Free Grammars (SCFGs)

License

MIT license

36 stars 11 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 165 Commits
gradle/wrapper		gradle/wrapper
src		src
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE.txt		LICENSE.txt
README.md		README.md
build.gradle		build.gradle
gradlew		gradlew
gradlew.bat		gradlew.bat
grammar.cfg		grammar.cfg
settings.gradle		settings.gradle

Repository files navigation

Probabilistic Earley parser

⚠️ Warning

This code is currently broken. It has a subtle bug which produces invalid results, can make your code run exponentially & could use exponential memory.

DO NOT USE THIS LIBRARY!!!

Pull requests are welcome, but seeing as this is GitHub, nobody will care & the project is thus effectively abandoned. Contactcacfd3a@duck.com if you really need a functioning Probabilistic Earley Parser enough so that you are willing to fund it.

This is a library for parsing a sequence of tokens (like words) into tree structures, along with the probability that the particular sequence generates that tree structure. This is mainly useful for linguistic purposes, such as morphological parsing, speech recognition and generally information extraction. It also finds applications in computational biology.

For example:

As a computational linguist, you want toderive the most probable interpretation of an English sentence

tokens	parse tree
[i, want, british, food]

As a computational biologist, you want topredict the secondary structure for an RNA sequence (which you can encode in a tree structure)

tokens	parse tree
GGGC``UAUU``AGCU``CAGU UGGU``UAGA``GCGC``ACCC CUGA``UAAG``GGUG``AGGU CGCU``GAUU``CGAA``UUCA GCAU``AGCC``CA

As a cognitive scientist, you want tocompute the cognitive load of interpreting sentences as a function of "surprises" that happen as you read a sentence
As a computational linguist,you want to know the most likely table of contents structure for a list of paragraphs

This library allows you to do these thingsefficiently, as long as you can describe the rules as aContext-free Grammar (CFG).

The innovation of this library with respect to the many other parsing libraries is that this one allows the production rules in your grammar to have a probability attached to them. That is: it parses sentences usingStochastic Context-free Grammars. This allows us to make better choices in case of ambiguous sentences, because we can order multiple interpretations of the same sentence by probability. Furthermore, this parser does not limit token types to strings.

For a theoretical grounding of this work, refer toStolcke, An Efficient Probabilistic Context-FreeParsing Algorithm that Computes PrefixProbabilities.

Usage

You can use this project as a library in your Java application or as a standalone command-line app.

Command line

Downloadthe latest JAR

By default, the parser will assume that you distinguish non-terminals from terminals by capitalizing them. You can also add a custom category handler if you call the API from Java code.

Create a UTF8-encoded.cfg file that contains your grammar, such as the following:

# grammar.cfgS   -> NP VP   (0.8)    # specify probability between 0 and 1 by appending between parenthesesS   -> NP      (0.2)NP  -> Det N            # probability defaults to 1.0NP  -> Det Nom          # all rules for a given category must sum to 1, so the builder normalizes probabilities to ensure they sum to 1.0Nom -> Adj NVP  →  V                # Use '->' or '→'Det →  the              # probability defaults to 1.0N   →  heavy   (0.2)Adj →  heavy   (0.8)V   →  heave   (0.8)N   →  /heave[r]?/i   (0.2) # You can specify terminals as regular expressions by enclosing them in '/'.

Execute runnable jar on the terminal:

java -jar probabilistic-earley-parser-jar-with-dependencies.jar -i grammar.cfg -goal S the heavy heave

This will give the Viterbi parse to the ambiguous sentence "the heavy heave":

0.128 (= 0.8 * 0.2 * 0.8)└── <start>    └── S        ├── NP        │   ├── Det        │   │   └── the (the)        │   └── N        │       └── heavy (heavy)        └── VP            └── V                └── heave (heave)

This is the parse with the semantic of "heavy people heave"

In contrast, the less likely parse was "a heave that is heavy":

0.032 (= 0.2 * 0.8 * 0.2)└── <start>    └── S        └── NP            ├── Det            │   └── the (the)            └── Nom                ├── Adj                │   └── heavy (heavy)                └── N                    └── heave (heave)

The command line interface is meant for quickly trying out a simple grammar. For actual real-life parsing stuff, you probably want to use the Java API.

Java API

Grab from Maven:

<dependencies>        <dependency>            <groupId>org.leibnizcenter</groupId>            <artifactId>probabilistic-earley-parser</artifactId>            <version>0.9.12</version>        </dependency></dependencies>

or Gradle:

compile'org.leibnizcenter:probabilistic-earley-parser:0.9.12'

Or just include thethe latest JAR in your project.

Most applications will want to interface withParser, which you instantiate with a grammar:

publicclassExample {// NonTerminals are just wrappers around a stringprivatestaticfinalNonTerminalS =Category.nonTerminal("S");privatestaticfinalNonTerminalNP =Category.nonTerminal("NP");privatestaticfinalNonTerminalVP =Category.nonTerminal("VP");privatestaticfinalNonTerminalTV =Category.nonTerminal("TV");privatestaticfinalNonTerminalDet =Category.nonTerminal("Det");privatestaticfinalNonTerminalN =Category.nonTerminal("N");privatestaticfinalNonTerminalMod =Category.nonTerminal("Mod");// Terminal types are realized by implementing the Terminal interface, specifically the function hasCategory. Terminal is a functional interface.// Note that tokens can be of multiple terminal types (homographs: "bank" as a noun or "bank" as a verb), so you can use this method to pool many words to a single terminalprivatestaticfinalTerminal<String>transitiveVerb =token ->token.obj.matches("(hit|chased)");// Some utility terminal types are pre-defined:privatestaticfinalTerminal<String>the =newCaseInsensitiveStringTerminal("the");privatestaticfinalTerminal<String>a =newCaseInsensitiveStringTerminal("a");privatestaticfinalTerminal<String>man =newExactStringTerminal("man");privatestaticfinalTerminal<String>stick =newExactStringTerminal("stick");privatestaticfinalTerminal<String>with =newExactStringTerminal("with");privatestaticfinalGrammargrammar =newGrammar.Builder("test")            .setSemiring(LogSemiring.get())// If not set, defaults to Log semiring which is probably what you want. The builder takes care of converting probabilties to semiring elements            .addRule(1.0,// Probability between 0.0 and 1.0, defaults to 1.0S,// Left hand side of the ruleNP,VP// Right hand side of the rule            )            .addRule(NP,Det,N// eg. The man            )            .addRule(NP,Det,N,Mod// eg. The man (with a stick)            )            .addRule(0.4,VP,TV,NP,Mod// eg. (chased) (the man) (with a stick)            )            .addRule(0.6,VP,TV,NP// eg. (chased) (the man with a stick)            )            .addRule(Det,a)            .addRule(Det,the)            .addRule(N,man)            .addRule(N,stick)            .addRule(TV,transitiveVerb)            .addRule(Mod,with,NP)// eg. with a stick            .build();publicstaticvoidmain(String[]args) {Parser<String>parser =newParser<>(grammar);System.out.println(parser.recognize(S,Tokens.tokenize("The man     chased the man\n\t with a stick"))// true        );System.out.println(parser.getViterbiParseWithScore(S,Tokens.tokenize("the","stick","chased","the","man"))// Most likely parse tree with probability        );    }}

You can parse.cfg files as follows:

Grammar<String> g = Grammar.fromString(Paths.get("path", "to", "grammar.cfg"), Charset.forName("UTF-8"));

One of the advantages of Earley parsing is the top-down control you can exert while parsing.You can pass the parser callbacks to influence the parsing process. Only use this if you really know what you're doing. It may mess up your results if you are not careful.

new ParseCallbacks.Builder()                        .withOnPreScan((position, token, chart) -> System.out.println("Scan about to happen for token " + token))                        .withScanProbability((position, token) -> {                            if (token.getCategories().contains(anUnexpectedTerminalForThisWord)) {                                return grammar.semiring.fromProbability(0.5);                            } else {                                return grammar.semiring.one();                            }                        })                        .withOnPostScan((position, token, chart) -> System.out.println("Scan happened for token " + token))                        .withOnPostComplete((position, token, chart) -> System.out.println("Complete happened for token " + token))                        .build()

Some notes on implementation

Runtime complexity

The Earley algorithm has nice complexity properties. In particular, it canparse:

any CFG in O(n³),
unambiguous CFGs in O(n²)
left-recursive unambiguous grammars in O(n). For example,S → S a is a left-recursive rule.

Note that this implementation does not apply innovations such asJoop Leo's improvement to run linearly on on right-recursive grammars as well. It might be complicated to implement these ideas and still have a probabilistic parser.

For an efficient parser that works only on non-probabilistic context-free grammars, look intoMarpa. Marpa is a C library with a Perl interface, and a Lua interface is underway. It is currently painful to embed within a Java project, however.

Limitations

Pull requests for these issues are welcome:

I have not provisioned for ε-rules (empty right-hand sign).Issue.
Rule probability estimation may be performed using the inside-outside algorithm, but is not currently implemented.Issue.
Higher level concepts such as * and + are not implemented
Error handling / logging could be better, available as an experimental feature.Issue.
Viterbi parsing only returns one single parse. In the case of an ambiguous sentence, the returned parse is not guaranteed the left-most parse.
Behavior for strangely defined grammars is not defined, such as when the same rule is defined multiple times witha different probability.Issue

Semirings

The probability of a parse is defined as the product of the probabilities of all the applied rules. Usually,we define probability as a number between 0 and 1 inclusive, and use common algebraic notions of addition andmultiplication.

Implementing this naively can lead to problems: imagine a computation like 0.1 * 0.1 * ... * 0.1. At some point, floating point arithmetic will be unable to represent a number so small (arithmetic underflow). To counter, we map numbers between 0 and 1 to numbers between 0 and infinity with the Log semiring, which allows us to represent a lot more numbers (everything between 0 andDouble.MAX_VALUE).

This project makes it possible to useany commutativesemiring that can have its elementsrepresented as doubles. I can't really imagine a use case for using another semiring than the Log semiring, but who knows, maybe you can.

License

This software is licensed under a permissiveMIT license.

References

Contact

Inquiries go to cacfd3a atcacfd3a@duck.com

About

🎲 Efficient Java implementation of the probabilistic Earley algorithm to parse Stochastic Context Free Grammars (SCFGs)

Releases3

v0.10.0 Latest

Dec 30, 2017

+ 2 releases

Packages

No packages published

Languages

Java100.0%

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Probabilistic Earley parser

⚠️ Warning

Usage

Command line

Java API

Some notes on implementation

Runtime complexity

Limitations

Semirings

License

References

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases3

Packages

Uh oh!

Languages

Movatterモバイル変換

License

cacfd3a/java-probabilistic-earley-parser

Folders and files

Latest commit

History

Repository files navigation

Probabilistic Earley parser

⚠️ Warning

Usage

Command line

Java API

Some notes on implementation

Runtime complexity

Limitations

Semirings

License

References

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases3

Packages0

Uh oh!

Languages

Packages