- Notifications
You must be signed in to change notification settings - Fork11
🎲 Efficient Java implementation of the probabilistic Earley algorithm to parse Stochastic Context Free Grammars (SCFGs)
License
cacfd3a/java-probabilistic-earley-parser
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
This code is currently broken. It has a subtle bug which produces invalid results, can make your code run exponentially & could use exponential memory.
DO NOT USE THIS LIBRARY!!!
Pull requests are welcome, but seeing as this is GitHub, nobody will care & the project is thus effectively abandoned. Contactcacfd3a@duck.com if you really need a functioning Probabilistic Earley Parser enough so that you are willing to fund it.
This is a library for parsing a sequence of tokens (like words) into tree structures, along with the probability that the particular sequence generates that tree structure. This is mainly useful for linguistic purposes, such as morphological parsing, speech recognition and generally information extraction. It also finds applications in computational biology.
For example:
- As a computational linguist, you want toderive the most probable interpretation of an English sentence
| tokens | parse tree |
|---|---|
| [i, want, british, food] | ![]() |
- As a computational biologist, you want topredict the secondary structure for an RNA sequence (which you can encode in a tree structure)
| tokens | parse tree |
|---|---|
GGGC``UAUU``AGCU``CAGUUGGU``UAGA``GCGC``ACCCCUGA``UAAG``GGUG``AGGUCGCU``GAUU``CGAA``UUCAGCAU``AGCC``CA | ![]() |
As a cognitive scientist, you want tocompute the cognitive load of interpreting sentences as a function of "surprises" that happen as you read a sentence
As a computational linguist,you want to know the most likely table of contents structure for a list of paragraphs
This library allows you to do these thingsefficiently, as long as you can describe the rules as aContext-free Grammar (CFG).
The innovation of this library with respect to the many other parsing libraries is that this one allows the production rules in your grammar to have a probability attached to them. That is: it parses sentences usingStochastic Context-free Grammars. This allows us to make better choices in case of ambiguous sentences, because we can order multiple interpretations of the same sentence by probability. Furthermore, this parser does not limit token types to strings.
For a theoretical grounding of this work, refer toStolcke, An Efficient Probabilistic Context-FreeParsing Algorithm that Computes PrefixProbabilities.
You can use this project as a library in your Java application or as a standalone command-line app.
Downloadthe latest JAR
By default, the parser will assume that you distinguish non-terminals from terminals by capitalizing them. You can also add a custom category handler if you call the API from Java code.
Create a UTF8-encoded.cfg file that contains your grammar, such as the following:
# grammar.cfgS -> NP VP (0.8) # specify probability between 0 and 1 by appending between parenthesesS -> NP (0.2)NP -> Det N # probability defaults to 1.0NP -> Det Nom # all rules for a given category must sum to 1, so the builder normalizes probabilities to ensure they sum to 1.0Nom -> Adj NVP → V # Use '->' or '→'Det → the # probability defaults to 1.0N → heavy (0.2)Adj → heavy (0.8)V → heave (0.8)N → /heave[r]?/i (0.2) # You can specify terminals as regular expressions by enclosing them in '/'.Execute runnable jar on the terminal:
java -jar probabilistic-earley-parser-jar-with-dependencies.jar -i grammar.cfg -goal S the heavy heaveThis will give the Viterbi parse to the ambiguous sentence "the heavy heave":
0.128 (= 0.8 * 0.2 * 0.8)└── <start> └── S ├── NP │ ├── Det │ │ └── the (the) │ └── N │ └── heavy (heavy) └── VP └── V └── heave (heave)This is the parse with the semantic of "heavy people heave"
In contrast, the less likely parse was "a heave that is heavy":
0.032 (= 0.2 * 0.8 * 0.2)└── <start> └── S └── NP ├── Det │ └── the (the) └── Nom ├── Adj │ └── heavy (heavy) └── N └── heave (heave)The command line interface is meant for quickly trying out a simple grammar. For actual real-life parsing stuff, you probably want to use the Java API.
Grab from Maven:
<dependencies> <dependency> <groupId>org.leibnizcenter</groupId> <artifactId>probabilistic-earley-parser</artifactId> <version>0.9.12</version> </dependency></dependencies>
or Gradle:
compile'org.leibnizcenter:probabilistic-earley-parser:0.9.12'Or just include thethe latest JAR in your project.
Most applications will want to interface withParser, which you instantiate with a grammar:
publicclassExample {// NonTerminals are just wrappers around a stringprivatestaticfinalNonTerminalS =Category.nonTerminal("S");privatestaticfinalNonTerminalNP =Category.nonTerminal("NP");privatestaticfinalNonTerminalVP =Category.nonTerminal("VP");privatestaticfinalNonTerminalTV =Category.nonTerminal("TV");privatestaticfinalNonTerminalDet =Category.nonTerminal("Det");privatestaticfinalNonTerminalN =Category.nonTerminal("N");privatestaticfinalNonTerminalMod =Category.nonTerminal("Mod");// Terminal types are realized by implementing the Terminal interface, specifically the function hasCategory. Terminal is a functional interface.// Note that tokens can be of multiple terminal types (homographs: "bank" as a noun or "bank" as a verb), so you can use this method to pool many words to a single terminalprivatestaticfinalTerminal<String>transitiveVerb =token ->token.obj.matches("(hit|chased)");// Some utility terminal types are pre-defined:privatestaticfinalTerminal<String>the =newCaseInsensitiveStringTerminal("the");privatestaticfinalTerminal<String>a =newCaseInsensitiveStringTerminal("a");privatestaticfinalTerminal<String>man =newExactStringTerminal("man");privatestaticfinalTerminal<String>stick =newExactStringTerminal("stick");privatestaticfinalTerminal<String>with =newExactStringTerminal("with");privatestaticfinalGrammargrammar =newGrammar.Builder("test") .setSemiring(LogSemiring.get())// If not set, defaults to Log semiring which is probably what you want. The builder takes care of converting probabilties to semiring elements .addRule(1.0,// Probability between 0.0 and 1.0, defaults to 1.0S,// Left hand side of the ruleNP,VP// Right hand side of the rule ) .addRule(NP,Det,N// eg. The man ) .addRule(NP,Det,N,Mod// eg. The man (with a stick) ) .addRule(0.4,VP,TV,NP,Mod// eg. (chased) (the man) (with a stick) ) .addRule(0.6,VP,TV,NP// eg. (chased) (the man with a stick) ) .addRule(Det,a) .addRule(Det,the) .addRule(N,man) .addRule(N,stick) .addRule(TV,transitiveVerb) .addRule(Mod,with,NP)// eg. with a stick .build();publicstaticvoidmain(String[]args) {Parser<String>parser =newParser<>(grammar);System.out.println(parser.recognize(S,Tokens.tokenize("The man chased the man\n\t with a stick"))// true );System.out.println(parser.getViterbiParseWithScore(S,Tokens.tokenize("the","stick","chased","the","man"))// Most likely parse tree with probability ); }}
You can parse.cfg files as follows:
Grammar<String> g = Grammar.fromString(Paths.get("path", "to", "grammar.cfg"), Charset.forName("UTF-8"));One of the advantages of Earley parsing is the top-down control you can exert while parsing.You can pass the parser callbacks to influence the parsing process. Only use this if you really know what you're doing. It may mess up your results if you are not careful.
new ParseCallbacks.Builder() .withOnPreScan((position, token, chart) -> System.out.println("Scan about to happen for token " + token)) .withScanProbability((position, token) -> { if (token.getCategories().contains(anUnexpectedTerminalForThisWord)) { return grammar.semiring.fromProbability(0.5); } else { return grammar.semiring.one(); } }) .withOnPostScan((position, token, chart) -> System.out.println("Scan happened for token " + token)) .withOnPostComplete((position, token, chart) -> System.out.println("Complete happened for token " + token)) .build()The Earley algorithm has nice complexity properties. In particular, it canparse:
- any CFG in O(n³),
- unambiguous CFGs in O(n²)
- left-recursive unambiguous grammars in O(n). For example,
S → S ais a left-recursive rule.
Note that this implementation does not apply innovations such asJoop Leo's improvement to run linearly on on right-recursive grammars as well. It might be complicated to implement these ideas and still have a probabilistic parser.
For an efficient parser that works only on non-probabilistic context-free grammars, look intoMarpa. Marpa is a C library with a Perl interface, and a Lua interface is underway. It is currently painful to embed within a Java project, however.
Pull requests for these issues are welcome:
- I have not provisioned for ε-rules (empty right-hand sign).Issue.
- Rule probability estimation may be performed using the inside-outside algorithm, but is not currently implemented.Issue.
- Higher level concepts such as * and + are not implemented
- Error handling / logging could be better, available as an experimental feature.Issue.
- Viterbi parsing only returns one single parse. In the case of an ambiguous sentence, the returned parse is not guaranteed the left-most parse.
- Behavior for strangely defined grammars is not defined, such as when the same rule is defined multiple times witha different probability.Issue
The probability of a parse is defined as the product of the probabilities of all the applied rules. Usually,we define probability as a number between 0 and 1 inclusive, and use common algebraic notions of addition andmultiplication.
Implementing this naively can lead to problems: imagine a computation like 0.1 * 0.1 * ... * 0.1. At some point, floating point arithmetic will be unable to represent a number so small (arithmetic underflow). To counter, we map numbers between 0 and 1 to numbers between 0 and infinity with the Log semiring, which allows us to represent a lot more numbers (everything between 0 andDouble.MAX_VALUE).
This project makes it possible to useany commutativesemiring that can have its elementsrepresented as doubles. I can't really imagine a use case for using another semiring than the Log semiring, but who knows, maybe you can.
This software is licensed under a permissiveMIT license.
- Stolcke, Andreas. "An efficient probabilistic context-free parsing algorithm that computes prefix probabilities.Computational linguistics 21.2 (1995): 165-201.
- Leo, Joop MIM. "A general context-free parsing algorithm running in linear time on every LR (k) grammar without using lookahead." Theoretical computer science 82.1 (1991): 165-176.
- Kegler, Jeffrey. "Marpa, A Practical General Parser: The Recognizer." (2012): 115.
Inquiries go to cacfd3a atcacfd3a@duck.com
About
🎲 Efficient Java implementation of the probabilistic Earley algorithm to parse Stochastic Context Free Grammars (SCFGs)
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.

