cacfd3a/probabilistic-earley-parser-javascriptPublic

NotificationsYou must be signed in to change notification settings
Fork3
Star56

🎲 An efficient implementation of a probabilistic Context Free Grammar parser in Javascript

License

MIT license

56 stars 3 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
src		src
test		test
.babelrc		.babelrc
.gitignore		.gitignore
.npmignore		.npmignore
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
index.ts		index.ts
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json
tslint.json		tslint.json
version.js		version.js
webpack.config.js		webpack.config.js

Repository files navigation

Probabilistic Earley parser

⚠️ Warning

This code is currently broken. It has a subtle bug which produces invalid results, can make your code run exponentially & could use exponential memory.

DO NOT USE THIS LIBRARY!!!

Pull requests are welcome, but seeing as this is GitHub, nobody will care & the project is thus effectively abandoned. Contactmaartentrompper@freedom.nl if you really need a functioning Probabilistic Earley Parser enough so that you are willing to fund it.

This is a library for parsing a sequence of tokens (like words) into tree structures, along with the probability that the particular sequence generates that tree structure. This is mainly useful for linguistic purposes, such as morphological parsing, speech recognition and generally information extraction. It also finds applications in computational biology.

For example:

As a computational linguist, you wantderive all ways to interpret an English sentence along with probabilities

tokens	parse tree
[i, want, british, food]

As a computational biologist, you want topredict the secondary structure for an RNA sequence

tokens	parse tree
GGGC``UAUU``AGCU``CAGU UGGU``UAGA``GCGC``ACCC CUGA``UAAG``GGUG``AGGU CGCU``GAUU``CGAA``UUCA GCAU``AGCC``CA

As a computational linguist,you want to know the most likely table of contents structure for a list of paragraphs

This library allows you to do these thingsefficiently, as long as you can describe the rules as aContext-free Grammar (CFG).

The innovation of this library with respect to the many other parsing libraries is that this one allows the production rules in your grammar to have a probability attached to them. That is: it parsesStochastic Context-free Grammars. This allows us to make better choices in case of ambiguous sentences: we can order them by probability. If you do not need probabilities attached to your parse trees, you are probably better off usingnearley instead.

For a theoretical grounding of this work, refer toStolcke; An Efficient Probabilistic Context-FreeParsing Algorithm that Computes PrefixProbabilities.

Motivation

While libraries for nondeterministic grammars abound, I could not find an existing JavaScriptimplementation of the Probabilistic Earley Parser. I have made a stochastic CYK parser before, but I wanted somethingmore top down that makes it easier to intervene in the parsing process,for instance when an unexpected token is encountered.In many cases Earley also parses faster than CYK (sparse grammars) and it doesn't require the grammar to berewritten in any normal form.

Usage

Get the most likely parse tree (theViterbi parse) for the sentence "the man chases the man with a stick":

import{getViterbiParse,Grammar}from'probabilistic-earley-parser';importtreeifyfrom'treeify';// Nonterminals are stringconstS="S";// : NonTerminalconstNP="NP";// : NonTerminalconstVP="VP";// : NonTerminalconstTV="TV";// : NonTerminalconstDet="Det";// : NonTerminalconstN="N";// : NonTerminalconstMod="Mod";// : NonTerminal// Terminals are functions that should return true when the parameter is of given typeconsttransitiveVerb=(token)=>!!token.match(/(hit|chased)/);// : Terminal<string>constthe=(token)=>!!token.match(/the/i);// : Terminal<string>consta=(token)=>!!token.match(/a/i);// : Terminal<string>constman=(token)=>!!token.match(/man/);// : Terminal<string>conststick=(token)=>!!token.match(/stick/);// : Terminal<string>constwith_=(token)=>!!token.match(/with/);// : Terminal<string>constgrammar=Grammar.builder("test")//: Grammar<string,number>.addNewRule(1.0,// Probability between 0.0 and 1.0, defaults to 1.0. The builder takes care of converting it to the semiring elementS,// Left hand side of the rule[NP,VP]// Right hand side of the rule)// NP -> Det N (1.0).addNewRule(1.0,NP,[Det,N]// eg. The man)// NP -> Det N Mod (1.0).addNewRule(1.0,NP,[Det,N,Mod]// eg. The man (with a stick))// VP -> TV NP Mod (0.4).addNewRule(0.4,VP,[TV,NP,Mod]// eg. (chased) (the man) (with a stick))// VP -> TV NP (0.6).addNewRule(0.6,VP,[TV,NP]// eg. (chased) (the man with a stick)).addNewRule(1.0,Det,[a]).addNewRule(1.0,Det,[the]).addNewRule(1.0,N,[man]).addNewRule(1.0,N,[stick]).addNewRule(1.0,TV,[transitiveVerb]).addNewRule(1.0,Mod,[with_,NP])// eg. with a stick.build();consttokens=["The","man","chased","the","man","with","a","stick"];constviterbi=getViterbiParse(S,grammar,tokens);// : ParseTreeWithScore<string>console.log(viterbi.probability);// 0.6/*0.6└─ S   ├─ NP   │  ├─ Det   │  │  └─ The   │  └─ N   │     └─ man   └─ VP      ├─ TV      │  └─ chased      └─ NP         ├─ Det         │  └─ the         ├─ N         │  └─ man         └─ Mod            ├─ with            └─ NP               ├─ Det               │  └─ a               └─ N                  └─ stick*/functionprintTree(tree){functionmakeTree(o){if(o.children&&o.children.length>0){constobj={};for(vari=0;i<o.children.length;i++){constname=o.children[i].token?o.children[i].token:o.children[i].category;obj[name]=makeTree(o.children[i]);}returnobj;}else{if(o.token){returno.token;}else{returno.category;}}}console.log(treeify.asTree(makeTree(tree)));}printTree(viterbi.parseTree);

You may pass a function to the parser with an addition probability multiplier for parsed tokens for additional logic that is hard to capture in a grammar. It is also possible to definepredict,scan andcomplete callbacks, but not currently implemented. (Pull requests welcome!)

Some notes on implementation

Written in TypeScript, published as acommonjs module on NPM (ES6; use--harmony_collections flag if your Node version is < 6) and asingle-file minified UMD module on Github in vulgar ES5.

This is an implementation of a probabilistic Earley parsing algorithm, which can parse any Probabilistic Context Free Grammar (PCFG) (alsoknown as Stochastic Context Free Grammar (SCFG)),or equivalently any language described in Backus-Naur Form (BNF). In these grammars,rewrite rules may be non-deterministic and have a probability attached to them.

The probability of a parse is defined as the product of the probalities all the applied rules. Usually,we define probability as a number between 0 and 1 inclusive, and use common algebraic notions of addition andmultiplication.

This code makes it possible to useanysemiring for computingscores. My use for this is to avoid arithmetic underflow: imagine a computation like 0.1 * 0.1 * ... * 0.1.At some point, floating point arithmetic will be unable to represent a number so small. To counter, we use the Logsemiring which holds the minus log of the probability. So that maps the numbers 0 and 1 to the numbersbetween infinity and zero, skewed towards lower probabilities:

Graph plot of f(x) = -log(x)

Runtime complexity

The Earley algorithm has nice complexity properties. In particular, it canparse:

any CFG in O(n³),
unambiguous CFGs in O(n²)
left-recursive unambiguous grammars in O(n)

Note that this implementation does not apply innovations such asJoop Leo's improvement to run linearly on on right-recursive grammars as well. It might be complicated to implement this: making the parser stochastic is not as easy for Earley as it is for CYK.

For a faster parser that work on non-probabilistic grammars, look intonearley.

Limitations

I have not provisioned for ε-rules (rules with an empty right hand side)
Rule probability estimation may be performed using the inside-outside algorithm, but is not currently implemented
Higher level concepts such as wildcards, * and + are not implemented
Viterbi parsing (querying the most likely parse tree) only returns one single parse. In the case of an ambiguous sentence in which multiple dervation have the highest probability, the returned parse is not guaranteed the left-most parse (I think).

License

This software is licensed under a permissiveMIT license.

References

Stolcke, Andreas. "An efficient probabilistic context-free parsing algorithm that computes prefix probabilities."Computational linguistics 21.2 (1995): 165-201.APA

Contact

Inquiries go tomaarten.trompper@gmail.com

About

🎲 An efficient implementation of a probabilistic Context Free Grammar parser in Javascript

Releases2

v0.9.3 Latest

Jan 8, 2017

+ 1 release

Packages

No packages published

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Probabilistic Earley parser

⚠️ Warning

Motivation

Usage

Some notes on implementation

Graph plot of f(x) = -log(x)

Runtime complexity

Limitations

License

References

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases2

Packages

Uh oh!

Languages

Movatterモバイル変換

License

cacfd3a/probabilistic-earley-parser-javascript

Folders and files

Latest commit

History

Repository files navigation

Probabilistic Earley parser

⚠️ Warning

Motivation

Usage

Some notes on implementation

Graph plot of f(x) = -log(x)

Runtime complexity

Limitations

License

References

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases2

Packages0

Uh oh!

Languages

Packages