A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s. You can try out our parser online.
This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The original version of this parser was mainly written by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Extensive additional work (internationalization and language-specific modeling, flexible input/output, grammar compaction, lattice parsing, k-best parsing, typed dependencies output, user support, etc.) has been done by Roger Levy, Christopher Manning, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Bill MacCartney, Anna Rafferty, Spence Green, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, and Jenny Finkel.
The lexicalized probabilistic parser implements a factored product model, with separate PCFG phrase structure and lexical dependency experts, whose preferences are combined by efficient exact inference, using an A* algorithm. Alternatively, the software can be used simply as an accurate unlexicalized stochastic context-free grammar parser. Either of these yields a statistical parsing system with good performance. A GUI is provided for viewing the phrase structure tree output of the parser.
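As a rough illustration of what combining the two experts' preferences means, the sketch below scores each candidate analysis by the sum of its PCFG and dependency log-probabilities (that is, the product of the two probabilities) and picks the highest-scoring one. It simply enumerates a fixed list of candidates; it is not the A* search the parser uses, and the class name, method name, and scores are all hypothetical.

```java
/** Hypothetical illustration of a factored (product) score; not Stanford Parser code. */
final class FactoredScoreSketch {

    /**
     * Returns the index of the candidate that maximizes
     * log P_pcfg(candidate) + log P_dep(candidate).
     */
    static int best(double[] pcfgLogProb, double[] depLogProb) {
        if (pcfgLogProb.length != depLogProb.length || pcfgLogProb.length == 0) {
            throw new IllegalArgumentException("need one score pair per candidate");
        }
        int best = 0;
        for (int i = 1; i < pcfgLogProb.length; i++) {
            double current = pcfgLogProb[i] + depLogProb[i];
            double incumbent = pcfgLogProb[best] + depLogProb[best];
            if (current > incumbent) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up log-probabilities for three candidate analyses of one sentence.
        double[] pcfg = {-10.0, -11.0, -12.5};
        double[] dep = {-9.0, -6.5, -7.0};
        System.out.println("Best candidate index: " + best(pcfg, dep));
    }
}
```

In this made-up example the PCFG expert alone would prefer the first candidate, but the dependency expert's preference shifts the combined choice to the second.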
As well as providing an English parser, the parser can be and has been adapted to work with other languages. A Chinese parser based on the Chinese Treebank, a German parser based on the Negra corpus and Arabic parsers based on the Penn Arabic Treebank are also included. The parser has also been used for other languages, such as Italian, Bulgarian, and Portuguese.
The parser provides Universal Dependencies (v1) and Stanford Dependencies output as well as phrase structure trees. Typed dependencies are otherwise known as grammatical relations. This style of output is available only for English and Chinese. For more details, please refer to the Stanford Dependencies webpage and the Universal Dependencies v1 documentation. (See also the current Universal Dependencies documentation, but we are yet to update to it.)
As of version 3.4 in 2014, the parser includes the code necessary to run a shift-reduce parser, a much faster constituent parser with competitive accuracy. Models for this parser are linked below.
In version 3.5.0 (October 2014) we released a high-performance dependency parser powered by a neural network. The parser outputs typed dependency parses for English and Chinese. The models for this parser are included in the general Stanford Parser models package.
The package includes a tool for scoring generic dependency parses, in the class edu.stanford.nlp.trees.DependencyScoring. This tool measures scores for dependency trees, doing F1 and labeled attachment scoring. The included usage message gives a detailed description of how to use the tool.
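For readers unfamiliar with attachment scoring, here is a minimal sketch of how unlabeled and labeled attachment scores are commonly computed from per-token head indices and relation labels. It is not the implementation of DependencyScoring and does not cover its F1 measure; the class and method names are hypothetical, and it assumes the gold and predicted analyses cover the same tokens in the same order.

```java
/** Hypothetical illustration of attachment scoring; not the DependencyScoring tool. */
final class AttachmentScoreSketch {

    /** Fraction of tokens whose predicted head equals the gold head. */
    static double unlabeled(int[] goldHeads, int[] predHeads) {
        if (goldHeads.length != predHeads.length) {
            throw new IllegalArgumentException("token counts differ");
        }
        if (goldHeads.length == 0) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < goldHeads.length; i++) {
            if (goldHeads[i] == predHeads[i]) {
                correct++;
            }
        }
        return (double) correct / goldHeads.length;
    }

    /** Fraction of tokens whose predicted head and relation label both equal gold. */
    static double labeled(int[] goldHeads, String[] goldLabels,
                          int[] predHeads, String[] predLabels) {
        if (goldHeads.length != predHeads.length
                || goldHeads.length != goldLabels.length
                || predHeads.length != predLabels.length) {
            throw new IllegalArgumentException("token counts differ");
        }
        if (goldHeads.length == 0) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < goldHeads.length; i++) {
            if (goldHeads[i] == predHeads[i] && goldLabels[i].equals(predLabels[i])) {
                correct++;
            }
        }
        return (double) correct / goldHeads.length;
    }

    public static void main(String[] args) {
        // Toy sentence "The rain stopped"; heads are 1-based token indices, 0 = root.
        int[] goldHeads = {2, 3, 0};
        String[] goldLabels = {"det", "nsubj", "root"};
        int[] predHeads = {2, 3, 0};
        String[] predLabels = {"det", "dobj", "root"}; // one wrong label
        System.out.println("UAS: " + unlabeled(goldHeads, predHeads));
        System.out.println("LAS: " + labeled(goldHeads, goldLabels, predHeads, predLabels));
    }
}
```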
The current version of the parser requires Java 8 or later. (You can also download older versions of the parser: version 1.4, which runs under JDK 1.4; version 2.0, which runs under JDK 1.5; and version 3.4.1, which runs under JDK 1.6. Those distributions are no longer supported.) The parser also requires a reasonable amount of memory (at least 100MB to run as a PCFG parser on sentences up to 40 words in length; typically around 500MB of memory to be able to parse similarly long typical-of-newswire sentences using the factored model).
The parser is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation, a Java parsing GUI, and a Java API.
The download is a 261 MB zipped file (mainly consisting of included grammar data files). If you unpack the zip file, you should have everything needed. Simple scripts are included to invoke the parser on a Unix or Windows system. For another system, you merely need to similarly configure the classpath.
The parser code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing is available. If you don't need a commercial license, but would like to support maintenance of these tools, we welcome gift funding: use this form and write "Stanford NLP Group open source software" in the Special Instructions.
The main technical ideas behind how these parsers work appear in these papers. Feel free to cite one or more of the following papers or people depending on what you are using. Since the parser is regularly updated, we appreciate it if papers with numerical results reflecting parser performance mention the version of the parser being used!
For the neural-network dependency parser:
Danqi Chen and Christopher D. Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. Proceedings of EMNLP 2014.
For the Compositional Vector Grammar parser (starting at version 3.2):
Richard Socher, John Bauer, Christopher D. Manning and Andrew Y. Ng. 2013. Parsing With Compositional Vector Grammars. Proceedings of ACL 2013.
For the Shift-Reduce Constituency parser (starting at version 3.4):
This parser was written by John Bauer. You can thank him and cite the web page describing it: https://nlp.stanford.edu/software/srparser.html. You can also cite the original research papers of others mentioned on that page.
For the PCFG parser (which also does POS tagging):
Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
For the factored parser (which also does POS tagging):
Dan Klein and Christopher D. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10.
For the Universal Dependencies representation:
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In LREC 2016.
For the English Universal Dependencies converter and the enhanced English Universal Dependencies representation:
Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks. In LREC 2016.
For the (English) Stanford Dependencies representation:
Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC 2006.
For the German parser:
Anna Rafferty and Christopher D. Manning. 2008. Parsing Three German Treebanks: Lexicalized and Unlexicalized Baselines. In ACL Workshop on Parsing German.
For the Chinese parser:
Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? ACL 2003, pp. 439-446.
For the Chinese Stanford Dependencies:
Pi-Chuan Chang, Huihsin Tseng, Dan Jurafsky, and Christopher D. Manning. 2009. Discriminative Reordering with Chinese Grammatical Relations Features. In Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation.
For the Arabic parser:
Spence Green and Christopher D. Manning. 2010. Better Arabic Parsing: Baselines, Evaluations, and Analysis. In COLING 2010.
For the French parser:
Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning. 2011. Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French. In EMNLP 2011.
For the Spanish parser:
Most of the work on Spanish was by Jon Gauthier. There is no published paper, but you can thank him and/or cite this webpage: https://nlp.stanford.edu/software/spanish-faq.html
lexparser package documentation and LexicalizedParser class documentation. (Point your web browser at the index.html file in the included javadoc directory and navigate to those items.)
Java
PHP
Python/Jython
Ruby
.NET / F# / C#
OS X
| Version | Date | Description | Models |
| --- | --- | --- | --- |
| Version 4.2.0 | 2020-11-17 | Retrain English models with treebank fixes | arabic chinese english french german spanish |
| Version 4.0.0 | 2020-05-22 | Model tokenization updated to UDv2.0 | arabic chinese english french german spanish |
| Version 3.9.2 | 2018-10-17 | Updated for compatibility | arabic chinese english french german spanish |
| Version 3.9.1 | 2018-02-27 | new French and Spanish UD models, misc. UD enhancements, bug fixes | arabic chinese english french german spanish |
| Version 3.8.0 | 2017-06-09 | Updated for compatibility | arabic chinese english french german spanish |
| Version 3.7.0 | 2016-10-31 | new UD models | arabic chinese english french german spanish |
| Version 3.6.0 | 2015-12-09 | Updated for compatibility | chinese english french german spanish |
| Version 3.5.2 | 2015-04-20 | Switch to universal dependencies | shift reduce parser models |
| Version 3.5.1 | 2015-01-29 | Dependency parser fixes and model improvements | shift reduce parser models |
| Version 3.5.0 | 2014-10-31 | Upgrade to Java 8; add neural-network dependency parser | shift reduce parser models |
| Version 3.4.1 | 2014-08-27 | Add Spanish models | shift reduce parser models |
| Version 3.4 | 2014-06-16 | Shift-reduce parser, dependency improvements, French parser uses CC tagset | shift reduce parser models |
| Version 3.3.1 | 2014-01-04 | English dependency "infmod" and "partmod" combined into "vmod", other minor dependency improvements | |
| Version 3.3.0 | 2013-11-12 | English dependency "attr" removed, other dependency improvements, imperative training data added | |
| Version 3.2.0 | 2013-06-20 | New CVG based English model with higher accuracy | |
| Version 2.0.5 | 2013-04-05 | Dependency improvements, -nthreads option, ctb7 model | |
| Version 2.0.4 | 2012-11-12 | Improved dependency code extraction efficiency, other dependency changes | |
| Version 2.0.3 | 2012-07-09 | Minor bug fixes | |
| Version 2.0.2 | 2012-05-22 | Some models now support training with extra tagged, non-tree data | |
| Version 2.0.1 | 2012-03-09 | Caseless English model included, bugfix for enforced tags | |
| Version 2.0 | 2012-02-03 | Threadsafe! | |
| Version 1.6.9 | 2011-09-14 | Improved recognition of imperatives, dependencies now explicitly include a root, parser knows osprey is a noun | |
| Version 1.6.8 | 2011-06-19 | New French model, improved foreign language models, bug fixes | |
| Version 1.6.7 | 2011-05-18 | Minor bug fixes. | |
| Version 1.6.6 | 2011-04-20 | Internal code and API changes (ArrayLists rather than Sentence; use of CoreLabel objects) to match tagger and CoreNLP. | |
| Version 1.6.5 | 2010-11-30 | Further improvements to English Stanford Dependencies and other minor changes | |
| Version 1.6.4 | 2010-08-20 | More minor bug fixes and improvements to English Stanford Dependencies and question parsing | |
| Version 1.6.3 | 2010-07-09 | Improvements to English Stanford Dependencies and question parsing, minor bug fixes | |
| Version 1.6.2 | 2010-02-26 | Improvements to Arabic parser models, and to English and Chinese Stanford Dependencies | |
| Version 1.6.1 | 2008-10-26 | Slightly improved Arabic and German parsing, and Stanford Dependencies | |
| Version 1.6 | 2007-08-19 | Added Arabic, k-best PCFG parsing; improved English grammatical relations | |
| Version 1.5.1 | 2006-06-11 | Improved English and Chinese grammatical relations; fixed UTF-8 handling | |
| Version 1.5 | 2005-07-21 | Added grammatical relations output; fixed bugs introduced in 1.4 | |
| Version 1.4 | 2004-03-24 | Made PCFG faster again (by FSA minimization); added German support | |
| Version 1.3 | 2003-09-06 | Made parser over twice as fast; added tokenization options | |
| Version 1.2 | 2003-07-20 | Halved PCFG memory usage; added support for Chinese | |
| Version 1.1 | 2003-03-25 | Improved parsing speed; included GUI, improved PCFG grammar | |
| Version 1.0 | 2002-12-05 | Initial release | |
The parser can read various forms of plain text input and can output various analysis formats, including part-of-speech tagged text, phrase structure trees, and a grammatical relations (typed dependency) format. For example, consider the text:
The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

The following output shows part-of-speech tagged text, then a context-free phrase structure grammar representation, and finally a typed dependency representation. All of these are different views of the output of the parser.
The/DT strongest/JJS rain/NN ever/RB recorded/VBN in/IN India/NNP
shut/VBD down/RP the/DT financial/JJ hub/NN of/IN Mumbai/NNP ,/,
snapped/VBD communication/NN lines/NNS ,/, closed/VBD airports/NNS
and/CC forced/VBD thousands/NNS of/IN people/NNS to/TO sleep/VB in/IN
their/PRP$ offices/NNS or/CC walk/VB home/NN during/IN the/DT night/NN
,/, officials/NNS said/VBD today/NN ./.

(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))

det(rain-3, The-1)
amod(rain-3, strongest-2)
nsubj(shut-8, rain-3)
nsubj(snapped-16, rain-3)
nsubj(closed-20, rain-3)
nsubj(forced-23, rain-3)
advmod(recorded-5, ever-4)
partmod(rain-3, recorded-5)
prep_in(recorded-5, India-7)
ccomp(said-40, shut-8)
prt(shut-8, down-9)
det(hub-12, the-10)
amod(hub-12, financial-11)
dobj(shut-8, hub-12)
prep_of(hub-12, Mumbai-14)
conj_and(shut-8, snapped-16)
ccomp(said-40, snapped-16)
nn(lines-18, communication-17)
dobj(snapped-16, lines-18)
conj_and(shut-8, closed-20)
ccomp(said-40, closed-20)
dobj(closed-20, airports-21)
conj_and(shut-8, forced-23)
ccomp(said-40, forced-23)
dobj(forced-23, thousands-24)
prep_of(thousands-24, people-26)
aux(sleep-28, to-27)
xcomp(forced-23, sleep-28)
poss(offices-31, their-30)
prep_in(sleep-28, offices-31)
xcomp(forced-23, walk-33)
conj_or(sleep-28, walk-33)
dobj(walk-33, home-34)
det(night-37, the-36)
prep_during(walk-33, night-37)
nsubj(said-40, officials-39)
root(ROOT-0, said-40)
tmod(said-40, today-41)
This output was generated with the command:
java -mx200m edu.stanford.nlp.parser.lexparser.LexicalizedParser -retainTMPSubcategories -outputFormat "wordsAndTags,penn,typedDependencies" englishPCFG.ser.gz mumbai.txt
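Each typed dependency in the output above has the form relation(governor-index, dependent-index), for example nsubj(shut-8, rain-3). If you want to post-process that text output, a small reader along the following lines may be enough. This is only a sketch: the class, its fields, and the regular expression are hypothetical and not part of the parser's API, and it assumes plain numeric token indices with exactly one space after the comma.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical reader for lines such as "nsubj(shut-8, rain-3)"; not Stanford Parser code. */
final class TypedDependencyLineSketch {

    /** One parsed line: relation(governor-governorIndex, dependent-dependentIndex). */
    static final class Dependency {
        final String relation;
        final String governor;
        final int governorIndex;
        final String dependent;
        final int dependentIndex;

        Dependency(String relation, String governor, int governorIndex,
                   String dependent, int dependentIndex) {
            this.relation = relation;
            this.governor = governor;
            this.governorIndex = governorIndex;
            this.dependent = dependent;
            this.dependentIndex = dependentIndex;
        }
    }

    // Greedy word groups backtrack to the last "-digits", so hyphenated words still work.
    private static final Pattern LINE =
            Pattern.compile("([^(]+)\\((.+)-(\\d+), (.+)-(\\d+)\\)");

    /** Parses one dependency line, or throws if the line does not match the expected shape. */
    static Dependency parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a typed dependency line: " + line);
        }
        return new Dependency(
                m.group(1),
                m.group(2), Integer.parseInt(m.group(3)),
                m.group(4), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        Dependency d = parse("nsubj(shut-8, rain-3)");
        System.out.println(d.relation + ": " + d.dependent + " (token " + d.dependentIndex
                + ") depends on " + d.governor + " (token " + d.governorIndex + ")");
    }
}
```

For anything beyond quick text processing, working with the parser's Java API objects directly is likely to be more robust than re-reading its printed output.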