Stanford Encyclopedia of Philosophy

Computational Linguistics

First published Thu Feb 6, 2014; substantive revision Wed Feb 26, 2014

“Human knowledge is expressed in language. So computational linguistics is very important.” –Mark Steedman, ACL Presidential Address (2007)

Computational linguistics is the scientific and engineering discipline concerned with understanding written and spoken language from a computational perspective, and building artifacts that usefully process and produce language, either in bulk or in a dialogue setting. To the extent that language is a mirror of mind, a computational understanding of language also provides insight into thinking and intelligence. And since language is our most natural and most versatile means of communication, linguistically competent computers would greatly facilitate our interaction with machines and software of all sorts, and put at our fingertips, in ways that truly meet our needs, the vast textual and other resources of the internet.

The following article outlines the goals and methods of computational linguistics (in historical perspective), and then delves in some detail into the essential concepts of linguistic structure and analysis (section 2), interpretation (sections 3–5), and language use (sections 6–7), as well as acquisition of knowledge for language (section 8), statistical and machine learning techniques in natural language processing (section 9), and miscellaneous applications (section 10).


1. Introduction: Goals and methods of computational linguistics

1.1 Goals of computational linguistics

The theoretical goals of computational linguistics include the formulation of grammatical and semantic frameworks for characterizing languages in ways enabling computationally tractable implementations of syntactic and semantic analysis; the discovery of processing techniques and learning principles that exploit both the structural and distributional (statistical) properties of language; and the development of cognitively and neuroscientifically plausible computational models of how language processing and learning might occur in the brain.

The practical goals of the field are broad and varied. Some of the most prominent are: efficient text retrieval on some desired topic; effective machine translation (MT); question answering (QA), ranging from simple factual questions to ones requiring inference and descriptive or discursive answers (perhaps with justifications); text summarization; analysis of texts or spoken language for topic, sentiment, or other psychological attributes; dialogue agents for accomplishing particular tasks (purchases, technical troubleshooting, trip planning, schedule maintenance, medical advising, etc.); and ultimately, creation of computational systems with human-like competency in dialogue, in acquiring language, and in gaining knowledge from text.

1.2 Methods of computational linguistics

The methods employed in theoretical and practical research in computational linguistics have often drawn upon theories and findings in theoretical linguistics, philosophical logic, cognitive science (especially psycholinguistics), and of course computer science. However, early work from the mid-1950s to around 1970 tended to be rather theory-neutral, the primary concern being the development of practical techniques for such applications as MT and simple QA. In MT, central issues were lexical structure and content, the characterization of “sublanguages” for particular domains (for example, weather reports), and the transduction from one language to another (for example, using rather ad hoc graph transformation grammars or transfer grammars). In QA, the concern was with characterizing the question patterns encountered in a specific domain, and the relationship of these question patterns to the forms in which answers might be stored, for instance in a relational database.

By the mid-1960s a number of researchers, emboldened by the increasing power and availability of general-purpose computers, and inspired by the dream of human-level artificial intelligence, were designing systems aimed at genuine language understanding and dialogue. The techniques and theoretical underpinnings employed varied greatly. An example of a program minimally dependent on linguistic or cognitive theory was Joseph Weizenbaum's ELIZA program, intended to emulate (or perhaps caricature) a Rogerian psychiatrist. ELIZA relied on matching user inputs to stored patterns (brief word sequences interspersed with numbered slots, to be filled from the input), and returned one of a set of output templates associated with the matched input pattern, instantiated with material from the input. While ELIZA and its modern chatbot descendants are often said to rely on mere trickery, it can be argued that human verbal behavior is to some degree reflexive in the manner of ELIZA, i.e., we function in a “preprogrammed” or formulaic manner in certain situations, for example, in exchanging greetings, or in responding at a noisy party to comments whose contents, apart from an occasional word, eluded us.

A very different perspective on linguistic processing was proffered in the early years by researchers who took their cue from ideas about associative processes in the brain. For example, M. Ross Quillian (1968) proposed a model of word sense disambiguation based on “spreading activation” in a network of concepts (typically corresponding to senses of nouns) interconnected through relational links (typically corresponding to senses of verbs or prepositions). Variants of this “semantic memory” model were pursued by researchers such as Rumelhart, Lindsay and Norman (1972), and remain an active research paradigm in computational models of language and cognition. Another psychologically inspired line of work was initiated in the 1960s and pursued for over two decades by Roger Schank and his associates, but in his case the goal was full story understanding and inferential question answering. A central tenet of the work was that the representation of sentential meaning, as well as of world knowledge, centered around a few (e.g., 11) action primitives, and inference was driven by rules associated primarily with these primitives (a prominent exponent of a similar view was Yorick Wilks). Perhaps the most important aspect of Schank's work was the recognition that language understanding and inference were heavily dependent on a large store of background knowledge, including knowledge of numerous “scripts” (prototypical ways in which familiar kinds of complex events, such as dining at a restaurant, unfold) and plans (prototypical ways in which people attempt to accomplish their goals) (Schank & Abelson 1977).

More purely AI-inspired approaches that also emerged in the 1960s were exemplified in systems such as Sad Sam (Lindsay 1963), SIR (Raphael 1968), and Student (Bobrow 1968). These featured devices such as pattern matching/transduction for analyzing and interpreting restricted subsets of English, knowledge in the form of relational hierarchies and attribute-value lists, and QA methods based on graph search, formal deduction protocols, and numerical algebra. An influential idea that emerged slightly later was that knowledge in AI systems should be framed procedurally rather than declaratively—to know something is to be able to perform certain functions (Hewitt 1969). Two quite impressive systems that exemplified such a methodology were SHRDLU (Winograd 1972) and LUNAR (Woods et al. 1972), which contained sophisticated proceduralized grammars and syntax-to-semantics mapping rules, and were able to function fairly robustly in their “micro-domains” (simulated blocks on a table, and a lunar rock database, respectively). In addition, SHRDLU featured significant planning abilities, enabled by the MICROPLANNER goal-chaining language (a precursor of Prolog). Difficulties that remained for all of these approaches were extending linguistic coverage and the reliability of parsing and interpretation, and most of all, moving from microdomains, or coverage of a few paragraphs of text, to more varied, broader domains. Much of the difficulty of scaling up was attributed to the “knowledge acquisition bottleneck”—the difficulty of coding or acquiring the myriad facts and rules evidently required for more general understanding. Classic collections containing several articles on the early work mentioned in the last two paragraphs are Marvin Minsky's Semantic Information Processing (1968) and Schank and Colby's Computer Models of Thought and Language (1973).

Since the 1970s, there has been a gradual trend away from purely procedural approaches to ones aimed at encoding the bulk of linguistic and world knowledge in more understandable, modular, re-usable forms, with firmer theoretical foundations. This trend was enabled by the emergence of comprehensive syntactico-semantic frameworks such as Generalized Phrase Structure Grammar (GPSG), Head-driven Phrase Structure Grammar (HPSG), Lexical-Functional Grammar (LFG), Tree-Adjoining Grammar (TAG), and Combinatory Categorial Grammar (CCG), where in each case close theoretical attention was paid both to the computational tractability of parsing and to the mapping from syntax to semantics. Among the most important developments in the latter area were Richard Montague's profound insights into the logical (especially intensional) semantics of language, and Hans Kamp's and Irene Heim's development of Discourse Representation Theory (DRT), offering a systematic, semantically formal account of anaphora in language.

A major shift in nearly all aspects of natural language processing began in the late 1980s and was virtually complete by the end of 1995: this was the shift to corpus-based, statistical approaches (signalled for instance by the appearance of two special issues on the subject by the quarterly Computational Linguistics in 1993). The new paradigm was enabled by the increasing availability and burgeoning volume of machine-readable text and speech data, and was driven forward by the growing awareness of the importance of the distributional properties of language, the development of powerful new statistically based learning techniques, and the hope that these techniques would overcome the scalability problems that had beset computational linguistics (and more broadly AI) since its beginnings.

The corpus-based approach has indeed been quite successful in producing comprehensive, moderately accurate speech recognizers, part-of-speech (POS) taggers, parsers for learned probabilistic phrase-structure grammars, and even MT and text-based QA systems and summarization systems. However, semantic processing has been restricted to rather shallow aspects, such as extraction of specific data concerning specific kinds of events from text (e.g., location, date, perpetrators, victims, etc., of terrorist bombings) or extraction of clusters of argument types, relational tuples, or paraphrase sets from text corpora. Currently, the corpus-based, statistical approaches are still dominant, but there appears to be a growing movement towards integration of formal logical approaches to language with corpus-based statistical approaches in order to achieve deeper understanding and more intelligent behavior in language comprehension and dialogue systems. There are also efforts to combine connectionist and neural-net approaches with symbolic and logical ones. The following sections will elaborate on many of the topics touched on above. General references for computational linguistics are Allen 1995, Jurafsky and Martin 2009, and Clark et al. 2010.

2. Syntax and parsing

2.1 The structural hierarchy

Language is structured at multiple levels, beginning in the case of spoken language with patterns in the acoustic signal that can be mapped to phones (the distinguishable successive sounds of which languages are built up). Groups of phones that are equivalent for a given language (not affecting the words recognized by a hearer, if interchanged) are the phonemes of the language. The phonemes in turn are the constituents of morphemes (minimal meaningful word segments), and these provide the constituents of words. (In written language one speaks instead of characters, graphemes, syllables, and words.) Words are grouped into phrases, such as noun phrases, verb phrases, adjective phrases, and prepositional phrases, which are the structural components of sentences, expressing complete thoughts. At still higher levels we have various types of discourse structure, though this is generally looser than lower-level structure.

Techniques have been developed for language analysis at all of these structural levels, though space limitations will not permit a serious discussion of methods used below the word level. It should be noted, however, that the techniques developed for speech recognition in the 1980s and 1990s were very influential in turning NLP research towards the new corpus-based, statistical approach referred to above. One key idea was that of hidden Markov models (HMMs), which model “noisy” sequences (e.g., phone sequences, phoneme sequences, or word sequences) as if generated probabilistically by “hidden” underlying states and their transitions. Individually or in groups, successive hidden states model the more abstract, higher-level constituents to be extracted from observed noisy sequences, such as phonemes from phones, words from phonemes, or parts of speech from words. The generation probabilities and the state transition probabilities are the parameters of such models, and importantly these can be learned from training data. Subsequently the models can be efficiently applied to the analysis of new data, using fast dynamic programming algorithms such as the Viterbi algorithm. These quite successful techniques were subsequently generalized to higher-level structure, soon influencing all aspects of NLP.
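The Viterbi decoding just mentioned can be sketched as follows. This is a minimal illustration, not any particular system's implementation: the states are POS tags, the observations are words, and all probabilities are invented toy numbers rather than values estimated from a corpus.

```python
# Minimal sketch of Viterbi decoding for a toy HMM.
# States = POS tags; observations = words; all probabilities invented.
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for `obs`."""
    # V[t][s] = (best probability of reaching state s at time t, back-pointer)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t-1][p][0] * trans_p[p][s])
            V[t][s] = (V[t-1][best_prev][0] * trans_p[best_prev][s]
                       * emit_p[s].get(obs[t], 0.0), best_prev)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

states = ["Det", "N", "V"]
start_p = {"Det": 0.6, "N": 0.3, "V": 0.1}
trans_p = {"Det": {"Det": 0.05, "N": 0.9, "V": 0.05},
           "N":   {"Det": 0.1,  "N": 0.2, "V": 0.7},
           "V":   {"Det": 0.6,  "N": 0.3, "V": 0.1}}
emit_p  = {"Det": {"the": 0.7, "a": 0.3},
           "N":   {"dog": 0.4, "walk": 0.1, "park": 0.5},
           "V":   {"walks": 0.6, "dog": 0.1, "park": 0.3}}

print(viterbi(["the", "dog", "walks"], states, start_p, trans_p, emit_p))
# → ['Det', 'N', 'V']
```

The dynamic programming table makes the cost linear in sentence length (times the square of the number of states), which is what makes HMM decoding practical for speech and tagging.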

2.2 Syntax

Before considering how grammatical structure can be represented, analyzed, and used, we should ask what basis we might have for considering a particular grammar “correct”, or a particular sentence “grammatical”, in the first place. Of course, these are primarily questions for linguistics proper, but the answers we give certainly have consequences for computational linguistics.

Traditionally, formal grammars have been designed to capture linguists' intuitions about well-formedness as concisely as possible, in a way that also allows generalizations about a particular language (e.g., subject-auxiliary inversion in English questions) and across languages (e.g., a consistent ordering of nominal subject, verb, and nominal object for declarative, pragmatically neutral main clauses). Concerning linguists' specific well-formedness judgments, it is worth noting that these are largely in agreement not only with each other, but also with judgments of non-linguists—at least for “clearly grammatical” and “clearly ungrammatical” sentences (Pinker 2007). Also, the discovery that conventional phrase structure supports elegant compositional theories of meaning lends credence to the traditional theoretical methodology.

However, traditional formal grammars have generally not covered any one language comprehensively, and have drawn sharp boundaries between well-formedness and ill-formedness, when in fact people's (including linguists') grammaticality judgments for many sentences are uncertain or equivocal. Moreover, when we seek to process sentences “in the wild”, we would like to accommodate regional, genre-specific, and register-dependent variations in language, dialects, and erroneous and sloppy language (e.g., misspellings, unpunctuated run-on sentences, hesitations and repairs in speech, faulty constituent orderings produced by non-native speakers, and fossilized errors by native speakers, such as “for you and I”—possibly a product of schoolteachers inveighing against “you and me” in subject position). Consequently, linguists' idealized grammars need to be made variation-tolerant in most practical applications. The way this need has typically been met is by admitting a far greater number of phrase structure rules than linguistic parsimony would sanction—say, 10,000 or more rules instead of a few hundred. These rules are not directly supplied by linguists (computational or otherwise), but rather can be “read off” corpora of written or spoken language that have been decorated by trained annotators (such as linguistics graduate students) with their basic phrasal tree structure. Unsupervised grammar acquisition (often starting with POS-tagged training corpora) is another avenue (see section 9), but results are apt to be less satisfactory. In conjunction with statistical training and parsing techniques, this loosening of grammar leads to a rather different conception of what comprises a grammatically flawed sentence: it is not necessarily one rejected by the grammar, but one whose analysis requires some rarely used rules.

As mentioned in section 1.2, the representations of grammars used in computational linguistics have varied from procedural ones to ones developed in formal linguistics, and systematic, tractably parsable variants developed by computationally oriented linguists. Winograd's SHRDLU program, for example, contained code in his PROGRAMMAR language expressing,

To parse a sentence, try parsing a noun phrase (NP); if this fails, return NIL, otherwise try parsing a verb phrase (VP) next, and if this fails, or succeeds with words remaining, return NIL; otherwise return success.

Similarly, Woods' grammar for LUNAR was based on a certain kind of procedurally interpreted transition graph (an augmented transition network, or ATN), where the sentence subgraph might contain an edge labeled NP (analyze an NP using the NP subgraph) followed by an edge labeled VP (analogously interpreted). In both cases, local feature values (e.g., the number and person of a NP and VP) are registered, and checked for agreement as a condition for success. A closely related formalism is that of definite clause grammars (e.g., Pereira & Warren 1982), which employ Prolog to assert “facts” such as that if the input word sequence contains an NP reaching from index I1 to index I2 and a VP reaching from index I2 to index I3, then the input contains a sentence reaching from index I1 to index I3. (Again, feature agreement constraints can be incorporated into such assertions as well.) Given the goal of proving the presence of a sentence, the goal-chaining mechanism of Prolog then provides a procedural interpretation of these assertions.
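The definite-clause idea can be transcribed into Python as a sketch (Prolog would state it declaratively; here the goal-chaining is written out as recursive calls over word-position indices). The tiny lexicon and rules are invented purely for illustration.

```python
# Sketch of the definite-clause-grammar idea: an S spans positions
# i1..i3 iff an NP spans i1..i2 and a VP spans i2..i3, for some i2.
# Toy lexicon, invented for illustration.
LEX = {"Thetis": "Name", "loves": "V", "a": "Det", "mortal": "N"}

def parse_np(words, i):
    """Return end indices j such that words[i:j] is an NP."""
    ends = []
    if i < len(words) and LEX.get(words[i]) == "Name":      # NP -> Name
        ends.append(i + 1)
    if (i + 1 < len(words) and LEX.get(words[i]) == "Det"   # NP -> Det N
            and LEX.get(words[i + 1]) == "N"):
        ends.append(i + 2)
    return ends

def parse_vp(words, i):
    """VP -> V NP: return end indices of VPs starting at i."""
    if i < len(words) and LEX.get(words[i]) == "V":
        return parse_np(words, i + 1)
    return []

def parse_s(words):
    """S -> NP VP, required to span the whole input."""
    return any(j == len(words)
               for i2 in parse_np(words, 0)
               for j in parse_vp(words, i2))

print(parse_s("Thetis loves a mortal".split()))  # → True
```

Returning all possible end indices, rather than the first, is what lets the search backtrack through alternative constituent boundaries, mirroring Prolog's search over alternative proofs.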

At present the most commonly employed declarative representations of grammatical structure are context-free grammars (CFGs) as defined by Noam Chomsky (1956, 1957), because of their simplicity and efficient parsability. Chomsky had argued that only deep linguistic representations are context-free, while surface form is generated by transformations (for example, in English passivization and in question formation) that result in a non-context-free language. However, it was later shown that on the one hand, unrestricted Chomskian transformational grammars allowed for computationally intractable and even undecidable languages, and on the other, that the phenomena regarded by Chomsky as calling for a transformational analysis could be handled within a context-free framework by use of suitable features in the specification of syntactic categories. Notably, unbounded movement, such as the apparent movement of the final verb object to the front of the sentence in “Which car did Jack urge you to buy?”, was shown to be analyzable in terms of a gap (or slash) feature of type /NP[wh] that is carried by each of the two embedded VPs, providing a pathway for matching the category of the fronted object to the category of the vacated object position. Within non-transformational grammar frameworks, one therefore speaks of unbounded (or long-distance) dependencies instead of unbounded movement. At the same time it should be noted that at least some natural languages have been shown to be mildly context-sensitive (e.g., Dutch and Swiss German exhibit cross-serial dependencies, where a series of nominals “NP1 NP2 NP3 …” need to be matched, in the same order, with a subsequent series of verbs, “V1 V2 V3 …”). Grammatical frameworks that seem to allow for approximately the right degree of mild context sensitivity include Head Grammar, Tree-Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), and Linear Indexed Grammar (LIG).
Head grammars allow insertion of a complement between the head of a phrase (e.g., the initial verb of a VP, the final noun of a NP, or the VP of a sentence) and an already present complement; they were a historical predecessor of Head-Driven Phrase Structure Grammar (HPSG), a type of unification grammar (see below) that has received much attention in computational linguistics. However, unrestricted HPSG can generate the recursively enumerable (in general only semi-decidable) languages.

A typical (somewhat simplified) sample fragment of a context-free grammar is the following, where phrase types are annotated with feature-value pairs:

S[vform:v] → NP[pers:p numb:n case:subj] VP[vform:v pers:p numb:n]
VP[vform:v pers:p numb:n] → V[subcat:_np vform:v pers:p numb:n] NP[case:obj]
NP[pers:3 numb:n] → Det[pers:3 numb:n] N[numb:n]
NP[numb:n pers:3 case:c] → Name[numb:n pers:3 case:c]

Here v, n, p, c are variables that can assume values such as ‘past’, ‘pres’, ‘base’, ‘pastparticiple’, … (i.e., various verb forms), ‘1’, ‘2’, ‘3’ (1st, 2nd, and 3rd person), ‘sing’, ‘plur’, and ‘subj’, ‘obj’. The subcat feature indicates the complement requirements of the verb. The lexicon would supply entries such as

V[subcat:_np vform:pres numb:sing pers:3] → loves
Det[pers:3 numb:sing] → a
N[pers:3 numb:sing] → mortal
Name[pers:3 numb:sing gend:fem case:subj] → Thetis,

allowing, for example, a phrase structure analysis of the sentence “Thetis loves a mortal” (where we have omitted the feature names for simplicity, leaving only their values, and ignored the case feature):

[A tree diagram: at the top, S[pres]; a line connects the top node first to NP[3 sing subj], which connects to Name[3 sing subj], which connects to ‘Thetis’. A second line from the top node connects to VP[pres 3 sing], which in turn first connects to V[_np pres 3 sing], which connects to ‘loves’. Second, it connects to NP[3 sing], which in turn connects to Det[3 sing] (and that to ‘a’) and N[3 sing] (and that to ‘mortal’).]
Figure 1: Syntactic analysis of a sentence as a parse tree
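The role of the feature annotations can be sketched as a small unification check in Python. This is an illustrative fragment only (the lexical entries mirror a subset of the grammar above; the extra entry for ‘love’ is a hypothetical addition used to show an agreement failure).

```python
# Sketch of feature agreement: two feature bundles unify only if
# they assign no conflicting values to any shared feature.
def unify(f1, f2):
    """Merge two feature dicts; return None on a value clash."""
    merged = dict(f1)
    for k, v in f2.items():
        if k in merged and merged[k] != v:
            return None
        merged[k] = v
    return merged

# Lexical entries (subset of the fragment's lexicon; 'love' is a
# hypothetical extra entry, 1st person plural).
thetis = {"cat": "Name", "pers": "3", "numb": "sing", "case": "subj"}
loves  = {"cat": "V", "subcat": "_np", "vform": "pres", "pers": "3", "numb": "sing"}
love   = {"cat": "V", "subcat": "_np", "vform": "pres", "pers": "1", "numb": "plur"}

def subj_verb_agree(np, v):
    """S -> NP VP requires the NP and the verb to share pers and numb."""
    return unify({k: np[k] for k in ("pers", "numb")},
                 {k: v[k] for k in ("pers", "numb")}) is not None

print(subj_verb_agree(thetis, loves))  # → True  (agreement holds)
print(subj_verb_agree(thetis, love))   # → False (person/number clash)
```

The same unification step, applied rule by rule, is what licenses the tree in figure 1 while ruling out strings like “Thetis love a mortal”.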

As a variant of CFGs, dependency grammars (DGs) also enjoy wide popularity. The difference from CFGs is that hierarchical grouping is achieved by directly subordinating words to words (allowing for multiple dependents of a head word), rather than phrases to phrases. For example, in the sentence of figure 1 we would treat Thetis and mortal as dependents of loves, using dependency links labeled subj and obj respectively, and the determiner a would in turn be a dependent of mortal, via a dependency link mod (for modifier). Projective dependency grammars are ones with no crossing dependencies (so that the descendants of a node form a continuous text segment), and these generate the same languages as CFGs. Significantly, mildly non-projective dependency grammars, allowing a head word to dominate two separated blocks, provide the same generative capacity as the previously mentioned mildly context-sensitive frameworks that are needed for some languages (Kuhlmann 2013).
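The dependency analysis described above, and the no-crossing condition that defines projectivity, can be sketched with a simple data structure (the encoding of arcs as a position-to-head map is our own illustrative choice, not a standard from the text):

```python
# The dependency tree for "Thetis loves a mortal" described above:
# each word position maps to (head position, relation label);
# the root ("loves") gets head 0.
# Positions: 1=Thetis  2=loves  3=a  4=mortal
arcs = {1: (2, "subj"), 2: (0, "root"), 3: (4, "mod"), 4: (2, "obj")}

def is_projective(arcs):
    """True iff no two dependency arcs cross."""
    spans = [tuple(sorted((dep, head)))
             for dep, (head, _) in arcs.items() if head != 0]
    # Arcs (a,b) and (c,d) cross iff one starts strictly inside the
    # other and ends strictly outside it.
    return not any(a < c < b < d or c < a < d < b
                   for i, (a, b) in enumerate(spans)
                   for (c, d) in spans[i+1:])

print(is_projective(arcs))  # → True: the example tree is projective
```

A tree with arcs 1→3 and 2→4, by contrast, would be non-projective, since the two arcs interleave; such structures are what the mildly non-projective grammars mentioned above are designed to admit in a controlled way.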

As noted at the beginning of this section, traditional formal grammars proved too limited in coverage and too rigid in their grammaticality criteria to provide a basis for robust coverage of natural languages as actually used, and this situation persisted until the advent of probabilistic grammars derived from sizable phrase-bracketed corpora (notably the Penn Treebank). The simplest example of this type of grammar is a probabilistic context-free grammar or PCFG. In a PCFG, each phrase structure rule X → Y1 … Yk is assigned a probability, viewed as the probability that a constituent of type X will be expanded into a sequence of (immediate) constituents of types Y1, …, Yk. At the lowest level, the expansion probabilities specify how frequently a given part of speech (such as Det, N, or V) will be realized as a particular word. Such a grammar provides not only a structural but also a distributional model of language, predicting the frequency of occurrence of various phrase sequences and, at the lowest level, word sequences.
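Concretely, the probability a PCFG assigns to a parse tree is the product of the probabilities of the rules used in it. The sketch below illustrates this with invented rule probabilities (not estimated from any treebank) over the “Thetis loves a mortal” example:

```python
# Rule probabilities for a toy PCFG (numbers invented for illustration).
# Key: (left-hand side, tuple of right-hand-side symbols).
rule_p = {
    ("S",    ("NP", "VP")):  1.0,
    ("NP",   ("Name",)):     0.3,
    ("NP",   ("Det", "N")):  0.7,
    ("VP",   ("V", "NP")):   0.6,
    ("Name", ("Thetis",)):   0.01,
    ("V",    ("loves",)):    0.05,
    ("Det",  ("a",)):        0.4,
    ("N",    ("mortal",)):   0.02,
}

def tree_prob(tree):
    """tree = (label, child, child, ...); leaves are bare strings.
    Multiply the probabilities of all rules used in the tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_p[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

parse = ("S", ("NP", ("Name", "Thetis")),
              ("VP", ("V", "loves"),
                     ("NP", ("Det", "a"), ("N", "mortal"))))
print(tree_prob(parse))
```

Since every sentence generated receives a probability this way, the grammar doubles as a distributional model: summing tree probabilities over all trees with a given yield gives the probability of that word sequence.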

However, the simplest models of this type do not model the statistics of actual language corpora very accurately, because the expansion probabilities for a given phrase type (or part of speech) X ignore the surrounding phrasal context and the more detailed properties (such as head words) of the generated constituents. Yet context and detailed properties are very influential; for example, whether the final prepositional phrase in “She detected a star with {binoculars, planets}” modifies detected or planets is very dependent on word choice. Such modeling inaccuracies lead to parsing inaccuracies (see next subsection), and therefore generative grammar models have been refined in various ways, for example (in so-called lexicalized models) allowing for specification of particular phrasal head words in rules, or (in tree substitution grammars) allowing expansion of nonterminals into subtrees of depth 2 or more. Nevertheless, it seems likely that fully accurate distributional modeling of language would need to take account of semantic content, discourse structure, and intentions in communication, not only of phrase structure. Possibly construction grammars (e.g., Goldberg 2003), which emphasize the coupling between the entrenched patterns of language (including ordinary phrase structure, clichés, and idioms) and their meanings and discourse function, will provide a conceptual basis for building statistical models of language that are sufficiently accurate to enable more nearly human-like parsing accuracy.

2.3 Parsing

Natural language analysis in the early days of AI tended to rely on template matching, for example, matching templates such as (X has Y) or (how many Y are there on X) to the input to be analyzed. This of course depended on having a very restricted discourse and task domain. By the late 1960s and early 70s, quite sophisticated recursive parsing techniques were being employed. For example, Woods' LUNAR system used a top-down recursive parsing strategy interpreting an ATN in the manner roughly indicated in section 2.2 (though ATNs in principle allow other parsing styles). It also saved recognized constituents in a table, much like the class of parsers we are about to describe. Later parsers were influenced by the efficient and conceptually elegant CFG parsers described by Jay Earley (1970) and (separately) by John Cocke, Tadao Kasami, and Daniel Younger (e.g., Younger 1967). The latter algorithm, termed the CYK or CKY algorithm for the three separate authors, was particularly simple, using a bottom-up dynamic programming approach to first identify and tabulate the possible types (nonterminal labels) of sentence segments of length 1 (i.e., words), then the possible types of sentence segments of length 2, and so on, always building on the previously discovered segment types to recognize longer phrases. This process runs in cubic time in the length of the sentence, and a parse tree can be constructed from the tabulated constituents in quadratic time. The CYK algorithm assumes a Chomsky Normal Form (CNF) grammar, allowing only productions of the form Np → Nq Nr or Np → w, i.e., generation of two nonterminals or a word from any given nonterminal. This is only a superficial limitation, because arbitrary CF grammars are easily converted to CNF.
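The CYK recognizer can be written in a few lines. Below is a minimal sketch over a toy CNF grammar (the grammar itself is invented for illustration; note that names like “Thetis” are introduced directly by a unary rule, as CNF permits):

```python
# Minimal CYK recognizer for a toy grammar in Chomsky Normal Form.
# Unary rules: A -> word.  Binary rules: A -> B C.  (Grammar invented.)
unary  = {("Det", "a"), ("N", "mortal"), ("NP", "Thetis"), ("V", "loves")}
binary = {("S", "NP", "VP"), ("VP", "V", "NP"), ("NP", "Det", "N")}

def cyk(words):
    n = len(words)
    # table[i][j] = set of nonterminals deriving words[i..j] inclusive
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                    # segments of length 1
        table[i][i] = {A for (A, word) in unary if word == w}
    for length in range(2, n + 1):                   # longer segments
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):                    # split point
                for (A, B, C) in binary:
                    if B in table[i][k] and C in table[k + 1][j]:
                        table[i][j].add(A)
    return "S" in table[0][n - 1]

print(cyk("Thetis loves a mortal".split()))   # → True
print(cyk("loves Thetis a".split()))          # → False
```

The three nested loops over span length, start position, and split point are the source of the cubic running time; keeping back-pointers alongside each table entry is all that is needed to recover an actual parse tree.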

The method most frequently employed nowadays in fully analyzing sentential structure is chart parsing. This is a conceptually simple and efficient dynamic programming method closely related to the algorithms just mentioned; i.e., it begins by assigning possible analyses to the smallest constituents and then inferring larger constituents based on these, until an instance of the top-level category (usually S) is found that spans the given text or text segment. There are many variants, depending on whether only complete constituents are posited or incomplete ones as well (to be progressively extended), and whether we proceed left-to-right through the word stream or in some other order (e.g., some seemingly best-first order). A common variant is a left-corner chart parser, in which partial constituents are posited whenever their “left corner”—i.e., the leftmost constituent on the right-hand side of a rule—is already in place. Newly completed constituents are placed on an agenda, and items are successively taken off the agenda and used if possible as left corners of new, higher-level constituents, and to extend partially completed constituents. At the same time, completed constituents (or rather, categories) are placed in a chart, which can be thought of as a triangular table of width n and height n (the number of words processed), where the cell at indices (i, j), with j > i, contains the categories of all complete constituents so far verified reaching from position i to position j in the input. The chart is used both to avoid duplication of constituents already built, and ultimately to reconstruct one or more global structural analyses. (If all possible chart entries are built, the final chart will allow reconstruction of all possible parses.) Chart-parsing methods carry over to PCFGs essentially without change, still running within a cubic time bound in terms of sentence length. An extra task is maintaining probabilities of completed chart entries (and perhaps bounds on probabilities of incomplete entries, for pruning purposes).

Because of their greater expressiveness, TAGs and CCGs are harder to parse in the worst case (O(n⁶)) than CFGs and projective DGs (O(n³)), at least with current algorithms (see Vijay-Shankar & Weir 1994 for parsing algorithms for TAG, CCG, and LIG based on bottom-up dynamic programming). However, it does not follow that TAG parsing or CCG parsing is impractical for real grammars and real language, and in fact parsers exist for both that are competitive with more common CFG-based parsers.

Finally, we mention connectionist models of parsing, which perform syntactic analysis using layered (artificial) neural nets (ANNs, NNs) (see Palmer-Brown et al. 2002, Mayberry and Miikkulainen 2008, and Bengio 2008 for surveys). There is typically a layer of input units (nodes), one or more layers of hidden units, and an output layer, where each layer has (excitatory and inhibitory) connections forward to the next layer, typically conveying evidence for higher-level constituents to that layer. There may also be connections within a hidden layer, implementing cooperation or competition among alternatives. A linguistic entity such as a phoneme, word, or phrase of a particular type may be represented within a layer either by a pattern of activation of units in that layer (a distributed representation) or by a single activated unit (a localist representation).

One of the problems that connectionist models need to confront is thatinputs are temporally sequenced, so that in order to combineconstituent parts, the network must retain information about recentlyprocessed parts. Two possible approaches are the use ofsimplerecurrent networks (SRNs)and, in localist networks, sustainedactivation. SRNs use one-to-one feedback connections from the hiddenlayer to specialcontext unitsaligned with the previous layer(normally the input layer or perhaps a secondary hidden layer), ineffect storing their current outputs in those context units. Thus atthe next cycle, the hidden units can use their own previous outputs,along with the new inputs from the input layer, to determine their nextoutputs. In localist models it is common to assume that once a unit(standing for a particular concept) becomes active, it stays active forsome length of time, so that multiple concepts corresponding tomultiple parts of the same sentence, and their properties, can besimultaneously active. A problem that arises is how the properties ofan entity that are active at a given point in time can be properly tiedto that entity, and not to other activated entities. (This is thevariable binding problem, which has spawned a variety ofapproaches—see Browne and Sun 1999). One solution is to assumethat unit activation consists of pulses emitted at a globally fixedfrequency, and pulse trains that are in phase with one anothercorrespond to the same entity (e.g., see Henderson 1994). Muchcurrent connectionist research borrows from symbolic processingperspectives, by assuming that parsing assigns linguistic phrasestructures to sentences, and treating the choice of a structure assimultaneous satisfaction of symbolic linguistic constraints (orbiases). 
Also, more radical forms of hybridization and modularization are being explored, such as interfacing a NN parser to a symbolic stack, using a neural net to learn the probabilities needed in a statistical parser, or interconnecting the parser network with separate prediction networks and learning networks. For an overview of connectionist sentence processing and some hybrid methods, see Crocker (2010).
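The context-unit mechanism of an SRN described above can be sketched in a few lines of code. This is a minimal illustrative sketch, not any particular published parser: the layer sizes, weights, and inputs are all invented, and no learning is shown—only the copy-back of hidden-unit outputs to context units that lets the network retain information about recently processed parts.

```python
import numpy as np

# Minimal sketch of a simple recurrent network (SRN / Elman-style net):
# at each step the hidden layer sees the current input plus its own
# previous outputs, stored in "context units". Sizes and weights are
# illustrative only.
rng = np.random.default_rng(0)

n_in, n_hid, n_out = 5, 8, 3            # e.g., word features -> constituent evidence
W_ih = rng.normal(size=(n_hid, n_in))   # input -> hidden
W_ch = rng.normal(size=(n_hid, n_hid))  # context (previous hidden) -> hidden
W_ho = rng.normal(size=(n_out, n_hid))  # hidden -> output

def srn_run(inputs):
    """Process a temporal sequence of input vectors, carrying state forward."""
    context = np.zeros(n_hid)           # context units start inactive
    outputs = []
    for x in inputs:
        hidden = np.tanh(W_ih @ x + W_ch @ context)
        outputs.append(W_ho @ hidden)
        context = hidden                # one-to-one copy-back to context units
    return outputs

sentence = [rng.normal(size=n_in) for _ in range(4)]  # four "word" vectors
outs = srn_run(sentence)
```

Because the hidden state is fed back, the output at each step depends on the whole prefix of the sequence, not just the current input.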

2.4 Coping with syntactic ambiguity

If natural language were structurally unambiguous with respect to some comprehensive, effectively parsable grammar, our parsing technology would presumably have attained human-like accuracy some time ago, instead of levelling off at about 90% constituent recognition accuracy. In fact, however, language is ambiguous at all structural levels: at the level of speech sounds (“recognize speech” vs. “wreck a nice beach”); morphology (“un-wrapped” vs. “unwrap-ped”); word category (round as an adjective, noun, verb or adverb); compound word structure (wild goose chase); phrase category (nominal that-clause vs. relative clause in “the idea that he is entertaining”); and modifier (or complement) attachment (“He hit the man with the baguette”). The parenthetical examples here have been chosen so that their ambiguity is readily noticeable, but ambiguities are far more abundant than is intuitively apparent, and the number of alternative analyses of a moderately long sentence can easily run into the thousands.
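The combinatorics behind the claim that analyses "can easily run into the thousands" can be made concrete: if every binary bracketing of a word sequence were grammatically possible, the count would grow as the Catalan numbers. The sketch below computes this worst-case bound; real grammars rule out many bracketings, so this is only an upper-bound illustration, not a parser statistic from the text.

```python
from math import comb

def catalan(n):
    """n-th Catalan number: the count of distinct binary-branching
    trees with n internal nodes, i.e., over a sequence of n+1 words."""
    return comb(2 * n, n) // (n + 1)

# Worst-case bracketing counts for sentences of 5, 8, and 12 words:
counts = {n_words: catalan(n_words - 1) for n_words in (5, 8, 12)}
# a 12-word sentence already admits 58,786 binary bracketings
```

Even a moderately long sentence thus admits tens of thousands of bracketings before any lexical or category ambiguity is taken into account.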

Naturally, alternative structures lead to alternative meanings, as the above examples show, and so structural disambiguation is essential. The problem is exacerbated by ambiguities in the meanings and discourse function even of syntactically unambiguous words and phrases, as discussed below (section 4). But here we just mention some of the structural preference principles that have been employed to achieve at least partial structural disambiguation. First, some psycholinguistic principles that have been suggested are Right Association (RA) (or Late Closure, LC), Minimal Attachment (MA), and Lexical Preference (LP). The following examples illustrate these principles:

(2.1)
(RA) He bought the book that I had selected for Mary.
(Note the preference for attaching for Mary to selected rather than bought.)
(2.2)
(MA?) She carried the groceries for Mary.
(Note the preference for attaching for Mary to carried, rather than groceries, despite RA. The putative MA-effect might actually be an LP-like verb modification preference.)
(2.3)
(LP) She describes men who have worked on farms as cowboys.
(Note the preference for attaching as cowboys to describes, rather than worked.)

Another preference noted in the literature is for parallel structure in coordination, as illustrated by the following examples:

(2.4)
They asked for tea and coffee with sugar.
(Note the preference for the grouping [[tea and coffee] with sugar], despite RA.)
(2.5)
John decided to buy a novel, and Mary, a biography.
(The partially elided conjunct is understood as “Mary decided to buy a biography”.)
(2.6)
John submitted short stories to the editor, and poems too.
(The partially elided conjunct is understood as “submitted poems to the editor too”.)

Finally, the following example serves to illustrate the significance of frequency effects, though such effects are hard to disentangle from semantic biases for any single sentence (improvements in parsing through the use of word and phrase frequencies provide more compelling evidence):

(2.7)
What are the degrees of freedom that an object in space has?
(Note the preference for attaching the relative clause to degrees of freedom, rather than freedom, attributable to the tendency of degree(s) of freedom to occur as a “multiword”.)

3. Semantic representation

Language serves to convey meaning. Therefore the analysis of syntactic structure takes us only partway towards mechanizing that central function, and the merits of particular approaches to syntax hinge on their utility in supporting semantic analysis, and in generating language from the meanings to be communicated.

This is not to say that syntactic analysis is of no value in itself—it can provide useful support in applications such as grammar checking and statistical MT. But for the more ambitious goal of inferring and expressing the meaning of language, an essential requirement is a theory of semantic representation, of how it is related to surface form, and of how it interacts with the representation and use of background knowledge. We will discuss logicist approaches, cognitive science approaches, and (more briefly) emerging statistical approaches to meaning representation.

3.1 Logicist approaches to meaning representation

Most linguistic semanticists, cognitive scientists, and anthropologists would agree that in some sense, language is a mirror of mind. But views diverge concerning how literally or non-literally this tenet should be understood. The most literal understanding, which we will term the logicist view, is the one that regards language itself as a logical meaning representation with a compositional, indexical semantics—at least when we have added brackets as determined by parse trees, and perhaps certain other augmentation (variables, lambda-operators, etc.). In itself, such a view makes no commitments about mental representations, but application of Occam's razor and the presumed co-evolution of thought and language then suggest that mentalese is itself language-like. The common objection that “human thinking is not logical” carries no weight with logicists, because logical meaning representations by no means preclude nondeductive modes of inference (induction, abduction, etc.); nor are logicists impressed by the objection that people quickly forget the exact wording of verbally conveyed information, because both canonicalization of inputs and systematic discarding of all but major entailments can account for such forgetting. Also, assumption of a language-like, logical mentalese certainly does not preclude other modes of representation and thought, such as imagistic ones, and synergistic interaction with such modes (Paivio 1986; Johnston & Williams 2009).

Relating language to logic

Since Richard Montague (see especially Montague 1970, 1973) deserves much of the credit for demonstrating that language can be logically construed, let us reconsider the sentence structure in figure 1 and the corresponding grammar rules and vocabulary, but this time suppressing features, and instead indicating how logical interpretations expressed in (a variant of) Montague's type-theoretic intensional logic can be obtained compositionally. We slightly “twist” Montague's type system so that the possible-world argument always comes last, rather than first, in the denotation of a symbol or expression. For example, a two-place predicate will be of type (e → (e → (s → t))) (successively applying to an entity, another entity, and finally a possible world to yield a truth value), rather than Montague's type (s → (e → (e → t))), where the world argument is first. This dispenses with numerous applications of Montague's intension and extension operators, and also slightly simplifies truth conditions. For simplicity we are also ignoring contextual indices here, and treating nouns and VPs as true or false of individuals, rather than individual concepts (as employed by Montague to account for such sentences as “The temperature is 90 and rising”).

S → NP VP; S′ = NP′(VP′)
VP → V NP; VP′ = λx NP′(λy V′(y)(x))
NP → Det N; NP′ = Det′(N′)
NP → Name; NP′ = Name′

Here primed constituents represent the intensional logic translations of the corresponding constituents. (Or we can think of them as metalinguistic expressions standing for the set-theoretic denotations of the corresponding constituents.) Several points should be noted. First, each phrase structure rule is accompanied by a unique semantic rule (articulated as the rule-to-rule hypothesis by Emmon Bach (1976)), where the denotation of each phrase is fully determined by the denotations of its immediate constituents: the semantics is compositional.

Second, in the S′-rule, the subject is assumed to be a second-order predicate that is applied to the denotation of the VP (a monadic predicate) to yield a sentence intension, whereas we would ordinarily think of the subject-predicate semantics as being the other way around, with the VP-denotation being applied to the subject. But Montague's contention was that his treatment was the proper one, because it allows all types of subjects—pronouns, names, and quantified NPs—to be handled uniformly. In other words, an NP always denotes a second-order property, or (roughly speaking) a set of first-order properties (see also Lewis 1970). So for example, Thetis denotes the set of all properties that Thetis (a certain contextually determined individual with that name) has; (more exactly, in the present formulation Thetis denotes a function from properties to sentence intensions, where the intension obtained for a particular property yields truth in worlds where the entity referred to has that property); some woman denotes the union of all properties possessed by at least one woman; and every woman denotes the set of properties shared by all women. Accordingly, the S′-rule yields a sentence intension that is true at a given world just in case the second-order property denoted by the subject maps the property denoted by the VP to such a truth-yielding intension.

Third, in the VP′-rule, variables x and y are assumed to be of type e (they take basic individuals as values), and the denotation of a transitive verb should be thought of as a function that is applied first to the object, and then to the subject (yielding a function from worlds to truth values—a sentence intension). The lambda-abstractions in the VP′-rule can be understood as ensuring that the object NP, which like any NP denotes a second-order property, is correctly applied to an ordinary property (that of being the love-object of a certain x), and the result is a predicate with respect to the (still open) subject position. The following is an interpreted sample vocabulary:

V → loves; V′ = loves
Det → a; Det′ = λP λQ (∃x [P(x) ∧ Q(x)])
(For comparison:
Det → every; Det′ = λP λQ (∀x [P(x) ⊃ Q(x)]))
N → mortal; N′ = mortal
Name → Thetis; Name′ = λP (P(Thetis))

Note the interpretation of the indefinite determiner (on line 2) as a generalized quantifier—in effect a second-order predicate over two ordinary properties, where these properties have intersecting truth domains. We could have used an atomic symbol for this second-order predicate, but the above way of expanding it shows the relation of the generalized quantifier to the ordinary existential quantifier. Though it is a fairly self-evident matter, we will indicate in section 4.1 how the sentence “Thetis loves a mortal” yields the following representation after some lambda-conversions:

(∃x [mortal(x) ∧ loves(x)(Thetis)]).

(The English sentence also has a generic or habitual reading, “Thetis loves mortals in general”, which we ignore here.) This interpretation has rather a classical look to it, but only because of the reduction from generalized to ordinary quantifiers that we have built into the lexical semantics of the indefinite a in the above rules, instead of using an atomic symbol for it. Montague was particularly interested in dealing satisfactorily with intensional locutions, such as “John seeks a unicorn.” This does not require the existence of a unicorn for its truth—John has a certain relation to the unicorn-property, rather than to an existing unicorn. Montague therefore treated all predicate arguments as intensions; i.e., he rendered “John seeks a unicorn” as

seeks(λQ ∃x [unicorn(x) ∧ Q(x)])(john),

which can be reduced to a version where unicorn is extensionalized to unicorn*:

seeks(λQ ∃x [unicorn*(x) ∧ Q(x)])(john).

But ultimately Montague's treatment of NPs, though it was in a sense the centerpiece of his proposed conception of language-as-logic, was not widely adopted in computational linguistics. This was in part because the latter community was not convinced that an omega-order logic was needed for NL semantics; found the somewhat complex treatment of NPs in various argument positions, and in particular the treatment of scope ambiguities in terms of multiple syntactic analyses, unattractive; and was preoccupied with other semantic issues, such as adequately representing events and their relationships, and developing systematic nominal and verb “ontologies” for broad-coverage NL analysis. Nonetheless, the construal of language as logic left a strong imprint on computational semantics, generally steering the field towards compositional approaches, and in some approaches such as CCG, providing a basis for a syntax tightly coupled to a type-theoretic semantics (Bach et al. 1987; Carpenter 1997).
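The compositional rules and vocabulary above can be run directly if we flatten Montague's intensional types to an extensional toy model (possible worlds omitted). In the sketch below, an NP denotes a second-order predicate—a function from one-place predicates to truth values—exactly as in the S′- and VP′-rules; the entities and facts in the model are invented for illustration.

```python
# Extensional toy-model sketch of the compositional rules in the text
# (worlds suppressed). Model facts are invented.
entities = {'Thetis', 'Peleus', 'Zeus'}
mortals = {'Peleus'}
loves_pairs = {('Thetis', 'Peleus')}                  # (lover, beloved)

V_loves = lambda y: lambda x: (x, y) in loves_pairs   # loves'(y)(x)
N_mortal = lambda x: x in mortals                     # mortal'
Name_Thetis = lambda P: P('Thetis')                   # λP P(Thetis)
Det_a = lambda P: lambda Q: any(P(x) and Q(x) for x in entities)
                                                      # λP λQ ∃x[P(x) ∧ Q(x)]

# NP -> Det N ;  NP' = Det'(N')
NP_obj = Det_a(N_mortal)
# VP -> V NP ;  VP' = λx NP'(λy V'(y)(x))
VP = lambda x: NP_obj(lambda y: V_loves(y)(x))
# S  -> NP VP ;  S' = NP'(VP')
S = Name_Thetis(VP)   # truth value of "Thetis loves a mortal" in this model
```

In this model S comes out true, since Peleus is a mortal whom Thetis loves; the same rule applications mirror the lambda-conversions leading to (∃x [mortal(x) ∧ loves(x)(Thetis)]).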

An alternative to Montague's syntax-based approach to quantifier scope ambiguity is to regard NPs of form Det+N (or strictly, Det+N-bar) as initially unscoped higher-order predicates in an underspecified logical form, to be subsequently “raised” so as to apply to a first-order predicate obtained by lambda-abstraction of the vacated term position. For example, in the sentence “Everyone knows a poem”, with the object existentially interpreted, we would have the underspecified LF

knows⟨a(poem)⟩⟨every(person)⟩

(without reducing determiners to classical quantifiers) and we can now “raise” ⟨a(poem)⟩ to yield

a(poem)(λy knows(y)⟨every(person)⟩),

and then “raise” ⟨every(person)⟩ to yield either

a(poem)(λy every(person)(λx knows(y)(x))),

or

every(person)(λx a(poem)(λy knows(y)(x))).

Thus we obtain a reading according to which there is a poem that everyone knows, and another according to which everyone knows some poem (not necessarily the same one). (More on scope disambiguation will follow in section 4.) A systematic version of this approach, known as Cooper storage (see Barwise & Cooper 1981), represents the meaning of phrases in two parts, namely a sequence of NP-interpretations (as higher-order predicates) and the logical matrix from which the NP-interpretations were extracted.

But one can also take a more conventional approach, where first of all, the use of “curried” (Schönfinkel-Church-Curry) functions in the semantics of predication is avoided in favor of relational interpretations, using lexical semantic formulas such as loves′ = λy λx (loves(x,y)), and second, unscoped NP-interpretations are viewed as unscoped restricted quantifiers (Schubert & Pelletier 1982). Thus the unscoped LF above would be knows(⟨∃ poem⟩, ⟨∀ person⟩), and scoping of quantifiers, along with their restrictors, now involves “raising” quantifiers to take scope over a sentential formula, with simultaneous introduction of variables. The two results corresponding to the two alternative scopings are then

(∃y: poem(y))(∀x: person(x)) knows(x,y),

and

(∀x: person(x))(∃y: poem(y)) knows(x,y).

While this strategy departs from the strict compositionality of Montague Grammar, it achieves results that are often satisfactory for the intended purposes and does so with minimal computational fuss. A related approach to logical form and scope ambiguity enjoying some current popularity is minimal recursion semantics (MRS) (Copestake et al. 2005), which goes even further in fragmenting the meaningful parts of an expression, with the goal of allowing incremental constraint-based assembly of these pieces into unambiguous sentential LFs. Another interesting development is an approach based on continuations, a notion taken from programming language theory (where a continuation is a program execution state as determined by the steps still to be executed after the current instruction). This also allows for a uniform account of the meaning of quantifiers, and provides a handle on such phenomena as “misplaced modifiers”, as in “He had a quick cup of coffee” (Barker 2004).
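The truth-conditional difference between the two scopings of "Everyone knows a poem" can be checked mechanically with restricted quantifiers over a finite model. The model below is invented: it is chosen so that each person knows some poem but no single poem is known by everyone, which makes the two readings come apart.

```python
# Sketch: evaluating the two scopings of "Everyone knows a poem" as
# restricted quantifiers over a toy model (all facts invented).
people = {'p1', 'p2'}
poems = {'iliad', 'odyssey'}
knows = {('p1', 'iliad'), ('p2', 'odyssey')}   # no poem known by all

def exists(domain, body):
    """(∃v: domain(v)) body(v), with the restrictor folded into the domain."""
    return any(body(v) for v in domain)

def forall(domain, body):
    """(∀v: domain(v)) body(v)."""
    return all(body(v) for v in domain)

# (∃y: poem(y))(∀x: person(x)) knows(x,y)  -- wide-scope existential
wide_exists = exists(poems, lambda y: forall(people, lambda x: (x, y) in knows))

# (∀x: person(x))(∃y: poem(y)) knows(x,y)  -- narrow-scope existential
wide_forall = forall(people, lambda x: exists(poems, lambda y: (x, y) in knows))
```

In this model the first reading is false and the second true, confirming that the choice of scoping is not a notational matter but changes truth conditions.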

An important innovation in logical semantics was discourse representation theory (DRT) (Kamp 1981; Heim 1982), aimed at a systematic account of anaphora. In part, the goal was to provide a semantic explanation for (in)accessibility of NPs as referents of anaphoric pronouns, e.g., in contrasting examples such as “John doesn't drive a car; *he owns it,” vs. “John drives a car; he owns it”. More importantly, the goal was to account for the puzzling semantics of sentences involving donkey anaphora, e.g., “If John owns a donkey, he beats it.” Not only is the NP a donkey, the object of the if-clause, accessible as referent of the anaphoric it, contrary to traditional syntactic binding theory (based on the notion of C-command), but furthermore we seem to obtain an interpretation of the type “John beats every donkey that he owns”, which cannot be obtained by “raising” the embedded indefinite a donkey to take scope over the entire sentence. There is also a weaker reading of the type, “If John owns a donkey, he beats a donkey that he owns”, and this reading also is not obtainable via any scope analysis. Kamp and Heim proposed a dynamic process of sentence interpretation in which a discourse representation structure (DRS) is built up incrementally. A DRS consists of a set of discourse referents (variables) and a set of conditions, where these conditions may be simple predications or equations over discourse referents, or certain logical combinations of DRS's (not of conditions). The DRS for the sentence under consideration can be written linearly as

[: [x,y: john(x), donkey(y), owns(x,y)] ⇒ [u,v: he(u), it(v), beats(u,v), u=x, v=y]]

or diagrammed as

[A box with a horizontal line breaking it into two. The top part takes up about one-sixth of the space and is empty. The bottom part contains two other boxes side by side with a double right arrow connecting them. The left box is also divided in two parts horizontally; the top half has ‘x,y’; the bottom half has three lines containing ‘john(x)’, ‘donkey(y)’, and ‘owns(x,y)’ respectively. The right box is also divided in two parts horizontally; the top half has ‘u,v’; the bottom half has five lines containing ‘he(u)’, ‘it(v)’, ‘beats(u,v)’, ‘u=x’, and ‘v=y’ respectively.]
Figure 2: DRS for “If John owns a donkey, he beats it”

Here x, y, u, v are discourse referents introduced by John, a donkey, he, and it, and the equations u=x, v=y represent the result of reference resolution for he and it. Discourse referents in the antecedent of a conditional are accessible in the consequent, and discourse referents in embedding DRSs are accessible in the embedded DRSs. Semantically, the most important idea is that discourse referents are evaluated dynamically. We think of a variable assignment as a state, and this state changes as we evaluate a DRS outside-to-inside, left-to-right. For example (simplifying a bit), the conditional DRS in figure 2 is true (in a given model) if every assignment with domain {x,y} that makes the antecedent true can be extended to an assignment (new state) with domain {x,y,u,v} that makes the consequent true.

On the face of it, DRT is noncompositional (though DRS construction rules are systematically associated with phrase structure rules); but it can be recast in compositional form, still of course with a dynamic semantics. A closely related approach, dynamic predicate logic (DPL), retains the classical quantificational syntax, but in effect treats existential quantification as nondeterministic assignment, and provides an overtly compositional alternative to DRT (Groenendijk & Stokhof 1991). Perhaps surprisingly, the impact of DRT on practical computational linguistics has been quite limited, though it certainly has been and continues to be actively employed in various projects. One reason may be that donkey anaphora rarely occurs in the text corpora most intensively investigated by computational linguists so far (though it is arguably pervasive and extremely important in generic sentences and generic passages, including those found in lexicons or sources such as Common Sense Open Mind—see sections 4.3 and 8.3). Another reason is that reference resolution for non-donkey pronouns (and definite NPs) is readily handled by techniques such as Skolemization of existentials, so that subsequently occurring anaphors can be identified with the Skolem constants introduced earlier. Indeed, it turns out that both explicit and implicit variants of Skolemization, including functional Skolemization, are possible even for donkey anaphora (e.g., in sentences such as “If every man has a gun, many will use it”—see Schubert 2007). Finally, another reason for the limited impact of DRT and other dynamic semantic theories may be precisely that they are dynamic: The evaluation of a formula in general requires its preceding and embedding context, and this interferes with the kind of knowledge modularity (the ability to use any given knowledge item in a variety of different contexts) desirable for inference purposes.
Here it should be noted that straightforward translation procedures from DRT, DPL, and other dynamic theories to static logics exist (e.g., to FOL, for nonintensional versions of the dynamic approaches), but if such a conversion is desirable for practical purposes, then the question arises whether starting with a dynamic representation is at all advantageous.
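The dynamic truth conditions for the donkey-sentence conditional can be sketched directly: the conditional DRS is true iff every assignment to {x, y} verifying the antecedent extends to an assignment to {u, v} verifying the consequent. The model below is invented (John owning and beating two donkeys), and reference resolution (u=x, v=y) is taken as already done.

```python
from itertools import product

# Sketch of dynamic DRS evaluation for "If John owns a donkey, he beats it".
# Model facts are invented.
domain = {'john', 'd1', 'd2'}
john_p = {'john'}
donkey = {'d1', 'd2'}
owns = {('john', 'd1'), ('john', 'd2')}
beats = {('john', 'd1'), ('john', 'd2')}

def assignments(varnames):
    """All assignments of domain individuals to the given variables."""
    for values in product(domain, repeat=len(varnames)):
        yield dict(zip(varnames, values))

def antecedent(g):   # [x,y : john(x), donkey(y), owns(x,y)]
    return g['x'] in john_p and g['y'] in donkey and (g['x'], g['y']) in owns

def consequent(g):   # [u,v : beats(u,v), u=x, v=y]  (pronouns resolved)
    return (g['u'], g['v']) in beats and g['u'] == g['x'] and g['v'] == g['y']

def conditional_drs():
    """True iff every verifying extension of the antecedent state can be
    further extended to verify the consequent."""
    return all(
        any(consequent({**g, **h}) for h in assignments(['u', 'v']))
        for g in assignments(['x', 'y']) if antecedent(g)
    )
```

Note how the universal force over donkeys ("John beats every donkey he owns") falls out of the ∀∃ pattern of the dynamic evaluation, without any quantifier raising.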

Thematic roles and (neo-)Davidsonian representations

A long-standing issue in linguistic semantics has been the theoretical status of thematic roles in the argument structure of verbs and other argument-taking elements of language (e.g., Dowty 1991). The syntactically marked cases found in many languages correspond intuitively to such thematic roles as agent, theme, patient, instrument, recipient, goal, and so on, and in English, too, the sentence subject and object typically correspond respectively to the agent and theme or patient of an action, and other roles may be added as an indirect object or more often as prepositional phrase complements and adjuncts. To give formal expression to these intuitions, many computational linguists decompose verbal (and other) predicates derived from language into a core predicate augmented with explicit binary relations representing thematic roles. For example, the sentence

(3.1)
John kicked the ball to the fence

might be represented (after referent determination) as

∃e (kick(e) ∧ before(e, Now1) ∧ agent(e, John) ∧ theme(e, Ball2) ∧ goal-loc(e, Fence3)),

where e is thought of as the kicking event. Such a representation is called neo-Davidsonian, acknowledging Donald Davidson's advocacy of the view that verbs tacitly introduce existentially quantified events (Davidson 1967a). The prefix neo- indicates that all arguments and adjuncts are represented in terms of thematic roles, which was not part of Davidson's proposal but is developed, for example, in (Parsons 1990). (Parsons attributes the idea of thematic roles to the 4th century BCE Sanskrit grammarian Pāṇini.) One advantage of this style of representation is that it absolves the writer of the interpretive rules from the vexing task of distinguishing verb complements, to be incorporated into the argument structure of the verb, from adjuncts, to be used to add modifying information. For example, it is unclear in (3.1) whether to the fence should be treated as supplying an argument of kick, or whether it merely modifies the action of John kicking the ball. Perhaps most linguists would judge the latter answer to be correct (because an object can be kicked without the intent of propelling it to a goal location), but intuitions are apt to be ambivalent for at least one of a set of verbs such as dribble, kick, maneuver, move and transport.

However, thematic roles also introduce new difficulties. As pointed out by Dowty (1991), thematic roles lack well-defined semantics. For example, while (3.1) clearly involves an animate agent acting causally upon a physical object, and the PP evidently supplies a goal location, it is much less clear what the roles should be in (web-derived) sentences such as (3.2)–(3.4), and what semantic content they would carry:

(3.2)
The surf tossed the loosened stones against our feet.
(3.3)
A large truck in front of him blocked his view of the traffic light.
(3.4)
Police used a sniffer dog to smell the suspect's luggage.

As well, the uniform treatment of complements and adjuncts in terms of thematic relations does not absolve the computational linguist from the task of identifying the subcategorized constituents of verb phrases (and similarly, NPs and APs), so as to guide syntactic and semantic expectations in parsing and interpretation. And these subcategorized constituents correspond closely to the complements of the verb, as distinct from any adjuncts. Nevertheless, thematic role representations are widely used, in part because they mesh well with frame-based knowledge representations for domain knowledge. These are representations that characterize a concept in terms of its type (relating this to supertypes and subtypes in an inheritance hierarchy), and a set of slots (also called attributes or roles) and corresponding values, with type constraints on values. For example, in a purchasing domain, we might have a purchase predicate, perhaps with supertype acquire, subtypes like purchase-in-installments, purchase-on-credit, or purchase-with-cash, and attributes with typed values such as (buyer (a person-or-group)), (seller (a person-or-group)), (item (a thing-or-service)), (price (a monetary-amount)), and perhaps time, place, and other attributes. Thematic roles associated with relevant senses of verbs and nouns such as buy, sell, purchase, acquire, acquisition, take-over, pick up, invest in, splurge on, etc., can easily be mapped to standard slots like those above. This leads into the issue of canonicalization, which we briefly discuss below under a separate heading.
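The fit between neo-Davidsonian role triples and frame slots can be sketched concretely. The event constant, role names, and slot names below are illustrative (the role-to-slot mapping is an invented example, not a standard inventory); the point is only that a flat set of binary role assertions collects naturally into a frame.

```python
# Sketch: a neo-Davidsonian rendering of (3.1) as role triples over an
# event constant, and a toy mapping from thematic roles to frame slots.
# All constants and slot names are illustrative.
event = [
    ('kick', 'e1'),                  # kick(e1)
    ('before', 'e1', 'Now1'),        # tense information
    ('agent', 'e1', 'John'),
    ('theme', 'e1', 'Ball2'),
    ('goal-loc', 'e1', 'Fence3'),
]

ROLE_TO_SLOT = {'agent': 'actor', 'theme': 'object', 'goal-loc': 'destination'}

def to_frame(triples):
    """Collect the role triples of an event into a frame-style dict,
    keeping the core predicate as the frame type."""
    frame = {}
    for t in triples:
        if len(t) == 2:                      # core predicate, e.g. kick(e1)
            frame['type'] = t[0]
        elif t[0] in ROLE_TO_SLOT:           # thematic role -> frame slot
            frame[ROLE_TO_SLOT[t[0]]] = t[2]
    return frame
```

Because every argument and adjunct is a separate conjunct, adding or dropping a modifier (such as the goal-loc triple) never changes the arity of the core predicate—the advantage noted in the text.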

Expressivity issues

A more consequential issue in computational semantics has been the expressivity of the semantic representation employed, with respect to phenomena such as event and temporal reference, nonstandard quantifiers such as most, plurals, modification, modality and other forms of intensionality, and reification. Full discussion of these phenomena would be out of place here, but some commentary on each is warranted, since the process of semantic interpretation and understanding (as well as generation) clearly depends on the expressive devices available in the semantic representation.

Event and situation reference are essential in view of the fact that many sentences seem to describe events or situations, and to qualify and refer to them. For example, in the sentences

(3.5)
Molly barked last night for several minutes. This woke up the neighbors.

the barking event is in effect predicated to have occurred last night and to have lasted for several minutes, and the demonstrative pronoun this evidently refers directly to it; in addition the past tense places the event at some point prior to the time of speech (and would do so even without the temporal adverbials). These temporal and causal relations are readily handled within the Davidsonian (or neo-Davidsonian) framework mentioned above:

(3.5′)
bark(Molly, E) ∧ last-night(E, S) ∧ before(E, S) ∧ duration(E) = minutes(N) ∧ several(N).
cause-to-wake-up(E, Neighbors, E′) ∧ before(E′, S).

However, examples (3.6) and (3.7) suggest that events can beintroduced by negated or quantified formulas, as was originallyproposed by Reichenbach (1947):

(3.6)
No rain fell for a month, and this caused widespread crop failures.
(3.7)
Each superpower imperiled the other with its nuclear arsenal. This situation persisted for decades.

Barwise and Perry (1983) reconceptualized this idea in their Situation Semantics, though this lacks the tight coupling between sentences and events that is arguably needed to capture causal relations expressed in language. Schubert (2000) proposes a solution to this problem in an extension of FOL incorporating an operator that connects situations or events with sentences characterizing them.

Concerning nonstandard quantifiers such as most, we have already sketched the generalized quantifier approach of Montague Grammar, and pointed out the alternative of using restricted quantifiers; an example might be (Most x: dog(x)) friendly(x). Instead of viewing most as a second-order predicate, we can specify its semantics by analogy with classical quantifiers: The sample formula is true (under a given interpretation) just in case a majority of individuals satisfying dog(x) (when used as value of x) also satisfy friendly(x). Quantifying determiners such as few, many, much, almost all, etc., can be treated similarly, though ultimately the problem of vagueness needs to be addressed as well (which of course extends beyond quantifiers to predicates and indeed all aspects of a formal semantic representation). Vague quantifiers, rather than setting rigid quantitative bounds, seem instead to convey probabilistic information, as if a somewhat unreliable measuring instrument had been applied in formulating the quantified claim, and the recipient of the information needs to take this unreliability into account in updating beliefs. Apart from their vagueness, the quantifiers under discussion are not first-order definable (e.g., Landman 1991), so that they cannot be completely axiomatized in FOL. But this does not prevent practical reasoning, either by direct use of such quantifiers in the logical representations of sentences (an approach in the spirit of natural logic), or by reducing them to set-theoretic or mereological relations within an FOL framework.
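The majority-based truth conditions for (Most x: dog(x)) friendly(x) are easy to state over a finite model, even though most is not first-order definable in general. The sketch below uses an invented domain and predicates, and implements "most" simply as "more than half of the restrictor's extension"—one common precisification, not the only one.

```python
# Sketch: "most" as a relation between two finite extensions.
# (Most x: P(x)) Q(x) is taken here to mean that strictly more than half
# of the P's are Q's. Domain and predicates are invented.

def most(restrictor, body, domain):
    satisfiers = [x for x in domain if restrictor(x)]
    return sum(1 for x in satisfiers if body(x)) * 2 > len(satisfiers)

domain = range(10)
dog = lambda x: x < 5                  # dogs: 0..4
friendly = lambda x: x in {0, 1, 2}    # three of the five dogs are friendly

result = most(dog, friendly, domain)   # True: 3 of 5 dogs are friendly
```

Determiners like few or almost all could be given analogous finite-model clauses by changing the threshold, with the caveat noted in the text that their vagueness resists any single sharp cutoff.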

Plurals, as for instance in

(3.8)
People gathered in the town square,

present a problem in that the argument of a predicate can be an entity comprised of multiple basic individuals (those we ordinarily quantify over, and ascribe properties to). Most approaches to this problem employ a plural operator, say, plur, allowing us to map a singular predicate P into a plural predicate plur(P), applicable to collective entities. These collective entities are usually assumed to form a join semilattice with atomic elements (singular entities) that are ordinary individuals (e.g., Scha 1981; Link 1983; Landman 1989, 2000). When an overlap relation is assumed, and when all elements of the semilattice are assumed to have a supremum (completeness), the result is a complete Boolean algebra except for lack of a bottom element (because there is no null entity that is a part of all others). One theoretical issue is the relationship of the semilattice of plural entities to the semilattice of material parts of which entities are constituted. Though there are differences in theoretical details (e.g., Link 1983; Bunt 1985), it is agreed that these semilattices should be aligned in this sense: When we take the join of material parts of which several singular or plural entities are constituted, we should obtain the material parts of the join of those singular or plural entities. Note that while some verbal predicates, such as (intransitive) gather, are applicable only to collections, others, such as ate a pizza, are variously applicable to individuals or collections. Consequently, a sentence such as

(3.9)
The children ate a pizza,

allows for both a collective reading, where the children as a group ate a single pizza, and a distributive reading, where each of the children ate a pizza (presumably a different one!). One way of dealing with such ambiguities in practice is to treat plural NPs as ambiguous between a collection-denoting reading and an “each member of the collection” reading. For example the children in (3.9) would be treated as ambiguous between the collection of children (which is the basic sense of the phrase) and each of the children. This entails that a reading of type each of the people should also be available in (3.8)—but we can assume that this is ruled out because (intransitive) gather requires a collective argument. In a sentence such as

(3.10)
Two poachers caught three aracaris,

we then obtain four readings, based on the two interpretations of each NP. No readings are ruled out, because both catching and being caught can be individual or collective occurrences. Some theorists would posit additional readings, but if these exist, they could be regarded as derivative from readings in which at least one of the terms is collectively interpreted. But what is uncontroversial is that plurals call for an enrichment in the semantic representation language to allow for collections as arguments. In an expression such as plur(child), both the plur operator, which transforms a predicate into another predicate, and the resulting collective predicate, are of nonstandard types.
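A crude finite-model sketch of the plur operator can make its nonstandard type concrete: plur maps a predicate of individuals to a predicate of collections, modelled here as frozensets of two or more individuals. The domain and the child predicate are invented, and the set-based model is a simplification of the join-semilattice structure discussed above (no overlap or material-part structure is represented).

```python
from itertools import combinations

# Sketch: plur maps a singular predicate P to a predicate of collections
# (frozensets of two or more individuals, all satisfying P).
# Domain and predicate are invented; sets crudely stand in for the
# join semilattice of plural entities.
domain = {'a', 'b', 'c'}
child = lambda x: x in {'a', 'b'}

def all_collections(dom):
    """Enumerate the non-atomic collections built from a finite domain."""
    xs = list(dom)
    for r in range(2, len(xs) + 1):
        for c in combinations(xs, r):
            yield frozenset(c)

def plur(P):
    """plur(P): true of a collection iff it is plural and all members are P."""
    return lambda coll: len(coll) >= 2 and all(P(x) for x in coll)

children_collections = [c for c in all_collections(domain) if plur(child)(c)]
```

A collective predicate like gather would then take such frozensets as arguments directly, while a distributive reading quantifies over the members of the collection.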

Modification is a pervasive phenomenon in all languages, as illustrated in the following sentences:

(3.11)
Mary is very smart.
(3.12)
Mary is an international celebrity.
(3.13)
The rebellion failed utterly.

In (3.11), very functions as a predicate modifier, in particular a subsective modifier, since the set of things that are very(P) is a subset of the things that are P. Do we need such modifiers in our logical forms? We could avoid use of a modifier in this case by supposing that smart has a tacit argument for the degree of smartness, where smart(x, d) means that x is smart to degree d; adding that d > T for some threshold T would signify that x is very smart. Other degree adjectives could be handled similarly. However, such a strategy is unavailable for international celebrity in (3.12). International is again subsective (and not intersective—an international celebrity is not something that is both international and a celebrity), and while one can imagine definitions of the particular combination, international celebrity, in an ordinary FOL framework, requiring such definitions to be available for constructing initial logical forms could create formidable barriers to broad-coverage interpretation. (3.13) illustrates a third type of predicate modification, namely VP-modification by an adverb. Note that the modifier cannot plausibly be treated as an implicit predication utter(E) about a Davidsonian event argument of fail. Taken together, the examples indicate the desirability of allowing for monadic-predicate modifiers in a semantic representation. Corroborative evidence is provided in the immediately following discussion.

Intensionality has already been mentioned in connection with Montague Grammar, and there can be no doubt that a semantic representation for natural language needs to capture intensionality in some way. The sentences

(3.14)
John believes that our universe is infinite.
(3.15)
John looked happy.
(3.16)
John designed a starship.
(3.17)
John wore a fake beard.

all involve intensionality. The meaning (and thereby the truth value) of the attitudinal sentence (3.14) depends on the meaning (intension) of the subordinate clause, not just its truth value (extension). The meaning of (3.15) depends on the meaning of happy, but does not require happy to be a property of John or anything else. The meaning of (3.16) does not depend on the actual existence of a starship, but does depend on the meaning of that phrase. And fake beard in (3.17) refers to something other than an actual beard, though its meaning naturally depends on the meaning of beard. A Montagovian analysis certainly would deal handily with such sentences. But again, we may ask how much of the expressive richness of Montague's type theory is really essential for computational linguistics. To begin with, sentences such as (3.14) are expressible in classical modal logics, without committing to higher types. On the other hand, (3.16) resists a classical modal analysis, even more firmly than Montague's “John seeks a unicorn,” for which an approximate classical paraphrase is possible: “John tries (for him) to find a unicorn”. A modest concession to Montague, sufficient to handle (3.15)–(3.17), is to admit intensional predicate modifiers into our representational vocabulary. We can then treat look as a predicate modifier, so that look(happy) is a new predicate derived from the meaning of happy. Similarly we can treat design as a predicate modifier, if we are willing to treat a starship as a predicative phrase, as we would in “The Enterprise is a starship”. And finally, fake is quite naturally viewed as a predicate modifier, though unlike most nominal modifiers, it is not intersective (#John wore something that was a beard and was fake) or even subsective (#John wore a particular kind of beard).
Note that this form of intensionality does not commit us to a higher-order logic—we are not quantifying over predicate extensions or intensions so far, only over individuals (aside from the need to allow for plural entities, as noted). The rather compelling case for intensional predicate modifiers in our semantic vocabulary reinforces the case made above (on the basis of extensional examples) for allowing predicate modification.

Reification, like the phenomena already enumerated, is also pervasive in natural languages. Examples are seen in the following sentences.

(3.18)
Humankind may be on a path to self-destruction.
(3.19)
Snow is white.
(3.20)
Politeness is a virtue.
(3.21)
Driving recklessly is dangerous.
(3.22)
For John to sulk is unusual.
(3.23)
That our universe is infinite is a discredited notion.

(3.18)–(3.21) are all examples of predicate reification. Humankind in (3.18) may be regarded as the name of an abstract kind derived from the nominal predicate human, i.e., with lexical meaning K(human), where K maps predicate intensions to individuals. The status of abstract kinds as individuals is evidenced by the fact that the predicate “be on a path to self-destruction” applies as readily to ordinary individuals as to kinds. The name-like character of the term is apparent from the fact that it cannot readily be premodified by an adjective. The subjects in (3.19) and (3.20) can be similarly analyzed in terms of kinds K(snow) and K(-ness(polite)). (Here -ness is a predicate modifier that transforms the predicate polite, which applies to ordinary (usually human) individuals, into a predicate over quantities of the abstract stuff, politeness.) But in these cases the K operator does not originate in the lexicon, but in a rule pair of type “NP → N, NP′ = K(N′)”. This allows for modification of the nominal predicate before reification, in phrases such as fluffy snow or excessive politeness. The subject of (3.21) might be rendered logically as something like Ka(-ly(reckless)(drive)), where Ka reifies action-predicates, and -ly transforms a monadic predicate intension into a subsective predicate modifier. Finally, (3.22) illustrates a type of sentential-meaning reification, again yielding a kind; but in this case it is a kind of situation—the kind whose instances are characterized by John sulking. Here we can posit a reification operator Ke that maps sentence intensions into kinds of situations. This type of sentential reification needs to be distinguished from that-clause reification, such as appears to be involved in (3.14). We mentioned the possibility of a modal-logic analysis of (3.14), but a predicative analysis, where the predicate applies to a reified sentence intension (a proposition), is actually more plausible, since it allows a uniform treatment of that-clauses in contexts like (3.14) and (3.23).
The use of reification operators is a departure from a strict Montagovian approach, but is plausible if we seek to limit the expressiveness of our semantic representation by taking predicates to be true or false of individuals, rather than of objects of arbitrarily high types, and likewise take quantification to be over individuals in all cases, i.e., to be first-order.

Some computational linguists and AI researchers wish to go much further in avoiding expressive devices outside those of standard first-order logic. One strategy that can be used to deal with intensionality within FOL is to functionalize all predicates, save one or two. For example, we can treat predications, such as that Romeo loves Juliet, as values of functions that “hold” at particular times: Holds(loves(Romeo, Juliet), t). Here loves is regarded as a function that yields a reified property, while Holds (or in some proposals, True), and perhaps equality, are the only predicates in the representation language. Then we can formalize (3.14), for example, without recourse to intensional semantics as

Holds(believes(John, infinite(Universe)), t)

(where t is some specific time). Humankind in (3.18) can perhaps be represented as the set of all humans as a function of time:

∀x∀t[Holds(member(x, Humankind), t) ↔ Holds(human(x), t)],

(presupposing some axiomatization of naïve set theory); and, as one more example, (3.22) might be rendered as

Holds(unusual(sulk(John)), t)

(for some specific time t). However, a difficulty with this strategy is encountered for quantification within intensional contexts, as in the sentence “John believes that every galaxy harbors some life-form.” While we can represent the (implausible) wide-scope reading “For every galaxy there is some life-form such that John believes that the galaxy harbors that life-form,” using the Holds strategy, we cannot readily represent the natural narrow-scope reading because FOL disallows variable-binding operators within functional terms (but see McCarthy 1990). An entirely different approach is to introduce “eventuality” arguments into all predicates, and to regard a predication as providing a fact about the actual world only if the eventuality corresponding to that predication has been asserted to “occur” (Hobbs 2003). The main practical impetus behind such approaches is to be able to exploit existing FOL inference techniques and technology. However, there is at present no reason to believe that any inferences that are easy in FOL are difficult in a meaning representation more nearly aligned with the structure of natural language; on the contrary, recent work in implementing natural logic (MacCartney & Manning 2009) suggests that a large class of obvious inferences can be most readily implemented in syntactically analyzed natural language (modulo some adjustments)—a framework closer to Montagovian semantics than an FOL-based approach.
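The functionalization strategy itself is easy to mechanize. The following Python sketch (the helper names are ours, purely illustrative) represents predications as nested terms, with Holds as the sole predicate applied to reified propositions:

```python
# Sketch of the "functionalize all predicates" strategy: predications
# become terms (nested tuples of the form (functor, arg1, arg2, ...)),
# and Holds is the only genuine predicate.

def term(functor, *args):
    """Build a term as a (functor, args...) tuple."""
    return (functor,) + args

# "John believes that our universe is infinite" at time t, as in (3.14):
belief = term("Holds",
              term("believes", "John", term("infinite", "Universe")),
              "t")

def render(t):
    """Pretty-print a nested term in conventional functional notation."""
    if isinstance(t, tuple):
        functor, *args = t
        return f"{functor}({', '.join(render(a) for a in args)})"
    return str(t)

print(render(belief))
# Holds(believes(John, infinite(Universe)), t)
```

As the surrounding discussion notes, this works for unquantified belief contents; quantification inside the reified term is where the approach runs into trouble, since FOL provides no variable-binding operators within functional terms.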

Canonicalization, thematic roles (again), and primitives

Another important issue has been canonicalization (or normalization): What transformations should be applied to initial logical forms in order to minimize difficulties in making use of linguistically derived information? The uses that should be facilitated by the choice of canonical representation include the interpretation of further texts in the context of previously interpreted text (and general knowledge), as well as inferential question answering and other inference tasks.

We can distinguish two types of canonicalization: logical normalization and conceptual canonicalization. An example of logical normalization in sentential logic and FOL is the conversion to clause form (Skolemized, quantifier-free conjunctive normal form). The rationale is that reducing multiple logically equivalent formulas to a single form reduces the combinatorial complexity of inference. However, full normalization may not be possible in an intensional logic with a “fine-grained” semantics, where for instance a belief that the Earth is round may differ semantically from the belief that the Earth is round and the Moon is either flat or not flat, despite the logical equivalence of those beliefs.

Conceptual canonicalization involves more radical changes: We replace the surface predicates (and perhaps other elements of the representational vocabulary) with canonical terms from a smaller repertoire, and/or decompose them using thematic roles or frame slots. For example, in a geographic domain, we might replace the relations (between countries) is next to, is adjacent to, borders on, is a neighbor of, shares a border with, etc., with a single canonical relation, say borders-on. In the domain of physical, communicative, and mental events, we might go further and decompose predicates into configurations of primitive predicates. For example, we might express “x walks” in the manner of Schank as

∃e, e′(ptrans(e, x, x) ∧ move(e′, x, feet-of(x)) ∧ by-means-of(e′, e)),

where ptrans(e, x, y) is a primitive predicate expressing that event e is a physical transport by agent x of object y, move expresses bodily motion by an agent, and by-means-of expresses the instrumental-action relation between the move event and the ptrans event. As discussed earlier, these multi-argument predicates might be further decomposed, with ptrans(e, x, y) rewritten as ptrans(e) ∧ agent(e, x) ∧ theme(e, y), and so on. As in the case of logical normalization, conceptual canonicalization is intended to simplify inference, and to minimize the need for the axioms on which inference is based.
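Both steps of conceptual canonicalization, mapping surface relations to canonical ones and decomposing predicates via thematic roles, can be sketched as table-driven rewriting. The Python fragment below is a minimal illustration (the dictionary entries and role list are our assumptions for this toy domain, not a proposal from the literature):

```python
# Toy conceptual canonicalization: (1) collapse synonymous surface
# relations to one canonical relation; (2) decompose a multi-argument
# predicate into an event predicate plus thematic-role predications.

CANONICAL = {
    "is next to": "borders-on",
    "is adjacent to": "borders-on",
    "borders on": "borders-on",
    "is a neighbor of": "borders-on",
    "shares a border with": "borders-on",
}

def canonicalize(subj, relation, obj):
    """Map a surface relation to its canonical counterpart, if any."""
    return (CANONICAL.get(relation, relation), subj, obj)

def decompose(pred, event, *fillers):
    """Rewrite pred(e, x, y) as pred(e) & agent(e, x) & theme(e, y)."""
    roles = ["agent", "theme"]
    return [(pred, event)] + [(role, event, f)
                              for role, f in zip(roles, fillers)]

print(canonicalize("France", "shares a border with", "Spain"))
# ('borders-on', 'France', 'Spain')
print(decompose("ptrans", "e", "x", "y"))
# [('ptrans', 'e'), ('agent', 'e', 'x'), ('theme', 'e', 'y')]
```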

A question raised by canonicalization, especially by the stronger versions involving reduction to primitives, is whether significant meaning is lost in this process. For example, the concept of being neighboring countries, unlike mere adjacency, suggests the idea of side-by-side existence of the populations of the countries, in a way that resembles the side-by-side existence of neighbors in a local community. More starkly, reducing the notion of walking to transporting oneself by moving one's feet fails to distinguish walking from running, hopping, skating, and perhaps even bicycling. Therefore it may be preferable to regard conceptual canonicalization as inference of important entailments, rather than as replacement of superficial logical forms by equivalent ones in a more restricted vocabulary. Another argument for the latter position is computational: If we decompose complex actions, such as dining at a restaurant, into constellations of primitive predications, we will need to match the many primitive parts of such constellations even in answering simple questions such as “Did John dine at a restaurant?”. We will comment further on primitives in the context of the following subsection.

3.2 Psychologically motivated approaches to meaning

While many AI researchers have been interested in semantic representation and inference as practical means for achieving linguistic and inferential competence in machines, others have approached these issues from the perspective of modeling human cognition. Prior to the 1980s, computational modeling of NLP and cognition more broadly was pursued almost exclusively within a representationalist paradigm, i.e., one that regarded all intelligent behavior as reducible to symbol manipulation (Newell and Simon's physical symbol systems hypothesis). In the 1980s, connectionist (or neural) models enjoyed a resurgence, and came to be seen by many as rivalling representationalist approaches. We briefly summarize these developments under two subheadings below.

Representationalist approaches

“A physical symbol system has the necessary and sufficient means for general intelligent action.” –Allen Newell and Herbert Simon (1976: 116)

Some of the cognitively motivated researchers working within a representationalist paradigm have been particularly concerned with cognitive architecture, including the associative linkages between concepts, distinctions between types of memories and types of representations (e.g., episodic vs. semantic memory, short-term vs. long-term memory, declarative vs. procedural knowledge), and the observable processing consequences of such architectures, such as sense disambiguation, similarity judgments, and cognitive load as reflected in processing delays. Others have been more concerned with uncovering the actual internal conceptual vocabulary and inference rules that seem to underlie language and thought. M. Ross Quillian's semantic memory model, and models developed by Rumelhart, Norman and Lindsay (Rumelhart et al. 1972; Norman et al. 1975) and by Anderson and Bower (1973) are representative of the former perspective, while Schank and his collaborators (Schank and Colby 1973; Schank and Abelson 1977; Schank and Riesbeck 1981; Dyer 1983) are representative of the latter. A common thread in cognitively motivated theorizing about semantic representation has been the use of graphical semantic memory models, intended to capture direct relations as well as more indirect associations between concepts, as illustrated in Figure 3:

[Figure 3: a fragment of a semantic-memory network for the ambiguous words ‘plant’ and ‘water’. Under ‘plant’, three sense nodes joined by an ‘or’ arc: ‘plant1’, a living structure with leaves that gets food from air, earth, or water; ‘plant2’, an apparatus; and ‘plant3’. Under ‘water’, a sense node ‘supply5’, representing a person supplying water to an object. Nodes are connected by relations labeled isa, prop, subj, and obj.]
Figure 3

This particular example is loosely based on Quillian (1968). Quillian suggested that one of the functions of semantic memory, conceived in this graphical way, was to enable word sense disambiguation through spreading activation. For example, processing of the sentence, “He watered the plants”, would involve activation of the terms water and plant, and this activation would spread to concepts immediately associated with (i.e., directly linked to) those terms, and in turn to the neighbors of those concepts, and so on. The preferred senses of the initially activated terms would be those that led to early “intersection” of activation signals originating from different terms. In particular, the activation signals propagating from sense 1 (the living-plant sense) of plant would reach the concept for the stuff, water, in four steps (along the pathways corresponding to the information that plants may get food from water), and the same concept would be reached in two steps from the term water, used as a verb, whose semantic representation would express the idea of supplying water to some target object. Though the sense of plant as a manufacturing apparatus would probably lead eventually to the water concept as well, the corresponding activation path would be longer, and so the living-plant sense of plant would “win”.
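Spreading activation of this kind is essentially breadth-first search from several sources, with sense preference decided by the earliest meeting point. The following Python sketch runs over a toy graph loosely inspired by Figure 3; the edge set and node names are invented for illustration (so the path lengths differ from Quillian's original counts):

```python
from collections import deque

# Toy semantic network; undirected links are listed in both directions.
GRAPH = {
    "plant1": ["living", "leaf", "get-food", "plant"],
    "plant2": ["apparatus", "plant"],
    "get-food": ["food", "air", "earth", "water-stuff", "plant1"],
    "water-verb": ["supply", "water-stuff"],
    "supply": ["water-verb"],
    "living": ["plant1"], "leaf": ["plant1"],
    "plant": ["plant1", "plant2"], "apparatus": ["plant2"],
    "food": ["get-food"], "air": ["get-food"], "earth": ["get-food"],
    "water-stuff": ["get-food", "water-verb"],
}

def distances(start):
    """Breadth-first spreading of activation from one source node."""
    dist, frontier = {start: 0}, deque([start])
    while frontier:
        node = frontier.popleft()
        for nbr in GRAPH.get(node, []):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                frontier.append(nbr)
    return dist

def intersection_depth(sense, other_term):
    """Earliest combined depth at which the two activation waves meet."""
    d1, d2 = distances(sense), distances(other_term)
    return min(d1[n] + d2[n] for n in set(d1) & set(d2))

# The living-plant sense meets the activation from the verb "water"
# sooner than the factory sense does, so it is preferred:
assert intersection_depth("plant1", "water-verb") < \
       intersection_depth("plant2", "water-verb")
```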

Such conceptual representations have tended to differ from logical ones in several respects. One, as already discussed, has been the emphasis by Schank and various other researchers (e.g., Wilks 1978; Jackendoff 1990) on “deep” (canonical) representations and primitives. An often-cited psychological argument for primitives is the fact that people rather quickly forget the exact wording of what they read or are told, recalling only the “gist”; it is this gist that primitive decomposition is intended to derive. However, this involves a questionable assumption that subtle distinctions between, say, walking to the park, ambling to the park, or traipsing to the park are simply ignored in the interpretive process, and as noted earlier it neglects the possibility that seemingly insignificant semantic details are pruned from memory after a short time, while major entailments are retained for a longer time.

Another common strain in much of the theorizing about conceptual representation has been a certain diffidence concerning logical representations and denotational semantics. The relevant semantics of language is said to be the transduction from linguistic utterances to internal representations, and the relevant semantics of the internal representations is said to be the way they are deployed in understanding and thought. For both the external language and the internal (mentalese) representation, it is said to be irrelevant whether or not the semantic framework provides formal truth conditions for them. The rejection of logical semantics has sometimes been summarized in the dictum that one cannot compute with possible worlds.

However, it seems that any perceived conflict between conceptual semantics and logical semantics can be resolved by noting that these two brands of semantics are quite different enterprises with quite different purposes. Certainly it is entirely appropriate for conceptual semantics to focus on the mapping from language to symbolic structures (in the head, realized ultimately in terms of neural assemblies or circuits of some sort), and on the functioning of these structures in understanding and thought. But logical semantics, as well, has a legitimate role to play, both in considering how words (and larger linguistic expressions) relate to the world and how the symbols and expressions of the internal semantic representation relate to the world. This role is metatheoretic in that the goal is not to posit cognitive entities that can be computationally manipulated, but rather to provide a framework for theorizing about the relationship between the symbols people use, externally in language and internally in their thinking, and the world in which they live. It is surely undeniable that utterances are at least sometimes intended to be understood as claims about things, properties, and relationships in the world, and as such are at least sometimes true or false. It would be hard to understand how language and thought could have evolved as useful means for coping with the world, if they were incapable of capturing truths about it.

Moreover, logical semantics shows how certain syntactic manipulations lead from truths to truths regardless of the specific meanings of the symbols involved in these manipulations (and these notions can be extended to uncertain inference, though this remains only very partially understood). Thus, logical semantics provides a basis for assessing the soundness (or otherwise) of inference rules. While human reasoning as well as reasoning in practical AI systems often needs to resort to unsound methods (abduction, default reasoning, Bayesian inference, analogy, etc.), logical semantics nevertheless provides an essential perspective from which to classify and study the properties of such methods. A strong indication that cognitively motivated conceptual representations of language are reconcilable with logically motivated ones is the fact that all proposed conceptual representations have either borrowed deliberately from logic in the first place (in their use of predication, connectives, set-theoretic notions, and sometimes quantifiers) or can be transformed to logical representations without much difficulty, despite being cognitively motivated.

Connectionist approaches

As noted earlier, the 1980s saw the re-emergence of connectionist computational models within mainstream cognitive science theory (e.g., Feldman and Ballard 1982; Rumelhart and McClelland 1986; Gluck and Rumelhart 1990). We have already briefly characterized connectionist models in our discussion of connectionist parsing. But the connectionist paradigm was viewed as applicable not only to specialized functions, but to a broad range of cognitive tasks including recognizing objects in an image, recognizing speech, understanding language, making inferences, and guiding physical behavior. The emphasis was on learning, realized by adjusting the weights of the unit-to-unit connections in a layered neural network, typically by a back-propagation process that distributes credit or blame for a successful or unsuccessful output to the units involved in producing the output (Rumelhart and McClelland 1986).

From one perspective, the renewal of interest in connectionism and neural modeling was a natural step in the endeavor to elaborate abstract notions of cognitive content and functioning to the point where they can make testable contact with brain theory and neuroscience. But it can also be seen as a paradigm shift, to the extent that the focus on subsymbolic processing began to be linked to a growing skepticism concerning higher-level symbolic processing as models of mind, of the sort associated with earlier semantic network-based and rule-based architectures. For example, Ramsay et al. (1991) argued that the demonstrated capacity of connectionist models to perform cognitively interesting tasks undermined the then-prevailing view of the mind as a physical symbol system. But others have continued to defend the essential role of symbolic processing. For example, Anderson (1983, 1993) contended that while theories of symbolic thought need to be grounded in neurally plausible processing, and while subsymbolic processes are well-suited for exploiting the statistical structure of the environment, nevertheless understanding the interaction of these subsymbolic processes required a theory of representation and behavior at the symbolic level.

What would it mean for the semantic content of an utterance to be represented in a neural network, enabling, for example, inferential question-answering? The anti-representationalist (or “eliminativist”) view would be that no particular structures can be or need to be identified as encoding semantic content. The input modifies the activity of the network and the strengths of various connections in a distributed way, such that the subsequent behavior of the network effectively implements inferential question-answering. However, this leaves entirely open how a network would learn this sort of behavior. The most successful neural net experiments have been aimed at mapping input patterns to class labels or to other very restricted sets of outputs, and they have required numerous labeled examples (e.g., thousands of images labeled with the class of the objects depicted) to learn their task. By contrast, humans excel at “one-shot” learning, and can perform complex tasks based on such learning.

A less radical alternative to the eliminativist position, termed the subsymbolic hypothesis, was proposed by Smolensky (1988), to the effect that mental processing cannot be fully and accurately described in terms of symbol manipulation, requiring instead a description at the level of subsymbolic features, where these features are represented in a distributed way in the network. Such a view does not preclude the possibility that assemblies of units in a connectionist system do in fact encode symbols and more complex entities built out of symbols, such as predications and rules. It merely denies that the behavior engendered by these assemblies can be adequately modelled as symbol manipulation. In fact, much of the neural net research over the past two or three decades has sought to understand how neural nets can encode symbolic information (e.g., see Smolensky et al. 1992; Browne and Sun 2001).

Distributed schemes associate a set of units and their activation states with particular symbols or values. For example, Feldman (2006) proposes that concepts are represented by the activity of a cluster of neurons; triples of such clusters representing a concept, a role, and a filler (value) are linked together by triangle nodes to represent simple attributes of objects. Language understanding is treated as a kind of simulation that maps language onto a more concrete domain of physical action or experience, guided by background knowledge in the form of a temporal Bayesian network.

Global schemes encode symbols in overlapping fashion over all units. One possible global scheme is to view the activation states of the units, with each unit generating a real value between −1 and 1, as propositions: State p entails state q (equivalently, p is at least as specific as q) if the activation qᵢ of each unit i in state q satisfies pᵢ ≤ qᵢ ≤ 0, or qᵢ = 0, or 0 ≤ qᵢ ≤ pᵢ, depending on whether the activation pᵢ of that unit in state p is negative, zero, or positive, respectively. Propositional symbols can then be interpreted in terms of such states, and truth functions in terms of simple max-min operations and sign inversions performed on network states. (See Blutner, 2004; however, Blutner ultimately focuses on a localist scheme in which units represent atomic propositions and connections represent biconditionals.) Holographic neural network schemes (e.g., Manger et al. 1994; Plate 2003) can also be viewed as global; in the simplest cases these use one “giant neuron” that multiplies an input vector whose components are complex numbers by a complex-valued matrix; a component of the resultant complex-valued output vector, written in polar coordinates as re^(iθ), supplies a classification through the value of θ and a confidence level through the value of r. A distinctive characteristic of such networks is their ability to classify or reconstruct patterns from partial or noisy inputs.
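The entailment test for the first global scheme just described can be stated compactly in code. The following Python sketch (the function names are ours) checks the unit-by-unit condition and implements negation as sign inversion of the state:

```python
# Global-scheme entailment between network states p and q, each a list
# of unit activations in [-1, 1]: p entails q iff every activation in q
# lies between 0 and the corresponding activation in p (q is no more
# specific than p).

def entails(p, q):
    """p |= q, checked unit by unit."""
    return all(
        (pi <= qi <= 0) if pi < 0 else
        (qi == 0) if pi == 0 else
        (0 <= qi <= pi)
        for pi, qi in zip(p, q)
    )

def neg(p):
    """Negation as sign inversion of the whole state."""
    return [-pi for pi in p]

# A more committed (more specific) state entails a less committed one,
# but not conversely:
assert entails([0.8, -0.5, 0.0], [0.3, -0.5, 0.0])
assert not entails([0.3, -0.5, 0.0], [0.8, -0.5, 0.0])
```

Conjunction and disjunction would be the corresponding max-min operations on activations; we omit them here since their exact form depends on how contradictory unit values are handled.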

The status of the subsymbolic hypothesis remains an issue for debate and further research. Certainly it is unclear how symbolic approaches could match certain characteristics of neural network approaches, such as their ability to cope with novel instances and their graceful degradation in the face of errors or omissions. On the other hand, some neural network architectures for storing knowledge and performing inference have been shown (or designed) to be closely related to “soft logics” such as fuzzy logic (e.g., Kasabov 1996; Kecman 2001) or “weight-annotated Poole systems” (Blutner 2004), suggesting the possibility that neural network models of cognition may ultimately be characterizable as implementations of such soft logics. Researchers more concerned with practical advances than biologically plausible modeling have also explored the possibility of hybridizing the symbolic and subsymbolic approaches, in order to gain the advantages of both (e.g., Sun 2001). A quite formal example of this, drawing on ideas by Dov Gabbay, is d'Avila Garcez (2004).

Finally, we should comment on the view expressed in some of the cognitive science literature that mental representations of language are primarily imagistic (e.g., Damasio 1994; Humphrey 1992). Certainly there is ample evidence for the reality and significance of mental imagery (Johnson-Laird 1983; Kosslyn 1994). Also, creative thought often seems to rely on visualization, as observed early in the 20th century by Poincaré (1913) and Hadamard (1945). But as was previously noted, symbolic and imagistic representations may well coexist and interact synergistically. Moreover, cognitive scientists who explore the human language faculty in detail, such as Steven Pinker (1994, 2007) or any of the representationalist or connectionist researchers cited above, all seem to reach the conclusion that the content derived from language (and the stuff of thought itself) is in large part symbolic—except in the case of the eliminativists who deny representations altogether. It is not hard to see, however, how raw intuition might lead to the meanings-as-images hypothesis. It appears that vivid consciousness is associated mainly with the visual cortex, especially area V1, which is also crucially involved in mental imagery (e.g., Baars 1997: chapter 6). Consequently it is entirely possible that vast amounts of non-imagistic encoding and processing of language go unnoticed, while any evoked imagistic artifacts become part of our conscious experience. Further, the very act of introspecting on what sort of imagery, if any, is evoked by a given sentence may promote construction of imagery and awareness thereof.

3.3 Statistical semantics

In its broadest sense, statistical semantics is concerned with semantic properties of words, phrases, sentences, and texts, engendered by their distributional characteristics in large text corpora. For example, terms such as cheerful, exuberant, and depressed may be considered semantically similar to the extent that they tend to occur flanked by the same (or in turn similar) nearby words. (For some purposes, such as information retrieval, identifying labels of documents may be used as occurrence contexts.) Through careful distinctions among various occurrence contexts, it may also be possible to factor similarity into more specific relations such as synonymy, entailment, and antonymy. One basic difference between (standard) logical semantic relations and relations based on distributional similarity is that the latter are a matter of degree. Further, the underlying abstractions are very different, in that statistical semantics does not relate strings to the world, but only to their contexts of occurrence (a notion similar to, but narrower than, Wittgenstein's notion of meaning as use). However, statistical semantics does admit elegant formalizations. Various concepts of similarity and other semantic relations can be captured in terms of vector algebra, by viewing the occurrence frequencies of an expression as values of the components of a vector, with the components corresponding to the distinct contexts of occurrence. In this way, one arrives at a notion of semantics based on metrics and operators in vector spaces, where vector operators can mimic Boolean operators in various ways (Gärdenfors 2000; Widdows 2004; Clarke 2012).
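A minimal version of this vector-space view can be sketched in a few lines of Python; the context labels and co-occurrence counts below are invented for illustration:

```python
import math

# Toy co-occurrence matrix: rows are words, columns are contexts of
# occurrence ("mood", "party", "clinic", "smile"); counts are invented.
COUNTS = {
    "cheerful":  [4, 3, 0, 5],
    "exuberant": [3, 4, 0, 4],
    "depressed": [5, 0, 4, 1],
}

def cosine(u, v):
    """Cosine similarity between two occurrence-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# cheerful and exuberant occur in similar contexts, so their vectors
# point in nearly the same direction:
assert cosine(COUNTS["cheerful"], COUNTS["exuberant"]) > \
       cosine(COUNTS["cheerful"], COUNTS["depressed"])
```

Here similarity is graded, as noted above: cheerful and exuberant come out closer than cheerful and depressed, but no pair is simply "synonymous" or "non-synonymous".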

But how does this bear on meaning representation of natural language sentences and texts? In essence, the representation of sentences in statistical semantics consists of the sentences themselves. The idea that sentences can be used directly, in conjunction with distributional knowledge, as objects enabling inference is a rather recent and surprising one, though it was foreshadowed by many years of work on question answering based on large text corpora. The idea has gained traction as a result of recent efforts to devise statistically based algorithms for determining textual entailment, a program pushed forward by a series of Recognizing Textual Entailment (RTE) Challenges initiated in 2005, organized by the PASCAL Network of Excellence, and more recently by the National Institute of Standards and Technology (NIST). Recognizing textual entailment requires judgments as to whether one given linguistic string entails a second one, in a sense of entailment that accords with human intuitions about what a person would naturally infer (with reliance on knowledge about word meanings, general knowledge such as that any person who works for a branch of a company also works for that company, and occasional well-known specific facts). For example, “John is a fluent French speaker” textually entails “John speaks French”, while “The gastronomic capital of France is Lyon” does not entail that “The capital of France is Lyon”. Some examples are intermediate; e.g., “John was born in France” is considered to heighten the probability that John speaks French, without fully entailing it (Glickman and Dagan 2005). Initial results in the annual competitions were poor (not far above the random guessing mark), but have steadily improved, particularly with the injection of some reasoning based on ontologies and on some general axioms about the meanings of words, word classes, relations, and phrasal patterns (e.g., de Salvo Braz et al. 2005).

It is noteworthy that the conception of sentences as meaning representations echoes Montague's contention that language is logic. Of course, Montague understood “sentences” as unambiguous syntactic trees. But research in textual entailment seems to be moving towards a similar conception, as exemplified in the work of Dagan et al. (2008), where statistical entailment relations are based on syntactic trees, and these are generalized to templates that may replace subtrees by typed variables. Also Clarke (2012) proposes a very general vector-algebraic framework for statistical semantics, where “contexts” for sentences might include (multiple) parses and even (multiple) logical forms for the sentences, and where statistical sentence meanings can be built up compositionally from their proper parts. One way of construing degrees of entailment in this framework is in terms of the entailment probabilities relating each possible logical form of the premise sentence to each possible logical form of the hypothesis in question.

3.4 Which semantics in practice?

Having surveyed three rather different brands of semantics, we are left with the question of which of these brands serves best in computational linguistic practice. It should be clear from what has been said above that the choice of semantic “tool” will depend on the computational goals of the practitioner. If the goal, for example, is to create a dialogue-based problem-solving system for circuit fault diagnosis, emergency response, medical contingencies, or vacation planning, then an approach based on logical (or at least symbolic) representations of the dialogue, underlying intentions, and relevant constraints and knowledge is at present the only viable option. Here it is of less importance whether the symbolic representations are based on some presumed logical semantics for language, or some theory of mental representation—as long as they are representations that can be reasoned with. The most important limitations that disqualify subsymbolic and statistical representations of meaning for such purposes are their very limited inferential reach and response capabilities. They provide classifications or one-shot inferences rather than reasoning chains, and they do not generate plans, justifications, or extended linguistic responses. However, both neural net techniques and statistical techniques can help to improve semantic processing in dialogue systems, for example by disambiguating word senses, or recognizing which of several standard plans is being proposed or followed, on the basis of observed utterances or actions.

On the other hand, if the computational goal is to demonstrate human-like performance in a biologically plausible (or biologically valid!) model of some form of language-related behavior, such as learning to apply words correctly to perceived objects or relationships, or learning to judge concept similarity, or to assess the tone (underlying sentiment) of a discourse segment, then symbolic representations need not play any role in the computational modeling. (However, to the extent that language is symbolic, and is a cognitive phenomenon, subsymbolic theories must ultimately explain how language can come about.) In the case of statistical semantics, practical applications such as question answering based on large textual resources, retrieval of documents relevant to a query, or machine translation are at present greatly superior to logical systems that attempt to fully understand both the query or text they are confronted with and the knowledge they bring to bear on the task. But some of the trends pointed out above in trying to link subsymbolic and statistical representations with symbolic ones indicate that a gradual convergence of the various approaches to semantics is taking place.

4. Semantic interpretation

4.1 Mapping syntactic trees to logical forms

For the next few paragraphs, we shall take semantic interpretation to refer to the process of deriving meaning representations from a word stream, taking for granted the operation of a prior or concurrent parsing phase. In other words, we are mapping syntactic trees to logical forms (or whatever our meaning representation may be). Thus, unlike interpretation in the sense of assigning external denotations to symbols, this is a form of “syntactic semantics” (Rapaport 1995).

In the heyday of the proceduralist paradigm, semantic interpretation was typically accomplished with sets of rules that matched patterns to parts of syntactic trees and added to or otherwise modified the semantic representations of input sentences. The completed representations might either express facts to be remembered, or might themselves be executable commands, such as formal queries to a database or high-level instructions placing one block on another in a robot's (simulated or real) world.

When it became clear in the early 1980s, however, that syntactic trees could be mapped to semantic representations by using compositional semantic rules associated with phrase structure rules in one-to-one fashion, this approach became broadly favored over pure proceduralist ones. In our earlier discussion (in section 3.1) of meaning representations within logicist frameworks, we already foreshadowed the essentials of logical form computation. There we saw sample interpretive rules for a small number of phrase structure rules and vocabulary. The semantic rules, such as NP′ = Det′(N′), clearly indicate how logical forms of lower-level constituents should be combined to yield those of higher-level constituents. In the following figure, the sentence “Thetis loves a mortal” has been interpreted by applying the earlier set of lexical and interpretive rules to the nodes of the phrase structure tree in a bottom-up, left-to-right sweep:

[a tree. Parent is S-prime = NP-prime(VP-prime) = ∃z[mortal(z) ∧ loves(z)(Thetis)]. First node is NP-prime = Name-prime = λP P(Thetis), with a node of Name-prime = λP P(Thetis), which has a node Thetis (in bold). Second node is VP-prime = λx NP-prime(λy V-prime(y)(x)) = λx(∃z[mortal(z) ∧ loves(z)(x)]), with first a node of V-prime = loves, which has a node loves (in bold), and second a node of NP-prime = Det-prime(N-prime) = λQ(∃z[mortal(z) ∧ Q(z)]), which itself has a left node of Det-prime = λP λQ(∃z[P(z) ∧ Q(z)]), which has a node a (in bold), and a right node of N-prime = mortal, which has a node mortal (in bold)]
Figure 4: Semantic interpretation of the parse tree of Figure 1

The interpretive rules from section 3.1 are repeated at the tree nodes, and the results of applying the combinatory rules (with lambda-conversions where possible) are shown as well. As can be seen, the Montagovian treatment of NPs as second-order predicates leads to some complications, and these are exacerbated when we try to take account of quantifier scope ambiguity. We mentioned Montague's use of multiple parses, the Cooper-storage approach, and the unscoped-quantifier approach to this issue in section 3.1. In the unscoped-quantifier approach, with a relational interpretation of the verb, the respective interpretations of the leaf nodes (words) in Figure 4 would become Thetis, λyλx(loves(x, y)), λP⟨∃P⟩, and mortal, and S′ at the root would become loves(Thetis, ⟨∃mortal⟩), to be scoped uniquely to (∃x: mortal(x)) loves(Thetis, x). It is easy to see that multiple unscoped quantifiers will give rise to multiple permutations of quantifier order when the quantifiers are brought to the sentence level. Thus we will have multiple readings in sentences such as “Every man loves a certain woman”.
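The bottom-up composition shown in Figure 4 can be mimicked directly with higher-order functions. The sketch below evaluates the resulting logical form against a toy model whose individuals and relations (Thetis, Hector, and who loves whom) are invented for illustration; each lambda mirrors one of the interpretive rules discussed above.

```python
# Tiny model: two individuals, one of whom is mortal and loved by Thetis.
DOMAIN = {"Thetis", "Hector"}
MORTALS = {"Hector"}
LOVES = {("Thetis", "Hector")}  # (lover, beloved) pairs

# Lexical meanings, mirroring the interpretive rules in the text.
name_thetis = lambda P: P("Thetis")                             # Name': λP P(Thetis)
det_a = lambda P: lambda Q: any(P(z) and Q(z) for z in DOMAIN)  # λP λQ ∃z[P(z) ∧ Q(z)]
n_mortal = lambda z: z in MORTALS                               # N': mortal
v_loves = lambda y: lambda x: (x, y) in LOVES                   # V': loves(y)(x), "x loves y"

# Combinatory rules applied bottom-up, as in Figure 4.
np_obj = det_a(n_mortal)                                # NP' = Det'(N')
vp = lambda x: np_obj(lambda y: v_loves(y)(x))          # VP' = λx NP'(λy V'(y)(x))
s = name_thetis(vp)                                     # S' = NP'(VP')
```

Here `s` evaluates to `True` in the toy model, since some individual is both mortal and loved by Thetis; Python's closures play the role of the lambda-conversions carried out symbolically in the figure.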

4.2 Subordinating the role of syntax

At this point we should pause to consider some interpretive methods that do not conform with the above very common, but not universally employed, syntax-driven approach. First, Schank and his collaborators emphasized the role of lexical knowledge, especially primitive actions used in verb decomposition, and knowledge about stereotyped patterns of behavior in the interpretive process, nearly to the exclusion of syntax. For example, a sentence beginning “John went …” would lead to the generation of a ptrans conceptualization (since go is lexically interpreted in terms of ptrans), where John fills the agent role and where a phrase interpretable as a location is expected, as part of the configuration of roles that attach to a ptrans act. If the sentence then continues as “… to the restaurant”, the expectation is confirmed, and at this point instantiation of a restaurant script is triggered, creating expectations about the likely sequence of actions by John and other agents in the restaurant (e.g., Schank and Abelson 1977). These ideas had considerable appeal, and led to unprecedented successes in machine understanding of some paragraph-length stories. Another approach to interpretation that subordinates syntax to semantics is one that employs domain-specific semantic grammars (Brown and Burton 1975). While these resemble context-free syntactic grammars (perhaps procedurally implemented in ATN-like manner), their constituents are chosen to be meaningful in the chosen application domain. For example, an electronics tutoring system might employ categories such as measurement, hypothesis, or transistor instead of NP, and fault-specification or voltage-specification instead of VP. The importance of these approaches lay in their recognition of the fact that knowledge powerfully shapes our ultimate interpretation of text and dialogue, enabling understanding even in the presence of noisy, flawed, and partial linguistic input.
Nonetheless, most of the NL understanding community since the 1970s has treated syntactic parsing as an important aspect of the understanding process, in part because modularization of this complex process is thought to be crucial for scalability, and in part because of the very plausible Chomskian contention that human syntactic intuitions operate reliably even in the absence of clear meaning, as in his famous sentence “Colorless green ideas sleep furiously”.

Statistical NLP has only recently begun to be concerned with deriving interpretations usable for inference and question answering (and as pointed out in the previous subsection, some of the literature in this area assumes that the NL text itself can and should be used as the basis for inference). However, there have been some noteworthy efforts to build statistical semantic parsers that learn to produce LFs after training on a corpus of LF-annotated sentences, or a corpus of questions and answers (or other exchanges) where the learning is “grounded” in a database or other supplementary models. We will mention examples of this type of work, and comment on its prospects, in section 8.

4.3 Coping with semantic ambiguity and underspecification

We noted earlier that language is potentially ambiguous at all levels of syntactic structure, and the same is true of semantic content, even for syntactically unambiguous words, phrases, and sentences. For example, words like bank, recover, and cool have multiple meanings even as members of the same lexical category; nominal compounds such as ice bucket, ice sculpture, olive oil, or baby oil leave unspecified the underlying relation between the nominals (such as constituency or purpose). At the sentential level, even for a determinate parse there may be quantifier scope ambiguities (“Every man admires a certain woman”—Rosa Parks vs. his mother); and habitual and generic sentences often involve temporal/atemporal ambiguities (“Racehorses are (often) skittish”), among others.

Many techniques have been proposed for dealing with the various sorts of semantic ambiguities, ranging from psychologically motivated principles, to knowledge-based methods, heuristics, and statistical approaches. Psychologically motivated principles are exemplified by Quillian's spreading activation model (described earlier) and the use of selectional preferences in word sense disambiguation. For example, in “The job took five hours,” took might be disambiguated to the sense of taking up time because that sense of the verb prefers a temporal complement, and job might be disambiguated to task (rather than, say, occupation) because of the direct associative link between the concept of a task and its time demands. Examples of knowledge-based disambiguation would be the disambiguation of ice sculpture to a constitutive relation based on the knowledge that sculptures may be carved or constructed from solid materials, or the disambiguation of a man with a hat to a wearing-relation based on the knowledge that a hat is normally worn on the head. (The possible meanings may first be narrowed down using heuristics concerning the limited types of relations typically indicated by nominal compounding or by with-modification.) Heuristic principles used in scope disambiguation include island constraints (quantifiers such as every and most cannot expand their scope beyond their local clause) and differing wide-scoping tendencies for different quantifiers (e.g., each is apt to assume wider scope than some). Statistical approaches typically extract various features in the vicinity of an ambiguous word or phrase that are thought to influence the choice to be made, and then make that choice with a classifier that has been trained on an annotated text corpus. The features used might be particular nearby words or their parts of speech or semantic categories, syntactic dependency relations, morphological features, etc.
Such techniques have the advantage of learnability and robustness, but ultimately will require supplementation with knowledge-based techniques. For example, the correct scoping of quantifiers in contrasting sentence pairs such as

(4.1)
Every child at the picnic was roasting a wiener.
(4.2)
Every child at the picnic was watching a hot air balloon overhead,

seems to depend on world knowledge in a way unlikely to be captured as a word-level statistical regularity.
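The classifier-based approach to word sense disambiguation described above can be sketched as a naive Bayes model over bag-of-words context features. The ambiguous word, sense labels, and training sentences below are invented for illustration; real systems train on large sense-annotated corpora and use richer features (parts of speech, dependency relations, and so on).

```python
from collections import Counter, defaultdict
import math

# Hypothetical sense-annotated training data for the ambiguous word "cool".
TRAIN = [
    ("a cool breeze came off the lake", "temperature"),
    ("the water felt cool and refreshing", "temperature"),
    ("that jacket looks really cool", "stylish"),
    ("the band played a cool new song", "stylish"),
]

def train(examples):
    """Estimate per-sense context-word counts and sense priors."""
    counts = defaultdict(Counter)
    priors = Counter()
    for sentence, sense in examples:
        priors[sense] += 1
        counts[sense].update(sentence.split())
    return counts, priors

def classify(sentence, counts, priors):
    """Pick the sense maximizing log P(sense) + sum of log P(word|sense), add-one smoothed."""
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, float("-inf")
    for sense in priors:
        total = sum(counts[sense].values())
        score = math.log(priors[sense])
        for w in sentence.split():
            score += math.log((counts[sense][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = sense, score
    return best

counts, priors = train(TRAIN)
pred = classify("a cool wind blew across the lake", counts, priors)
```

Context words like "lake" push the unseen sentence toward the temperature sense; but, as the contrasting picnic examples show, no amount of such word-level evidence substitutes for world knowledge about scoping.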

Habitual and generic sentences present particularly challenging disambiguation problems, as they may involve temporal/atemporal ambiguities (as noted), and in addition may require augmentation with quantifying adverbs and constraints on quantificational domains missing from the surface form. For example,

(4.3)
Racehorses are (often) skittish when they are purebred

without the quantifying adverb often is unambiguously atemporal, ascribing enduring skittishness to purebred racehorses in general. (Thus in general appears to be the implicit default adverbial.) But when the quantifying adverb is present, the sentence admits both an atemporal reading according to which many purebred racehorses are characteristically skittish, as well as a temporal reading to the effect that purebred racehorses in general are subject to frequent episodes of skittishness. If we replace purebred by at the starting gate, then only the episodic reading of skittish remains available, while often may quantify over racehorses, implying that many are habitually skittish at the starting gate, or it may quantify over starting-gate situations, implying that racehorses in general are often skittish in such situations; furthermore, making formal sense of the phrase at the starting gate evidently depends on knowledge about horse racing scenarios.

The interpretive challenges presented by such sentences are (or should be) of great concern in computational linguistics, since much of people's general knowledge about the world is most naturally expressed in the form of generic and habitual sentences. Systematic ways of interpreting and disambiguating such sentences would immediately provide a way of funneling large amounts of knowledge into formal knowledge bases from sources such as lexicons, encyclopedias, and crowd-sourced collections of generic claims such as those in Open Mind Common Sense (e.g., Singh et al. 2002; Lieberman et al. 2004; Havasi et al. 2007). Many theorists assume that the logical forms of such sentences should be tripartite structures with a quantifier that quantifies over objects or situations, a restrictor that limits the quantificational domain, and a nuclear scope (main clause) that makes an assertion about the elements of the domain (e.g., see Carlson 2011; Cohen 2002; or Carlson & Pelletier 1995). The challenge lies in specifying a mapping from surface structure to such a logical form. While many of the principles underlying the ambiguities illustrated above are reasonably well understood, general interpretive algorithms are still lacking. It appears that such algorithms will involve stepwise elaboration of an initially incomplete, ambiguous logical form, rather than a straightforward syntax-semantics transduction, since the features on which the correct formalization depends transcend syntax: They include ones such as Carlson's individual-level/stage-level distinction among verb phrases and his object-level/kind-level distinction among verb arguments (Carlson 1977, 1982), as well as pragmatic features such as the given/new distinction (influenced by phrasal accent), lexical presuppositions, linguistic context, and background knowledge.

5. Making sense of text

The dividing line between semantic interpretation (computing and disambiguating logical forms) and discourse understanding—making sense of text—is a rather arbitrary one. However, heavily context- and knowledge-dependent aspects of the understanding process, such as resolving anaphora, interpreting context-dependent nominal compounds, filling in “missing” material, determining implicit temporal and causal relationships (among other “coherence relations”), interpreting loose or figurative language, and, certainly, integrating linguistically derived information with preexisting knowledge are generally counted as aspects of discourse processing.

5.1 Dealing with reference and various forms of “missing material”

Language has evolved to convey information as efficiently as possible, and as a result avoids lengthy identifying descriptions and other lengthy phrasings where shorter ones will do. One aspect of this tendency towards “shorthand” is seen in anaphora, the phenomenon of coreference between an earlier, potentially more descriptive NP and a later anaphoric pronoun or definite NP (with a determiner like the or these). (The reverse sequencing, cataphora, is seen occasionally as well.) Coreference is yet another source of ambiguity in language, scarcely noticeable by human language users (except in ambivalent cases such as “When Flight 77 hit the Pentagon's wall, it disintegrated”), but problematic for machines.

Determining the (co)referents of anaphors can be approached in a variety of ways, as in the case of semantic disambiguation. Linguistic and psycholinguistic principles that have been proposed include gender and number agreement of coreferential terms, C-command principles (e.g., an anaphor must be a pronoun if its referent is a sibling of one of its ancestors in the parse tree), (non)reflexive constraints (e.g., the subject and object cannot be coreferential in a simple clause such as “John blamed him for the accident”), recency/salience (more recent/salient referents are preferred), and centering (the most likely term to be pronominalized in an utterance is the “center of attention”). An early heuristic algorithm that employed several features of this type to interpret anaphors was that of Hobbs (1979). But selectional preferences are important as well. For example, in the sentence “He bumped against the sauce boat containing the raspberry syrup, spilling it,” the pronoun can be determined to be coreferential with the raspberry syrup rather than the sauce boat because spill prefers a liquid (or loose aggregate) as its object. With the alternative continuation, “… knocking it over,” the choice of coreferent would be reversed, because knock over prefers something solid and upright as its object. More subtle world knowledge may be involved as well, as in Terry Winograd's well-known example, “The city council refused the women a parade permit because they feared/advocated violence,” where they may refer to the city council or the women, depending on the choice of verb and the corresponding stereotypes that are evoked. Another complication concerns reference to collections of entities, related entities (such as parts), propositions, and events that can become referents of pronouns (such as they, this, and that) or of definite NPs (such as this situation or the door (of the house)) without having appeared explicitly as a noun phrase.
Like other sorts of ambiguity, coreference ambiguity has been tackled with statistical techniques. These typically take into account factors like those mentioned, along with additional features such as antecedent animacy and prior frequency of occurrence, and use these as probabilistic evidence in making a choice of antecedent (e.g., Haghighi & Klein 2010). Parameters of the model are learned from a corpus annotated with coreference relations and the requisite syntactic analyses.
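Such systems combine factors like those above into a score for each candidate antecedent. The following toy scorer (its features and weights are invented; learned models estimate such weights from annotated corpora) applies a hard number-agreement filter plus animacy and recency preferences to Winograd's example.

```python
# Candidate antecedents for the pronoun "they" in the city-council sentence,
# with toy features; "distance" counts mentions back from the pronoun.
CANDIDATES = [
    {"text": "the city council", "number": "plural", "animate": True, "distance": 3},
    {"text": "the women", "number": "plural", "animate": True, "distance": 2},
    {"text": "a parade permit", "number": "singular", "animate": False, "distance": 1},
]

def score(pronoun, cand):
    """Higher is better: hard agreement filter plus soft animacy/recency preferences."""
    if cand["number"] != pronoun["number"]:
        return float("-inf")          # number agreement is obligatory
    s = 0.0
    if pronoun["animate_preferred"] and cand["animate"]:
        s += 1.0                      # pronouns like "they" favor animate referents
    s -= 0.1 * cand["distance"]       # prefer more recent mentions
    return s

they = {"number": "plural", "animate_preferred": True}
best = max(CANDIDATES, key=lambda c: score(they, c))
```

The filter correctly rules out the singular permit, and recency then favors "the women"; but choosing between the council and the women by the verb ("feared" vs. "advocated") requires exactly the world knowledge these surface features cannot supply.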

Coming back briefly to nominal compounds of form N N, note that unlike conventional compounds such as ice bucket or ice sculpture—ones approachable using an enriched lexicon, heuristic rules, or statistical techniques—some compounds can acquire a variety of meanings as a function of context. For example, rabbit guy could refer to entirely different things in a story about a fellow wearing a rabbit suit, or one about a rabbit breeder, or one about large intelligent leporids from outer space. Such examples reveal certain parallels between compound nominal interpretation and anaphora resolution: At least in the more difficult cases, N N interpretation depends on previously seen material, and on having understood crucial aspects of that previous material (in the current example, the concepts of wearing a rabbit suit, being a breeder of rabbits, or being a rabbit-like creature). In other words, N N interpretation, like anaphora resolution, is ultimately knowledge-dependent, whether that knowledge comes from prior text or from a preexisting store of background knowledge. A strong version of this view is seen in the work of Fan et al. (2009), where it is assumed that in technical contexts, even many seemingly conventional compounds require knowledge-based elaboration. For example, in a chemistry context, HCL solution is assumed to require elaboration into something like: solution whose base is a chemical whose basic structural constituents are HCL molecules. Algorithms are provided (and tested empirically) that search for a relational path (subject to certain general constraints) from the modified N to the modifying N, selecting such a relational path as the meaning of the N N compound. As the authors note, this is essentially a spreading-activation algorithm, and they suggest more general application of this method (see section 5.3 on integrated interpretive methods).

While anaphors and certain nominal compounds can be regarded as abbreviated encodings of semantic content, other forms of “shorthand” leave out semantically essential material altogether, requiring the reader or hearer to fill it in. One pervasive phenomenon of this type is of course ellipsis, as illustrated earlier in sentences (2.5) and (2.6), or by the following examples.

(5.1)
Shakespeare made up words, and so can you.
(5.2)
Felix is under more pressure than I am.

In (5.1), so is a place-holder for the VP make up words (in an inverted sentence), and (5.2) tacitly contains a final predicate something like under amount x of pressure, where that amount x needs to be related to the (larger) amount of pressure experienced by Felix. Interpreting ellipsis requires filling in of missing material; this can often be found at the surface level as a sequence of consecutive words (as in the gapping and bare ellipsis examples 2.5 and 2.6), but as seen in (5.1) and (5.2), may instead (or in addition) require adaptation of the imported material to fit semantically into the new context. Further complications arise when the imported material contains referring expressions, as in the following variant of (5.2):

(5.2′)
Felix is under more pressure from his boss than I am.

Here the missing material may refer either to Felix's boss or my boss (called the strict and sloppy reading respectively), a distinction that can be captured by regarding the logical form of the antecedent VP as containing only one, or two, occurrences of the lambda-abstracted subject variable, i.e., schematically,

λx[x is under more pressure from Felix's boss],
versus
λx[x is under more pressure from x's boss].
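These two abstractions can be executed directly; in the sketch below the possessor function and the individuals (with "Speaker" standing in for the referent of I) are toy stand-ins invented for illustration.

```python
# Toy possessor function; values are invented for illustration.
boss = {"Felix": "Felix's boss", "Speaker": "Speaker's boss"}

def strict(x):
    # λx[x is under more pressure from Felix's boss]: one occurrence of x,
    # so the possessor stays fixed no matter who x is.
    return (x, "under-more-pressure-from", boss["Felix"])

def sloppy(x):
    # λx[x is under more pressure from x's boss]: two occurrences of x,
    # so the possessor covaries with the subject.
    return (x, "under-more-pressure-from", boss[x])

strict_reading = strict("Speaker")   # pressure from Felix's boss
sloppy_reading = sloppy("Speaker")   # pressure from the speaker's own boss
```

Applying each abstraction to the elided subject makes the contrast concrete: the single-occurrence form keeps Felix's boss, while the double-occurrence form substitutes the speaker's own boss.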

The two readings can be thought of as resulting respectively from scoping his boss first, then filling in the elided material, and the reverse ordering of these operations (Dalrymple et al. 1991; see also Crouch 1995; Gregory and Lappin 1997). Other challenging forms of ellipsis are event ellipsis, as in (5.3) (where forgot stands for forgot to bring), entirely verbless sentences such as (5.4) and (5.5), and subjectless, verbless ones like (5.6) and (5.7):

(5.3)
I forgot the keys
(5.4)
Hence this proposal
(5.5)
Flights from Rochester to Orlando, May 28?
(5.6)
What a pity.
(5.7)
How about the Delta flight instead?

In applications these and some other forms of ellipsis are handled, where possible, by (a) making strong use of domain-dependent expectations about the types of information and speech acts that are likely to occur in the discourse, such as requests for flight information in an air travel adviser; and (b) interpreting utterances as providing augmentations or modifications of domain-specific knowledge representations built up so far. Corpus-based approaches to ellipsis have so far focused mainly on identifying instances of VP ellipsis in text, and finding the corresponding antecedent material, as problems separate from that of computing correct logical forms (e.g., see Hardt 1997; Nielsen 2004).

Another refractory missing-material phenomenon is that of implicit arguments. For example, in the sentence

(5.8)
Some carbon monoxide leaked into the car from the exhaust, but its concentration was too low to pose any hazard,

the reader needs to conceptually expand its concentration into its concentration in the air in the interior of the car, and hazard into hazard for occupants of the car. In this example, lexical knowledge to the effect that concentration (in the chemical sense) refers to the concentration of some substance in some medium could at least provide the “slots” that need to be filled, and a similar comment applies in the case of hazard. However, not all of the fillers for those slots are made explicitly available by the text—the carbon monoxide referred to provides one of the fillers, but the air in the interior of the car, and potential occupants of the car (and that they, rather than, say, the upholstery, would be at risk) are a matter of inference from world knowledge.

Finally, another form of shorthand that is common in certain contexts is metonymy, where a term saliently related to an intended referent stands for that referent. For example, in an airport context,

(5.9)
Is this flight 574?

might stand for “Is this the departure lounge for flight 574?” Similarly, in appropriate contexts cherry can stand for cherry ice cream, and BMW can stand for BMW's stock market index:

(5.10)
I'd like two scoops of cherry.
(5.11)
BMW rose 4 points.

Like other types of underspecification, metonymy has been approached both from knowledge-based and corpus-based perspectives. Knowledge that can be brought to bear includes selectional preferences (e.g., companies in general do not literally rise), lexical concept hierarchies (e.g., as provided by WordNet), generic knowledge about the types of metonymy relations commonly encountered, such as part-for-whole, place-for-event, object-for-user, producer-for-product, etc. (Lakoff and Johnson 1980), rules for when to conjecture such relations (e.g., Weischedel and Sondheimer 1983), named-entity knowledge, and knowledge about related entities (e.g., a company may have a stock market index, and this index may rise or fall) (e.g., Bouaud et al. 1996; Onyshkevych 1998). Corpus-based methods (e.g., see Markert and Nissim 2007) often employ many of these knowledge resources, along with linguistic and statistical features such as POS tags, dependency paths, and collocations in the vicinity of a potential metonym. As for other facets of the interpretive process (including parsing), use of deep domain knowledge for metonym processing can be quite effective in sufficiently narrow domains, while corpus-based, shallow methods scale better to broader domains, but are apt to reach a performance plateau falling well short of human standards.
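The interplay of selectional preferences with related-entity knowledge can be sketched as a simple coercion rule, in the spirit of example (5.11). The type lexicon, preference sets, and relatedness links below are invented placeholders, not an actual knowledge base.

```python
# Toy selectional-preference check that triggers a metonymic coercion.
TYPES = {
    "BMW": "company",
    "BMW_stock_index": "index",
    "the balloon": "physical-object",
}
RELATED = {"BMW": "BMW_stock_index"}              # company -> its stock index
PREFERS = {"rise": {"index", "physical-object"}}  # types that may literally "rise"

def resolve(subject, verb):
    """Return the literal subject if it satisfies the verb's selectional
    preference; otherwise try to coerce it to a related entity that does."""
    if TYPES[subject] in PREFERS[verb]:
        return subject
    alt = RELATED.get(subject)
    if alt and TYPES[alt] in PREFERS[verb]:
        return alt
    return subject

literal = resolve("the balloon", "rise")   # fits the preference, no coercion
coerced = resolve("BMW", "rise")           # company cannot literally rise
```

A balloon satisfies the verb's preference directly, while "BMW rose 4 points" violates it and is coerced to the related stock index—a miniature version of the preference-plus-related-entity reasoning described above.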

5.2 Making connections

Text and spoken language do not consist of isolated sentences, but of connected, interrelated utterances, forming a coherent whole—typically, a temporally and causally structured narrative, a systematic description or explanation, a sequence of instructions, or a structured argument for a conclusion (or in dialogue, as discussed later, question-answer exchanges, requests followed by acknowledgments, mixed-initiative planning, etc.).

This structure is already apparent at the level of pairs of consecutive clauses, such as

(5.12)
John looked out at the sky. It was dark with thunderclouds.
(5.13)
John looked out at the sky, and decided to take along his umbrella.

In (5.12), we understand that John's looking at the sky temporally overlaps the presence of the dark clouds in the sky (i.e., the dark-cloud situation at least contains the end of the looking event). At a deeper level, we also understand that John perceived the sky to be dark with thunderclouds, and naturally assume that John took the clouds to be harbingers of an impending storm, as we ourselves would. In (5.13), the two clauses appear to report successive events and furthermore, the first event is understood to have led causally to the second—John's decision was driven by whatever he saw upon looking at the sky; and based on our knowledge of weather and the function of umbrellas, and the fact that “everyone” possesses that knowledge, we further infer that John perceived potential rainclouds, and intended to fend off any rain with his umbrella in an imminent excursion.

The examples show that interpreting extended multi-clausal discourses depends on both narrative conventions and world knowledge (similarly for descriptive, instructional, or argumentative text). In particular, an action sentence followed by a static observation, as in (5.12), typically suggests the kind of action-situation overlap we noted, and successively reported actions or events, as in (5.13), typically suggest temporal sequencing and perhaps a causal connection, especially if one of the two clauses is not a volitional action. These suggestive inferences presumably reflect the narrator's adherence to a Gricean principle of orderliness, though such an observation is little help from a computational perspective. The concrete task is to formulate coherence principles for narrative and other forms of discourse and to elucidate, in a usable form, the particular syntactico-semantic properties at various levels of granularity that contribute to coherence.

Thus various types of rhetorical or coherence relations (between clauses or larger discourse segments) have been proposed in the literature, e.g., by Hobbs (1979), Grosz & Sidner (1986), and Mann & Thompson (1988). Proposed coherence relations are ones like elaboration, exemplification, parallelism, and contrast. We defer further discussion of rhetorical structure to section 6 (on language generation).

5.3 Dealing with figurative language

“[I'm] behind the eight ball, ahead of the curve, riding the wave, dodging the bullet and pushing the envelope. I'm on point, on task, on message and off drugs… I'm in the moment, on the edge, over the top and under the radar. A high-concept, low-profile, medium-range ballistic missionary.” –George Carlin (“Life is worth losing”, first broadcast on HBO, November 5, 2005)

We have already commented on processing metonymy, which is conventionally counted as a figure of speech—a word or phrase standing for something other than its literal meaning. However, while metonymy is essentially an abridging device, other figurative modes, such as metaphor, simile, idioms, irony, personification, or hyperbole (overstatement) convey meanings, especially connotative ones, not easily conveyed in other ways. We focus on metaphor here, as it is in a sense a more general form of several other tropes. Moreover, it has received the most attention from computational linguists, because the argument can be made that metaphor pervades language, with no sharp demarcation between literal and metaphorical usage (e.g., Wilks 1978; Carbonell 1980; Lakoff and Johnson 1980; Barnden 2006). For example, while “The temperature dropped” can be viewed as involving a sense of drop that is synonymous with decrease, it can also be viewed as a conventional metaphor comparing the decreasing temperature to a falling object. As a way of allowing for examples of this type, Wilks offered a processing paradigm in which selectional constraints (such as a physical-object constraint on the subject of drop) are treated as mere preferences rather than firm requirements.
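The preference-based treatment of selectional constraints can be illustrated with a minimal sketch. The sense inventory, feature sets, and glosses below are invented for illustration and are not Wilks' own system; the point is only that constraint satisfaction contributes to a score rather than acting as a hard filter, so a violating (potentially metaphorical) reading is dispreferred but never ruled out.

```python
# Toy preference semantics: selectional constraints are soft preferences.
# Sense inventory and feature assignments are illustrative assumptions.

SENSES = {
    "drop": [
        # each sense lists the semantic features it prefers in its subject
        {"gloss": "fall (of a physical object)", "subject_prefers": {"physical-object"}},
        {"gloss": "decrease (of a scalar quantity)", "subject_prefers": {"scalar-quantity"}},
    ],
}

FEATURES = {
    "temperature": {"scalar-quantity"},
    "vase": {"physical-object"},
}

def score(sense, subject):
    """Count satisfied preferences; violations lower the score but never veto a sense."""
    return len(sense["subject_prefers"] & FEATURES.get(subject, set()))

def best_sense(verb, subject):
    return max(SENSES[verb], key=lambda s: score(s, subject))

print(best_sense("drop", "temperature")["gloss"])  # decrease (of a scalar quantity)
print(best_sense("drop", "vase")["gloss"])         # fall (of a physical object)
```

With an unexpected subject (one whose features satisfy no sense's preferences), the maximization still returns some sense, which is exactly what leaves room for a metaphorical reading to be pursued rather than rejected.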

However, processing metaphor requires more than relaxation of preferences; it is both context-dependent and profoundly knowledge-dependent. For example,

(5.14)
He threw in the towel

can be a literal description of a mundane act in a laundromat setting, a literal description of a symbolic act by a boxer's handler, or a stock metaphor for conceding defeat in any difficult endeavor. But to grasp the metaphorical meaning fully, including the connotation of a punishing, doomed struggle, requires a vivid conception of what a boxing match is like.

In approaching metaphor computationally, some authors, e.g., Dedre Gentner (see Falkenhainer et al. 1989), have viewed it as depending on shared properties and relational structure (while allowing for discordant ones), directly attached to the concepts being compared. For example, in comparing an atom to the solar system, we observe a revolves-around relation between electrons and the nucleus on the one hand and between planets and the sun on the other. But others have pointed out that the implicit comparison may hinge on properties reached only indirectly. In this view, finding a metaphor for a concept is a process of moving away from the original concept in a knowledge network in a series of steps, each of which transforms some current characteristic into a related one. This is the process termed “slippage” by Hofstadter et al. (1995). Others (e.g., Martin 1990, drawing on Lakoff and Johnson 1980) emphasize preexisting knowledge of conventional ways of bridging metaphorically from one concept to another, such as casting a nonliving thing as a living thing.
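The structure-mapping idea can be conveyed with a toy sketch in the spirit of Falkenhainer et al.'s SME, using the atom/solar-system example above. A candidate correspondence between two domains is scored by how many relational triples it places in alignment; here the mapping is simply given, whereas SME itself constructs and ranks mappings, and the relation names are invented for illustration.

```python
# Toy structural scoring for an analogy: relations are (predicate, arg1, arg2)
# triples, and a mapping of base entities to target entities is scored by
# counting base relations whose images hold in the target domain.

ATOM = {("revolves-around", "electron", "nucleus"),
        ("more-massive-than", "nucleus", "electron")}
SOLAR = {("revolves-around", "planet", "sun"),
         ("more-massive-than", "sun", "planet"),
         ("hotter-than", "sun", "planet")}

def structural_score(base, target, mapping):
    """Count base relations whose images under the mapping hold in the target."""
    return sum((rel, mapping[a], mapping[b]) in target
               for rel, a, b in base if a in mapping and b in mapping)

m = {"electron": "planet", "nucleus": "sun"}
print(structural_score(ATOM, SOLAR, m))  # 2: both shared relations align
```

Note that the unmatched target relation (hotter-than) is simply tolerated, reflecting the allowance for discordant structure mentioned above.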

In view of the dependence of metaphor on context and extensive knowledge, and the myriad difficulties still confronting all aspects of language understanding, it is not surprising that no general system for processing metaphor in context exists, let alone for using metaphor creatively. Still, Martin's MIDAS program was able to interpret a variety of metaphors in the context of a language-based Unix adviser, relying on knowledge about the domain and about metaphorical mapping, hand-coded in the KODIAK knowledge representation language. Also, several other programs have demonstrated a capacity to analyze or generate various examples of metaphors, including the Structure Mapping Engine (SME) (Falkenhainer et al. 1989), Met* (Fass 1991), ATT-Meta (Barnden 2001), KARMA (Narayanan 1997) and others. More recently, Veale and Hao (2008) undertook an empirical study of a slippage-based approach to metaphor, using attributes collected from WordNet and the web. In a similar spirit, but taking SME as his inspiration, Turney (2008) implemented a “Latent Relation Mapping Engine” (LRME) to find the best mapping between the elements of two potentially comparable descriptions (of equal size); the idea is to use web-based co-occurrence statistics to gauge not only the attribute similarity of any two given concepts (such as electron and planet) but also the relational similarity of any two given pairs of concepts (such as electron:nucleus and planet:sun), using these as metrics in optimizing the mapping.

5.4 Integrated methods

Obviously, the many forms of syntactic, semantic, and pragmatic ambiguity and underspecification enumerated in the preceding sections interact with one another and with world knowledge. For example, word sense disambiguation, reference resolution, and metaphor interpretation are interdependent in the sentence

(5.15)
The Nebraska Supreme Court threw out the chair because it deemed electrocution to be a cruel form of capital punishment.

Note first of all that it could refer syntactically to the Nebraska Supreme Court or to the chair, but world knowledge rules out the possibility of a neuter-gendered chair manifesting a mental attitude. Note as well that if it is replaced by he, then the chair is reinterpreted as a person and becomes the referent of the pronoun; at the same time, threw out is then reconstrued as a metaphor meaning removed from office (with an implication of ruthlessness).

Thus it seems essential to find a uniform framework for jointly resolving all forms of ambiguity and underspecification, at least to the extent that their resolution impacts inference. Some frameworks that have been proposed are weighted abduction, constraint solving, and “loose-speak” interpretation. Weighted abduction (Hobbs et al. 1993) is based on the idea that the task of the hearer or reader is to explain the word sequence comprising a sentence by viewing the meaning of that sentence as a logical consequence of general and contextual knowledge along with some assumptions, to be kept as “lightweight” as possible. The constraint-solving approach views syntax, semantics, pragmatics, context, and world knowledge as supplying constraints on interpretations that need to be satisfied simultaneously. Often constraints are treated as defeasible, or ranked, in which case the goal is to minimize constraint violations, particularly of relatively strong constraints. (There is a connection here to Optimality Theory in cognitive language modeling.) Loose-speak interpretation (Fan et al. 2009, cited previously in connection with nominal compound interpretation) sets aside syntactic ambiguity but tries to deal with pervasive semantic looseness in the meanings of nominal compounds, metonymy, and other linguistic devices. It does so by expanding semantic triples (from the preliminary logical form) of type ⟨Class1, relation, Class2⟩, where the relation cannot properly relate the specified classes, into longer chains containing that relation and terminating at those classes. Finding such chains depends on background knowledge about the relations that are possible in the domain of interest.
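The defeasible-constraint view can be made concrete with a small sketch, loosely modeled on the pronoun in example (5.15). The constraints and their weights below are invented for illustration; the general pattern is that each candidate interpretation accumulates a cost from the weighted constraints it violates, and the minimum-cost candidate wins.

```python
# Toy ranked-constraint resolution: pick the interpretation that minimizes
# the total weight of violated constraints. Weights are illustrative.

# candidate referents for "it" in "The court threw out the chair because it deemed ..."
CANDIDATES = ["the court", "the chair"]

CONSTRAINTS = [
    # (weight, test) -- test returns True when the constraint is VIOLATED
    (1.0, lambda ref: ref == "the court"),   # weak syntactic preference for the nearer NP
    (5.0, lambda ref: ref == "the chair"),   # strong world-knowledge constraint:
                                             # an inanimate chair cannot "deem"
]

def violation_cost(referent):
    return sum(w for w, violated in CONSTRAINTS if violated(referent))

best = min(CANDIDATES, key=violation_cost)
print(best)  # the court
```

The weaker syntactic constraint is overridden by the stronger world-knowledge constraint, mirroring the discussion above; in Optimality-Theoretic terms, the ranking rather than the mere count of violations decides the outcome.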

Prospects

The methods just mentioned have been applied in restricted tasks, but have not solved the problem of comprehensive language interpretation. They all face efficiency problems, and—since they all depend on a rich base of linguistic and world knowledge—the knowledge acquisition bottleneck. We comment on the efficiency issue here, but leave the discussion of knowledge acquisition to section 8.

In view of the speed with which people disambiguate and comprehend language, one may surmise that these processes more closely resemble fitting the observed texts or utterances to familiar patterns than solving complex inference or constraint satisfaction problems. For example, in a sentence such as

(5.16)
He dropped the cutting board on the glass and it didn't break,

the pronoun is understood to refer to the glass, even though world knowledge would predict that the glass broke, whereas the cutting board did not. (The idea that communication focuses on the unexpected is of no help here, because the referent remains unchanged if we change didn't break to broke.) This would be expected if the interpretive process simply found a match between the concept of a fragile object breaking and the glass breaking (regardless of the exact logical structure), and used that match in choosing a referent. The processing of the example from Winograd, in section 5.1, concerning refusal of a parade permit to a group of women, may similarly depend in part on the familiarity of the idea that people who (seek to) parade may advocate some cause. Note that in

(5.17)
The city council granted the women a parade permit because they did not advocate violence

the women are still preferred as the referent of they, even though it is generally true that stalwarts of society, such as city councillors, do not advocate violence.

If disambiguation and (at least preliminary) interpretation in language understanding turn out to be processes guided more by learned patterns than by formalized knowledge, then methods similar to those used in feature-based statistical NLP may be applicable to their effective mechanization.

6. Language generation

Because language generation is a purposeful activity motivated by internal goals, it is difficult to draw a boundary between goal-directed thinking and the ensuing production of linguistic output. Often the process is divided into content planning, microplanning, surface realization, and physical presentation. While the last three of these stages can be thought of as operating (in the order stated) on relatively small chunks of information (e.g., resulting in one or two sentences or other utterance types), content planning is often regarded as a continual process of goal-directed communicative planning, which successively hands over small clusters of ideas to the remaining stages for actual generation when appropriate. We discuss the latter transduction first, in order to highlight its relationship to understanding.

The transduction of a small set of internally represented ideas into written or oral text is in an obvious sense the inverse of the understanding process, as we have outlined it in the preceding sections 2–5. In other words, starting from a few internally represented ideas to be communicated, we need to proceed to an orderly linear arrangement of these ideas, to a surface-oriented logical form that is concise and nonrepetitive and appropriately indexical (e.g., in the use of I, you, here, now, and referring expressions), and finally to an actual surface realization and physical presentation of that realization as spoken or written text. Most or all of the kinds of knowledge involved in understanding naturally come into play in generation as well—whether the knowledge is about the structure of words and phrases, about the relationship between structure and meaning representation, about conventional (or creative) ways of phrasing ideas, about discourse structure and relations, or about the world.

Despite this inverse relationship, language generation has traditionally received less attention from computational linguists than language understanding, because if the content to be verbalized is available in an unambiguous, formal representation, standard output templates can often be used for generation, at least in sufficiently narrow domains. Even for unrestricted domains, the transduction from an explicit, unambiguous internal semantic representation to a word sequence is much less problematic than reconstruction of an explicit semantic representation from an ambiguous word sequence. A similar asymmetry holds at the level of speech recognition and generation, accounting for the fact that serviceable speech generators (e.g., reading machines for the blind) have been available much longer (since about 1976) than serviceable speech recognizers (appearing around 1999).

The microplanning process that leads from a few internal ideas to a surface-oriented, indexical representation typically starts by identifying “chunks” expected to be verbalized as particular types of syntactic constituents, such as NPs, VPs or PPs. This is often followed by making choices of more surface-oriented concepts (or directly, lexical heads) in terms of which the chunks will be expressed. The nature of these processes depends very much on the internal representation. For example, if the representation is framed in terms of very abstract primitives, thematic relations, and attribute-value descriptions of entities, then the chunks might be sets of thematic relations centered around an action, and sets of salient attributes of entities to be referred to. If the internal representation is instead more language-like, then chunks will be relatively small, often single propositions, and the lexical choice process will resemble internal paraphrasing of the logical forms. If more than one idea is being verbalized, ordering decisions will need to be made. For example, reporting that a bandit brandishing a gun entered a local liquor store might be reported in that order, or as “A bandit entered a local liquor store, brandishing a gun”. In dialogue, if a contribution involves both supplying and requesting information, the request should come last. In other cases transformations to more concise forms may be needed to bring the represented ideas stylistically closer to surface form. For example, from logical forms expressing that John ate a burrito containing chicken meat and Mary ate a burrito containing chicken meat, a more compact surface-oriented LF might be generated to the effect that John and Mary each had a chicken burrito. More subtle stylistic choices might be made as well—for example, in casual discourse, eating might be described as polishing off (assuming that the internal representation allows for such surface-oriented distinctions). Furthermore, as already mentioned, a surface-oriented LF needs to introduce contextually appropriate referring expressions such as definite descriptions and pronouns, in conformity with pragmatic constraints on the usage of such expressions.
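One microplanning transformation, aggregation of propositions that differ only in one argument, can be sketched in miniature using the chicken-burrito example. The triple-based proposition format below is an invented simplification; real microplanners operate over richer logical forms.

```python
# Toy aggregation step: (agent, predicate, object) triples sharing a
# predicate and object are merged into one surface-oriented message.

from collections import defaultdict

def aggregate(propositions):
    """Merge triples that differ only in their agent into a joint message."""
    groups = defaultdict(list)
    for agent, pred, obj in propositions:
        groups[(pred, obj)].append(agent)
    messages = []
    for (pred, obj), agents in groups.items():
        if len(agents) > 1:
            messages.append((" and ".join(agents) + " each", pred, obj))
        else:
            messages.append((agents[0], pred, obj))
    return messages

props = [("John", "ate", "a chicken burrito"),
         ("Mary", "ate", "a chicken burrito")]
print(aggregate(props))  # [('John and Mary each', 'ate', 'a chicken burrito')]
```

A fuller microplanner would also decide when aggregation is stylistically appropriate, rather than applying it unconditionally as here.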

The above outline is simplified in that it neglects certain subtleties of discourse and context. A sentence or other utterance type generally involves both new and old (given, presupposed) information, and furthermore, some of the concepts involved may be more strongly focused than others. For example, in the sentence “The item you ordered costs ninety dollars, not nine,” the existence and identity of the item, and the fact that it was ordered, are presumed to be part of the common ground in the current context (old), and so is the addressee's belief that the cost of the item is $9; only the corrected cost of $90 is new information. The emphasis on ninety draws attention to the part of the presumed belief that is being corrected. Thus it is not only conceptual content that needs to be available to the microplanning stage, but also indications as to what is new and what is old, and what aspects are to be stressed or focused. The planner needs at least to apply knowledge about the phrasing of new and old information (e.g., the use of indefinite vs. definite NPs), about the lexical presuppositions and implicatures of the items used (e.g., that succeeding presupposes trying, that regretting that φ presupposes that φ, or that some implicates not all), about the presuppositions of stress patterns, and about focusing devices (such as stress and topicalization). The effect of applying these sorts of pragmatic knowledge will be to appropriately configure the surface-oriented LFs that are handed off to the surface realizer, or, for pragmatic features that cannot be incorporated into the LFs, to annotate the LFs with these features (e.g., stress marking).

The penultimate step is surface realization, using knowledge about the syntax-semantics interface. In systems for restricted domains, this knowledge might consist of heuristic rules and templates (perhaps tree schemas) for verbalizing LFs. More broadly-aimed generators might make use of invertible grammars, ones formulated in rule-to-rule fashion and allowing transduction from logical forms to surface forms in much the same way as the “forward” transduction from surface phrase structure to logical form. Sophisticated generators also need to take account of pragmatic annotations, such as stress, mentioned above. Finally, the linguistically expressed content needs to be physically presented as spoken or written text, with due attention to articulation, stress, and intonation, or, for written text, punctuation, capitalization, choice of a or an as indefinite article, italics, line breaks, and so on.
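The simplest of the realization options mentioned, template-based realization for a restricted domain, can be sketched as follows. The LF notation (a predicate paired with named arguments) and the templates are invented for illustration; an invertible-grammar realizer would instead share its rules with the parser.

```python
# Toy template-based surface realization: each LF predicate is paired
# with a surface template whose slots are filled from the LF's arguments.

TEMPLATES = {
    "cost": "{item} costs {amount}.",
    "order": "{agent} ordered {item}.",
}

def realize(lf):
    """Fill the template associated with the LF's predicate."""
    pred, args = lf
    return TEMPLATES[pred].format(**args)

print(realize(("cost", {"item": "The item", "amount": "ninety dollars"})))
# The item costs ninety dollars.
```

The brittleness of this scheme (one fixed string per predicate, no agreement, stress, or referring-expression handling) is precisely why broader-coverage generators turn to grammar-based realization.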

Returning now to content planning, this process may be expansive or very limited in scope; for example, it may be aimed at providing details of a complex object, set of events, or argument, or it may seek to present no more than a single fact, greeting or acknowledgment. We leave discussion of the strongly interactive type of content planning needed for dialogue to the following section, while we comment here on the more expansive sorts of text generation. In this case the organization of the information to be presented is the central concern. For example, the events of a narrative or the steps of a procedure would generally be arranged chronologically; arguments for a conclusion might be arranged so that any subarguments intended to buttress the presumptions of that step (immediately) precede it; descriptions of objects might proceed from major characteristics and parts to details.

An early method of producing well-organized, paragraph-length descriptions and comparisons was pioneered in the TEXT system of McKeown (1985), which used ATN-like organizational schemas to sequence sections of the description of a type of object, such as the object's more general type, its major parts and distinctive attributes, and illustrative examples. Later work by Hovy (1988) and Moore and Paris (1988) tied content planning more closely to communicative goals by relying on rhetorical structure theory (RST) (Mann and Thompson 1987). RST posits over 20 possible coherence relations between spans of text (usually adjacent). For example, a statement of some claim may be followed by putative evidence for the claim, thus establishing an evidence relation between the claim (called the nucleus, because it is the main point) and the cited evidence (called a satellite because of its subsidiary role). Another example is the volitional result relation, where the nuclear text span describes a situation or event of interest, and the satellite describes a deliberate action that caused the situation or event. Often these relations are signalled by discourse markers (cue words and phrases), such as but, when, yet, or after all, and it is important in text generation to use these markers appropriately. For example, the use of when in the following sentence enhances coherence,

(6.1)
John hung up the phone when the caller asked for his social security number,

by signalling a possible volitional result relation. Text spans linked by rhetorical relations may consist of single or multiple sentences, potentially leading to a recursive (though not necessarily strictly nested) structure. Rhetorical relations can serve communicative goals such as concept comprehension (e.g., via elaboration of a definition), belief (via presentation of evidence), or causal understanding (e.g., via a volitional result relation, as in (6.1)), and in this way tighten the connection between content planning and communicative goals.
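The marker-selection step in RST-based generation can be illustrated in miniature: once two realized spans are linked by a coherence relation, a cue word signalling that relation is inserted between them. The relation inventory and marker choices below are a small invented subset, using example (6.1) for the volitional result relation.

```python
# Toy cue-word selection: a discourse marker signals the rhetorical
# relation linking a nucleus and a satellite span.

MARKERS = {
    "volitional-result": "when",
    "contrast": "but",
    "concession": "yet",
}

def link(nucleus, satellite, relation):
    """Join nucleus and satellite with a marker signalling the relation."""
    return f"{nucleus} {MARKERS[relation]} {satellite}"

print(link("John hung up the phone",
           "the caller asked for his social security number",
           "volitional-result"))
```

A real generator must also decide whether to mark the relation at all, and where to attach the marker, since overuse of cue words degrades fluency.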

7. Making sense of, and engaging in, dialogue

“We can ask just how it is that rhetoric somehow moves us … Aristotle locates the essential nature of rhetorical undertakings in the ends sought rather than in the purely formal properties.” –Daniel N. Robinson, Consciousness and Mental Life (2007: 171–2)

Dialogue is interactive goal-directed (purposive) behavior, and in that sense the most natural form of language. More than in narrative or descriptive language, the flow of surface utterances and speaker alternation reflect the interplay of underlying speaker goals and intentions. By themselves, however, utterances in a dialogue are ambiguous as to their purpose, and an understanding of the discourse context and the domain of discourse is required to formulate or understand them. For example,

(7.1)
Do you know what day this is?

could be understood as a request for an answer such as “Thursday, June 24”, as a reminder of the importance of the day, or as a test of the addressee's mental alertness.

The immediate goal of such an utterance is to change the mental state (especially beliefs, desires and intentions) of the hearer(s), and speech act theory concerns the way in which particular types of speech acts effect such changes, directly or indirectly (Austin 1962; Grice 1968; Searle 1969). To choose speech acts sensibly, each participant also needs to take account of the mental state of the other(s); in particular, each needs to recognize the other's beliefs, desires and intentions. Discourse conventions in cooperative conversation are adapted to facilitate this process: The speakers employ locutions that reveal their intended effects, and their acknowledgments and turn-taking consolidate mutual understanding. In this way mixed-initiative dialogue and, potentially, cooperative domain action are achieved.

In the previous discussion of content planning in language generation, we said little about the formation of communicative intentions in this process. But in the context of purposive dialogue, it is essential to consider how a dialogue agent might arrive at the intention to convey certain ideas, such as episodic, instructional or descriptive information, a request, an acknowledgment and/or acceptance of a request, an answer to a question, an argument in support of a conclusion, etc.

As in the case of generating extended descriptions, narrative, arguments, etc., using RST, a natural perspective here is one centered around goal-directed planning. In fact, the application of this perspective to dialogue historically preceded its application to extended discourses. In particular, Cohen and Perrault (1979) proposed a reasoning, planning, and plan recognition framework that represents speech acts in terms of their preconditions and effects. For example, a simple INFORM speech act might have the following preconditions (formulated for understandability from a first-person speaker perspective):

  • The hearer (my dialogue partner) does not know whether a certain proposition X is true;
  • the hearer wants to be told by me whether X is true; and
  • I do in fact know whether X is true.

The effect of implementing the INFORM as an utterance is then that the hearer knows whether X is true. An important feature of such a framework is that it can account for indirect speech acts (Allen and Perrault 1980). For example, question (7.1), as an indirect request for the date or day of the week, can be viewed as indicating the speaker's awareness that the hearer can perform the requested information-conveying act only if the knowledge-precondition of that act is satisfied. Furthermore, since the hearer recognizes that questioning a precondition of a potential act is one indirect way of requesting the act, then (unless the context provides evidence to the contrary) the hearer will make the inference that the speaker wants the hearer to perform the information-conveying speech act. Note that the reasoning and planning framework must allow for iterated modalities such as “I believe that you want me to tell you what today's date is”, or “I believe (because of the request I just made) that you know that I want you to pass the salt shaker to me”. Importantly, there must also be allowance for mutual beliefs and intentions, so that a common ground can be maintained as part of the context and collaboration can take place. A belief is mutual if each participant holds the belief, and the participants mutually believe that they mutually hold the belief. The mutual knowledge of the participants in a dialogue can be assumed to include the overt contents of their utterances and common general knowledge, including knowledge of discourse conventions.
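The plan-operator view of speech acts can be rendered schematically as follows. The simple belief-set encoding and the flight-status proposition are invented for illustration; the actual framework uses a logic of belief and want, with the nested modalities discussed above, rather than flat feature sets.

```python
# Schematic INFORM-IF operator in the spirit of Cohen and Perrault's
# plan-based account: preconditions over the participants' mental states,
# and an effect on the hearer's knowledge. Mental states are modeled
# (very crudely) as sets of (attitude, proposition) pairs.

def inform_if(speaker_state, hearer_state, proposition):
    """Apply an INFORM-IF speech act if its preconditions hold."""
    preconditions = (
        ("knows-whether", proposition) not in hearer_state,    # hearer is ignorant
        ("wants-told-whether", proposition) in hearer_state,   # hearer wants to know
        ("knows-whether", proposition) in speaker_state,       # speaker knows
    )
    if not all(preconditions):
        return hearer_state  # act not applicable; state unchanged
    # effect: the hearer now knows whether the proposition is true
    return hearer_state | {("knows-whether", proposition)}

speaker = {("knows-whether", "flight-is-on-time")}
hearer = {("wants-told-whether", "flight-is-on-time")}
print(inform_if(speaker, hearer, "flight-is-on-time"))
```

Representing preconditions and effects this way is what lets a planner chain speech acts into plans, and lets a plan recognizer explain an observed utterance (including an indirect one) as a step toward some goal.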

Since the ultimate purpose of a dialogue may be to accomplish something in the world, not only in the minds of the participants, reasoning, goal-directed planning, and action need to occur at the domain level as well. The goals of speech acts are then not ends in themselves, but means to other ends in the domain, perhaps to be accomplished by physical action (such as equipment repair). As a result, task-oriented dialogues are apt to be structured in a way that follows or “echoes” the structure of the domain entities and the way they can be purposefully acted upon. Such considerations led to Grosz and Sidner's theory of dialogue structure in task-oriented dialogues (Grosz and Sidner 1986). Their theory centers around the idea of shifts of attention mediated by pushing and popping of “focus spaces” on a stack. Focus spaces hold structured representations of the domain actions under consideration. For example, setting a collaborative goal of attaching a part to some device would push a corresponding focus space onto the stack. As dictated by knowledge about the physical task, the participants might next verbally commit to the steps of using a screwdriver and some screws to achieve the goal, and this part of the dialogue would be mediated by pushing corresponding subspaces onto the focus stack. When a subtask is achieved, the corresponding focus space is popped from the stack.
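The stack discipline of this attentional model can be sketched directly, using the attachment example above. The task names and the minimal contents of a focus space are illustrative; Grosz and Sidner's focus spaces also track salient entities and their relations to the intentional structure.

```python
# Toy focus stack in the manner of Grosz and Sidner (1986): opening a
# (sub)task pushes a focus space for the domain action under discussion,
# and completing the subtask pops it.

class FocusStack:
    def __init__(self):
        self.spaces = []

    def push(self, task):              # a new (sub)goal enters the dialogue
        self.spaces.append({"task": task, "entities": set()})

    def pop(self):                     # the current subtask is achieved
        return self.spaces.pop()["task"]

    def current(self):
        return self.spaces[-1]["task"] if self.spaces else None

stack = FocusStack()
stack.push("attach part to device")
stack.push("use screwdriver and screws")
print(stack.current())   # use screwdriver and screws
print(stack.pop())       # use screwdriver and screws
print(stack.current())   # attach part to device
```

The stack explains, for instance, why a pronoun uttered after a subtask is closed is resolved against the resumed outer focus space rather than the just-popped one.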

Implementation of reasoning and planning frameworks covering both the iterated modalities needed for plan-based dialogue behavior and the realities of a task domain has proved feasible for constrained dialogues in restricted domains (e.g., Smith et al. 1995), but quickly comes up against a complexity barrier when the coverage of language and the scope of the domain of discourse are enlarged. Planning is in general NP-hard, indeed PSPACE-complete even in propositional planning formalisms (Bylander 1994), and plan recognition can be exponential in the number of goals to be recognized, even if all plans available for achieving the goals are known in advance (Geib 2004).

In response to this difficulty, researchers striving to build usable systems have experimented with a variety of strategies. One is to pre-equip the dialogue system with a hierarchy of carefully engineered plans suitable for the type of dialogue to be handled (such as tutoring, repair, travel planning or schedule maintenance), and to choose the logical vocabulary employed in NLU/NLG so that it meshes smoothly both with the planning operators and with the surface realization schemas aimed at the target domain. (As a noteworthy example of this genre, see Moore & Paris 1993.) In this way domain and text planning and surface realization become relatively straightforward, at least in comparison with systems that attempt to synthesize plans from scratch, or to reason extensively about the world, the interlocutor, the context, and the best way to express an idea at the surface level. But while such an approach is entirely defensible for an experimental system intended to illustrate the role of plans and intentions in a specialized domain, it leaves open the question of how large amounts of linguistic knowledge and world knowledge could be incorporated into a dialogue system, and used inferentially in planning communicative (and other) actions.

Another strategy for achieving more nearly practical performance has been to precode (and to some extent learn) more “reactive” (as opposed to deliberative) ways of participating in dialogue. Reactive techniques include (i) formulaic, schema-based responses (reminiscent of ELIZA) where such responses are likely to be appropriate; (ii) rule-based intention and plan recognition; e.g., an automated travel agent confronted with the elliptical input “Flights to Orlando” can usually assume that the user wishes to be provided with flight options from the user's current city to Orlando, in a time frame that may have been established previously; (iii) statistical domain plan recognition based on probabilistic modeling of the sequences of steps typically taken in pursuit of the goals characteristic of the domain; and (iv) discourse state modeling by classifying speech acts (or utterance acts) and dialogue states into relatively small numbers of types, and viewing transitions between dialogue states as events determined by the current state and current type of speech act. For example, in a state where the dialogue system has no immediate obligation, and the user asks a question, the system assumes the obligation of answering the question, and transitions to a state where it will try to discharge that obligation.
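Strategy (iv), discourse-state modeling, amounts to a transition system over small inventories of states and act types, and can be sketched as follows. The state and act names form an invented miniature inventory; practical systems derive such tables (or their probabilistic analogues) from annotated dialogue corpora.

```python
# Toy discourse-state model: the next dialogue state is a function of the
# current state and the type of the incoming speech act.

TRANSITIONS = {
    ("idle", "user-question"): "obliged-to-answer",
    ("obliged-to-answer", "system-answer"): "idle",
    ("idle", "user-greeting"): "obliged-to-greet",
    ("obliged-to-greet", "system-greeting"): "idle",
}

def step(state, act):
    # unlisted (state, act) pairs leave the state unchanged
    return TRANSITIONS.get((state, act), state)

state = "idle"
for act in ["user-question", "system-answer", "user-greeting"]:
    state = step(state, act)
print(state)  # obliged-to-greet
```

The appeal of this scheme is its tractability; its limitation, as noted in the following paragraph, is that the system's obligations are tracked without any deep model of why they arose.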

However, systems that rely primarily on reactive techniques tend to lack deep understanding and behavioral flexibility. Essentially, knowledge-based inference and planning are replaced by rote behavior, conditioned by miscellaneous features of the current discourse state and observations in the world. Furthermore, deliberate reasoning and plan synthesis seem necessary for an agent that can acquire effective goal-directed plans and behaviors autonomously. Although random trial and error (as in reinforcement learning), supervised learning, and learning by imitation are other learning options, their potential is limited. Random trial and error is apt to be impractical in the enormously large state space of linguistic and commonsense behavior; supervised learning (of appropriate choices based on contextual features) seems at best able to induce rote plan recognition and discourse state transitions (reactive behaviors of types (iii) and (iv) above); and imitation is possible only when relevant, readily observable exemplary behaviors can be presented to the learner—and by itself, it can lead only to rote, rather than reasoned, behavior.

Integrating reactive methods with deliberate reasoning and planning may be enabled in future by treating intentions and actions arrived at by reactive methods as tentative, to be verified and potentially modified by more deliberate reasoning if time permits. Excessive reasoning with iterated modalities could also be avoided with stronger assumptions about the attainment of mutual belief. For example, we might assume that both speaker and hearer spontaneously perform forward inferences about the world and about each other's mental states based on discourse events and common knowledge, and that such forward inferences directly become mutual knowledge (on a “likemindedness” assumption), thus shortcutting many modally nested inferences.

8. Acquiring knowledge for language

We have noted the dependence of language understanding and use on vast amounts of shallow and deep knowledge, about the world, about lexical and phrasal meaning, and about discourse and dialogue structure and conventions. If machines are to become linguistically competent, we need to impart this knowledge to them.

Ideally, the initial, preprogrammed knowledge of a machine would be restricted to the kinds of human knowledge thought to be innate (e.g., object persistence, motion continuity, basic models of animacy and mind, linguistic universals, means of classifying/taxonomizing the world, of organizing events in time, of abstracting from experience, and other such knowledge and know-how). The rest would be learned in human-like fashion. Unfortunately, we do not have embodied agents with human-like sensory and motor equipment or human-like innate mental capabilities; so apart from the simplest sort of verbal learning by robots, such as verbal labeling of objects or actions, or using spatial prepositions or two-word sentences (e.g., Fleischman and Roy 2005; McClain and Levinson 2007; Cour et al. 2008), most current work on knowledge acquisition uses either (1) hand-coding, (2) knowledge extraction from text corpora, or (3) crowdsourcing coupled with some method of converting collected, verbally expressed “facts” to a usable format. We focus in this section on the acquisition of general background knowledge needed to support language understanding and production, leaving discussion of linguistic knowledge acquisition to section 9.

8.1 Manual knowledge encoding

The best-known manually created body of commonsense knowledge is the Cyc or ResearchCyc knowledge base (KB) (Lenat 1995). This contains an ontology of a few hundred thousand concepts and several million facts and rules, backed up by an inference engine. It has been applied to analysis, decision support and other types of projects in business, education and military domains. However, the Cyc ontology and KB contents have been motivated primarily by knowledge engineering considerations (often for specific projects) rather than by application to language understanding, and this is reflected in its heavy reliance on very specific predicates expressed as concatenations of English words, and on higher-order operators. For example, the relation between killing and dying is expressed using the predicates lastSubEvents, KillingByOrganism-Unique, and Dying, and relies on a higher-order relation relationAllExists that can be expanded into a quantified conditional statement. This remoteness from language makes it difficult to apply the Cyc KB to language understanding, especially if the goal is to extract relevant concepts and axioms from this KB and integrate them with concepts and axioms formalized in a more linguistically oriented representation (as opposed to adopting the CycL language, Cyc KB, axioms about English, and inference mechanisms wholesale) (e.g., Conesa et al. 2010).

Other examples of hand-coded knowledge bases are the Component Library (CLib) (Barker et al. 2001), and a collection of commonsense psychological axioms by Hobbs and Gordon (2005). CLib provides a broad upper (i.e., high-level) ontology of several hundred concepts, and axioms about basic actions (conveying, entering, breaking, etc.) and resultant change. However, the frame-based Kleo knowledge representation used in CLib is not close to language, and the coverage of the English lexicon is sparse. The psychological axioms of Hobbs and Gordon are naturally narrow in focus (memories, beliefs, plans, and goals), and it remains to be seen whether they can be used effectively in conjunction with language-derived logical forms (of the “flat” type favored by Hobbs) for inference in discourse contexts.

Knowledge adaptation from semi-formalized sources can, for example, consist of extracting part-of-speech and subcategorization information as well as stock phrases and idioms from appropriate dictionaries. It may also involve mapping hypernym hierarchies, meronyms (parts), or antonyms, as catalogued in sources like WordNet, into some form usable for disambiguation and inference. The main limitations of manually coded lexical knowledge are its grounding in linguistic intuitions without direct consideration of its role in language understanding, and its inevitable incompleteness, given the ever-expanding and shifting vocabulary, jargon, and styles of expression in all living languages.

Besides these sources of lexical knowledge, there are also sources of world knowledge in semi-formalized form, such as tabulations and gazetteers of various sorts, and “info boxes” in online knowledge resources such as Wikipedia (e.g., the entries for notable personages contain a box with summary attributes such as date of birth, date of death, residence, citizenship, ethnicity, fields of endeavor, awards, and others). But to the extent that such sources provide knowledge in a regimented, and thus easily harvested form, they do so only for named entities (such as people, organizations, places, and movies) and a few entity types (such as biological species and chemical compounds). Moreover, much of our knowledge about ordinary concepts, such as that of a tree or that of driving a car, is not easily captured in the form of attribute-value pairs, and is generally not available in that form.

8.2 Knowledge extraction from text

Knowledge extraction from unconstrained text has in recent years been referred to as learning by reading. The extraction method may be either direct or indirect. A direct method takes sentential information from some reliable source, such as word sense glosses in WordNet or descriptive and narrative text in encyclopedias such as Wikipedia, and maps this information into a (more) formal syntax for expressing generic knowledge. Indirect methods abstract (more or less) formal generic knowledge from the patterns of language found in miscellaneous reports, stories, essays, weblogs, etc.

Reliably extracting knowledge by the direct method requires relatively deep language understanding, and consequently is far from a mature technology. Ide and Véronis (1994) provide a survey of early work on deriving knowledge from dictionary definitions, and the difficulties faced by that enterprise. For the most part, knowledge obtained in this way to date has been either low in quantity or in quality (from an inference perspective). More recent work that shows promise is that of Moldovan and Rus (2001), aimed at interpreting WordNet glosses for nominal artifact concepts, and that of Allen et al. (2013), aimed at forming logical theories of small clusters of related verbs (e.g., sleeping, waking up, etc.) by interpreting their WordNet glosses.

The most actively researched approach to knowledge extraction from text in the last two decades has been the indirect one, beginning with a paper by Marti Hearst demonstrating that hyponymy relations could be rather simply and effectively discovered by the use of lexicosyntactic extraction patterns (Hearst 1992). For example, extraction patterns that look for noun phrases separated by “such as” or “and other” will match word sequences like “seabirds such as penguins and albatrosses” or “beans, nuts, and other legumes”, leading to hypotheses that seabird is a hypernym of penguin and albatross, and that beans and nuts are hyponyms of legumes. By looking for known hyponym-hypernym pairs in close proximity, Hearst was able to expand the initial set of extraction patterns and hence the set of hypotheses. Many variants have been developed since then, with improvements such as automation of the bootstrapping and pattern discovery methods, often with machine learning techniques applied to selection, weighting and combination of local features in the immediate vicinity of the relata of interest. Relations other than hyponymy that have been targeted, of relevance to language understanding, include part-of relations, causal relations, and telic relations (such as that the use of milk is to drink it).
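The flavor of such lexicosyntactic patterns can be conveyed by a small, hypothetical sketch. For simplicity a “noun phrase” here is a single lowercase word; a real system (unlike this illustration, which is not Hearst's implementation) would match chunker or parser output instead:

```python
import re

# Simplified stand-in for a noun phrase: one lowercase word.
NP = r"[a-z]+"

def hearst_pairs(text):
    """Return (hyponym, hypernym) hypotheses found in text."""
    text = text.lower()
    pairs = []
    # Pattern 1: "NP_hyper such as NP_hypo (and NP_hypo)"
    for m in re.finditer(rf"({NP}) such as ({NP})(?: and ({NP}))?", text):
        hyper = m.group(1)
        pairs += [(h, hyper) for h in m.groups()[1:] if h]
    # Pattern 2: "NP_hypo(,) and other NP_hyper"
    # (this simplified version catches only the nearest conjunct)
    for m in re.finditer(rf"({NP}),? and other ({NP})", text):
        pairs.append((m.group(1), m.group(2)))
    return pairs
```

Applied to “Seabirds such as penguins and albatrosses dive”, the sketch hypothesizes that penguins and albatrosses are hyponyms of seabirds; applied to “beans, nuts, and other legumes”, it catches the nearest conjunct (nuts, legumes).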

While knowledge extraction using Hearst-like patterns is narrowly aimed at certain predetermined types of knowledge, other approaches are aimed at open information extraction (OIE). These seek to discover a broad range of relational knowledge, in some cases including entailments (in a rather loose sense) between different relations. An early and quite successful system of this genre was Lin and Pantel's DIRT system (Discovery of Inference Rules from Text), which used collocational statistics to build up a database of “inference rules” (Lin and Pantel 2001). An example of a rule might be “X finds a solution to Y” ⇒ “X solves Y”. The statistical techniques used included clustering of nominals into similarity groups based on their tendency to occur in the same argument positions of the same verbs, and finding similar relational phrases (such as “finds a solution to” and “solves”), based on their tendency to connect the same, or similar, pairs of nominals. Many of the rules were later refined by addition of type constraints to the variables, obtained by abstracting from particular nominals via WordNet (Pantel et al. 2007).

An approach to OIE designed for maximum speed is exemplified by the TextRunner system (Banko et al. 2007). TextRunner is extraction pattern-based, but rather than employing patterns tuned to a few selected relations, it uses a range of patterns obtained automatically from a syntactically parsed training corpus, weighted via Bayesian machine learning methods to extract miscellaneous relations sentence-by-sentence from text. A rather different approach, termed “open knowledge extraction” (OKE), derives logical forms from parsed sentences, and simplifies and abstracts these so that they will tend to reflect general properties of the world. This is exemplified by the KNEXT system (KNowledge EXtraction from Text) (e.g., Schubert and Tong 2003). For example, the sentence “I read a very informative book about China” allows KNEXT to abstract “factoids” to the effect that a person may occasionally read a book, and that books may occasionally be informative, and may occasionally be about a country. (Note that the specific references to the speaker and China have been abstracted to classes.) Another interesting development has been the extraction of script-like sequences of relations from large corpora by collocational methods (see Chambers and Jurafsky 2009). For example, the numerous newswire reports about arrest and prosecution of criminals can be mined to abstract typical event types involved, in chronological order, such as arrest, arraignment, plea, trial, and so on. A difficulty in all of this work is that most of the knowledge obtained is too ambiguously and incompletely formulated to provide a basis for inference chaining (but see for example Van Durme et al. 2009; Gordon and Schubert 2010; Schoenmackers et al. 2010).

8.3 Crowdsourcing

The crowdsourcing approach to the acquisition of general knowledge consists of soliciting verbally expressed information, or annotations of such information, from large numbers of web users, sometimes using either small financial rewards or the challenge of participating in simple games as inducements (Havasi et al. 2007; von Ahn 2006). Crowdsourcing has proved quite reliable for simple annotation/classification tasks (e.g., Snow et al. 2008; Hoffmann et al. 2009). However, general knowledge offered by non-expert users is in general much less carefully formulated than, say, encyclopedia entries or word sense glosses in lexicons, and still requires natural language processing if formal statements are to be abstracted. Nevertheless, the Open Mind Common Sense project has produced a relational network of informal commonsense knowledge (ConceptNet), based on simple English statements from worldwide contributors, that proved useful for improving interpretation in speech recognition and other domains (Lieberman et al. 2004; Faaborg et al. 2005).

The overall picture that emerges is that large-scale resources of knowledge for language, whether lexical or about the world, still remain too sparse and too imprecise to allow scaling up of narrow-domain NLU and dialogue systems to broad-coverage understanding. But such knowledge is expected to prove crucial eventually in general language understanding, and so the quest for acquiring this general knowledge remains intensely active.

9. Statistical NLP

“All the thousands of times you've heard clause-final auxiliary verbs uncontracted strengthen the probability that they're not allowed to contract.” –Geoff Pullum (2011)

We have already referred to miscellaneous statistical models and techniques used in various computational tasks, such as (in section 2) HMMs in POS tagging, probabilistic grammar modeling and parsing, statistical semantics, semantic disambiguation (word senses, quantifier scope, etc.), plan recognition, discourse modeling, and knowledge extraction from text. Here we try to provide a brief, but slightly more systematic taxonomy of the types of tasks addressed in statistical NLP, and some sense of the modeling techniques and algorithms that are most commonly used and have made statistical NLP so predominant in recent years, challenging the traditional view of computational linguistics.

This traditional view focuses on deriving meaning, and rests on the assumption that the syntactic, semantic, pragmatic, and world knowledge employed in this derivation is “crisp” as opposed to probabilistic; i.e., the distributional properties of language are a mere byproduct of linguistic communication, rather than an essential factor in language understanding, use, or even learning. Thus the emphasis, in this view, is on formulating nonprobabilistic syntactic, semantic, pragmatic, and KR theories to be deployed in language understanding and use. Of course, the problem of ambiguity has always been a focal issue in building parsers and language understanding systems, but the prevailing assumption was that ambiguity resolution could be accomplished by supplementing the interpretive routines with some carefully formulated heuristics expressing syntactic and semantic preferences.

However, experience has revealed that the ambiguities that afflict the desired mappings are far too numerous, subtle, and interrelated to be amenable to heuristic arbitration. Instead, linguistic phenomena need to be treated as effectively stochastic, and the distributional properties resulting from these stochastic processes need to be systematically exploited to arrive at reasonably reliable hypotheses about underlying structure. (The Geoff Pullum quote above is relevant to this point: The inadmissibility of contracting the first occurrence of I am to I'm in “I'd rather be hated for who I am, than loved for who I am not” is not easily ascribed to any grammatical principle, yet—based on positive evidence alone—becomes part of our knowledge of English usage.) Thus the emphasis has shifted, at least for the time being, to viewing NLP as a problem of uncertain inference and learning in a stochastic setting.

This shift is significant from a philosophical perspective, not just a practical one: It suggests that traditional thinking about language may have been too reliant on introspection. The limitation of introspection is that very little of what goes on in our brains when we comprehend or think about language is accessible to consciousness (see for example the discussion of “two-channel experiments” in Baars 1997). We consciously register the results of our understanding and thinking, apparently in symbolic form, but not the understanding and thinking processes themselves; and these symbolic abstractions, to the extent that they lack quantitative or probabilistic dimensions, can lead us to suppose that the underlying processing is nonquantitative as well. But the successes of statistical NLP, as well as recent developments in cognitive science (e.g., Fine et al. 2013; Tenenbaum et al. 2011; Chater and Oaksford 2008), suggest that language and thinking are not only symbolic, but deeply quantitative and in particular probabilistic.

For the first twenty years or so, the primary goals in statistical NLP have been to assign labels, label sequences, syntax trees, or translations to linguistic inputs, using statistical language models trained on large corpora of observed language use. More fully, the types of tasks addressed can be grouped roughly as follows (where the appended keywords indicate typical applications):

  • text/document classification: authorship, Reuters news category, sentiment analysis;
  • classification of selected words or phrases in sentential or broader contexts: word sense disambiguation, named entity recognition, multiword expression recognition;
  • sequence labeling: acoustic features → phones → phonemes → words → POS tags;
  • structure assignment to sentences: parsing, semantic role labeling, quantifier scoping;
  • sentence transduction: MT, LF computation;
  • structure assignment to multi-sentence texts: discourse relations, anaphora, plan recognition;
  • large-scale relation extraction: knowledge extraction, paraphrase and entailment relations.

These groups may seem to differ haphazardly, but as we will further discuss, certain techniques and distinctions are common to many of them, notably

  • in modeling: numeric and discrete features, vector models, log-linear models, Markov models; generative versus discriminative models, parametric versus non-parametric models;
  • in learning from data: maximum likelihood estimation, maximum entropy, expectation maximization, dynamic programming; supervised versus unsupervised learning; and
  • in output computation: dynamic programming; unique outputs versus distributions over outputs.

We now try to provide some intuitive insight into the most important techniques and distinctions involved in the seven groups of tasks above. For this purpose, we need not comment further on quantifier scoping (in the fourth group) or any of the items in the sixth and seventh groups, as these are for the most part covered elsewhere in this article. In all cases, the two major requirements are the development (aided by learning) of a probabilistic model relating linguistic inputs to desired outputs, and the algorithmic use of the model in assigning labels or structures to previously unseen inputs.

Text and document classification: In classifying substantial documents, the features used might be normalized occurrence frequencies of particular words (or word classes) and punctuation. Especially for shorter texts, various discrete features may be included as well, such as 0, 1-valued functions indicating the presence or absence of certain key words or structural features. In this way documents are represented as numerical vectors, with values in a high-dimensional space, with separate classes presumably forming somewhat separate clusters in that space. A variety of classical pattern recognition techniques are applicable to the problem of learning to assign new documents (as vectors) to the appropriate class (e.g., Sebestyen 1962; Duda and Hart 1973). Perhaps the simplest approach (most easily applied when features are binary) is a naïve Bayesian one, which assumes that each class generates feature values that are independent of one another. The generative frequencies are estimated from the training data, and class membership probabilities for an unknown document (vector) are computed via Bayes' rule (which can be done using successive updates of the prior class probabilities). Choosing the class with the highest resultant posterior probability then provides a decision criterion. A common generative model for real-valued features, allowing for feature interactions, views the known members of any given class as a sample of a multivariate normal (Gaussian) random variable. Learning in this case consists of estimating the mean and covariance matrix of each class (an example of maximum likelihood estimation).
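The naïve Bayesian approach with binary word features can be made concrete with a minimal sketch (toy data and add-one smoothing, invented for illustration; not any of the cited systems):

```python
from collections import Counter, defaultdict
import math

class NaiveBayes:
    """Bernoulli naive Bayes: each class independently 'generates' the
    presence or absence of each vocabulary word."""

    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d.split()}
        self.prior = Counter(labels)                 # class frequencies
        self.counts = defaultdict(Counter)           # label -> word -> doc count
        for d, y in zip(docs, labels):
            self.counts[y].update(set(d.split()))
        return self

    def predict(self, doc):
        words = set(doc.split())
        def log_posterior(y):
            n = self.prior[y]
            lp = math.log(n / sum(self.prior.values()))   # log prior
            for w in self.vocab:   # presence AND absence both carry evidence
                p = (self.counts[y][w] + 1) / (n + 2)     # add-one smoothing
                lp += math.log(p if w in words else 1 - p)
            return lp
        return max(self.prior, key=log_posterior)
```

For example, trained on a few sentiment-labeled snippets, the classifier picks the class whose generative model makes the observed word presences and absences most probable.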

A traditional discriminative approach, not premised on any generative model, involves the computation of hyperplanes that partition the clusters of known class instances from one another (optimizing certain metrics involving in-class and between-class variance); new instances are assigned to the class into whose partition they fall. Perceptrons provide a related technique, insofar as they decide class membership on the basis of a linear combination of feature values; their particular advantage is that they can learn incrementally (by adjusting feature weights) as more and more training data become available. Another durable discriminative approach—not dependent on linear separability of classes—is the k nearest neighbors (kNN) method, which assigns an unknown text or document to the class that is most prevalent among its k (e.g., 1–5) nearest neighbors in vector space. While all the previously mentioned methods depended on parameter estimation (e.g., generative probabilities, Gaussian parameters, or coefficients of separating planes), kNN uses no such parameters—it is nonparametric; however, finding a suitable measure of proximity or similarity can be challenging, and errors due to haphazard local data point configurations in feature space are hard to avoid. Another nonparametric discriminative method worth mentioning is the use of decision trees, which can be learned using information-theoretic techniques; they enable choice of a class by following a root-to-leaf path, with branches chosen via tests on features of a given input vector. A potentially useful property is that learned decision trees can provide insight into what the most important features are (such insight can also be provided by dimensionality reduction methods). However, decision trees tend to converge to nonglobal optima (global optimization is NP-hard), and by splitting data, tend to block modeling of feature interactions; this defect can be alleviated to some extent through the use of decision forests.
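The kNN method in particular is simple enough to sketch in a few lines (Euclidean distance over invented two-dimensional "documents"; a real system would use high-dimensional feature vectors and a carefully chosen similarity measure):

```python
import math
from collections import Counter

def knn_classify(x, examples, k=3):
    """examples: list of (vector, label) pairs; x: vector to classify.
    Assigns x the majority label among its k nearest training vectors."""
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Note that "training" amounts to storing the examples; all the work happens at classification time, which is why kNN is called nonparametric.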

Having mentioned some of the traditional classification methods, we now sketch two techniques that have become particularly prominent in statistical NLP since the 1990s. The first, with mathematical roots dating to the 1950s, is maximum entropy (MaxEnt), also called (multinomial) logistic regression (e.g., Ratnaparkhi 1997). Features in this case are any desired 0, 1-valued (binary) functions of both a given linguistic input and a possible class. (For continuous features, supervised or unsupervised discretization methods may be applied, such as entropy-based partitioning into some number of intervals.) Training data provide occurrence frequencies for these features, and a distribution is derived for the conditional probability of a class, given a linguistic input. (As such, it is a discriminative method.) As its name implies, this conditional probability function is a maximum-entropy distribution, constrained to conform with the binary feature frequencies observed in the training data. Its form (apart from a constant multiplier) is an exponential whose exponent is a linear combination of the binary feature values for a given input and given class. Thus it is a log-linear model (a distribution whose logarithm is linear in the features)—a type of model now prevalent in many statistical NLP tasks. Note that since its logarithm is a linear combination of binary feature values for any given input and any given class, choosing the maximum-probability class for a given input amounts to linear decision-making, much as in some of the classical methods; however, MaxEnt generally provides better classification performance, and the classification probabilities it supplies can be useful in further computations (e.g., expected utilities).
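The log-linear form can be illustrated with a toy sketch: each binary feature of an (input, class) pair carries a weight, and the conditional class distribution is the normalized exponential of the weighted feature sum. The features and weights below are invented for illustration; in practice the weights are learned from feature frequencies in training data:

```python
import math

def features(x, y):
    """Binary features of an (input word, candidate class) pair."""
    return {
        ("ends_in_ing", "VERB"): x.endswith("ing") and y == "VERB",
        ("capitalized", "NOUN"): x[0].isupper() and y == "NOUN",
        ("short", "VERB"): len(x) <= 3 and y == "VERB",
    }

WEIGHTS = {("ends_in_ing", "VERB"): 1.5,   # hypothetical learned weights
           ("capitalized", "NOUN"): 2.0,
           ("short", "VERB"): 0.3}

def p_class(x, classes=("NOUN", "VERB")):
    """p(y|x) proportional to exp(sum of weights of active features)."""
    scores = {}
    for y in classes:
        f = features(x, y)
        scores[y] = math.exp(sum(w for k, w in WEIGHTS.items() if f[k]))
    z = sum(scores.values())          # normalizing constant
    return {y: s / z for y, s in scores.items()}
```

Since the log of each unnormalized score is linear in the feature values, picking the maximum-probability class is a linear decision, as noted above; but the normalized scores are genuine conditional probabilities, usable in further computations.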

Another method important in the emergence and successes of statistical NLP is the support vector machine (SVM) method (Boser et al. 1992; Cortes and Vapnik 1995). The great advantage of this method is that it can in principle distinguish arbitrarily configured classes, by implicitly projecting the original vectors into a higher- (or infinite-) dimensional space, where the classes are linearly separable. The projection is mediated by a kernel function—a similarity metric on pairs of vectors, such as a polynomial in the dot product of the two vectors. Roughly speaking, the components of the higher-dimensional vector correspond to terms of the kernel function, if it were expanded out as a sum of products of the features of the original, unexpanded pair of vectors. But no actual expansion is performed, and moreover the classification criterion obtained from a given training corpus only requires calculation of the kernel function for the given feature vector (representing the document to be classified) paired with certain special “support vectors”, and comparison of a linear combination of the resulting values to a threshold. The support vectors belong to the training corpus, and define two parallel hyperplanes that separate the classes in question as much as possible (in the expanded space). (Hence this is a “max-margin” discriminative method.) SVMs generally provide excellent accuracy, in part because they allow for nonlinear feature interaction (in the original space), and in part because the max-margin method focuses on class separation, rather than conditional probability modeling of the classes. On the other hand, MaxEnt classifiers are more quickly trainable than SVMs, and often provide satisfactory accuracy. General references covering the classification methods we have sketched are (Duda et al. 2001; Bishop 2006).
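The shape of the resulting classification criterion can be sketched as follows: a weighted sum of kernel values between the input and the support vectors, compared to a threshold. The support vectors, weights, and bias below are invented for illustration; in practice they are the output of max-margin training, which is not shown:

```python
def poly_kernel(u, v, d=2):
    """Polynomial kernel: (1 + u.v)^d, an implicit degree-d feature map."""
    return (1 + sum(a * b for a, b in zip(u, v))) ** d

# Hypothetical trained model: (support vector, class label +/-1, weight alpha)
SUPPORT = [((1.0, 1.0), +1, 0.5),
           ((-1.0, -1.0), -1, 0.5)]
BIAS = 0.0

def classify(x):
    """Sign of the kernelized decision function at x."""
    score = sum(alpha * y * poly_kernel(s, x)
                for s, y, alpha in SUPPORT) + BIAS
    return 1 if score >= 0 else -1
```

The point of the sketch is that only kernel evaluations against the stored support vectors are needed at classification time; the high-dimensional feature expansion is never computed explicitly.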

Classification of selected words or phrases in sentential or broader contexts: As noted earlier, examples include WSD, named entity recognition, and sentence boundary detection. The only point of distinction from text/document classification is that it is not a chunk of text as a whole, but rather a word or phrase in the context of such a chunk, that is to be classified. Therefore features are chosen to reflect both the features of the target word or phrase (such as morphology) and the way it relates to its context, in terms of, e.g., surrounding words or word categories, (likely) local syntactic dependency relations, and features with broader scope such as word frequencies or document class. Apart from this difference in how features are chosen, the same (supervised) learning and classification methods discussed above can be applied. However, sufficiently large training corpora may be hard to construct. For example, in statistical WSD (e.g., Yarowsky 1992; Chen et al. 2009), since thousands of words have multiple senses in sources such as WordNet, it is difficult to construct a sense-annotated training corpus that contains sufficiently many occurrences of all of these senses to permit statistical learning. Thus annotations are typically restricted to the senses of a few polysemous words, and statistical WSD has been shown to be feasible for the selected words, but broad-coverage WSD tools remain elusive.

Sequence labeling: There is a somewhat arbitrary line between the preceding task and sequence labeling. For example, it is quite possible to treat POS tagging as a task of classifying words in a text in relation to their context. However, such an approach fails to exploit the fact that the classifications of adjacent words are interdependent. For example, in the sentence (from the web) “I don't fish like most people”, the occurrence of don't should favor classification of fish as a verb, which in turn should favor classification of like as a preposition. (At least such preferences make sense for declarative sentences; replacing ‘I’ by ‘why’ would change matters—see below.) Such cascaded influences are not easily captured through successive independent classifications, and they motivate generative sequence models such as HMMs. For POS tagging, a labeled training corpus can supply estimates of the probability of any POS for the next word, given the POS of the current word. If the corpus is large enough, it can also supply estimates of word “emission” probabilities for a large proportion of words generally seen in text, i.e., their probability of occurring, given the POS label. (Smoothing techniques are used to fill in non-zero probabilities for unknown words, given a POS.) We previously mentioned the Viterbi algorithm as an efficient dynamic programming algorithm for applying an HMM (trained as just mentioned) to the task of assigning a maximum-probability POS tag sequence to the words of a text. Two related algorithms, the forward and backward algorithms, can be used to derive probabilities of the possible labels at each word position i, which may be more useful than the “best” label sequence for subsequent higher-level processing. The forward algorithm in effect (via dynamic programming) sums the probabilities of all label sequences up to position i that end with a specified label X at word position i and that generate the input up to (and including) that word. The backward algorithm sums the probabilities of all label sequences that begin with label X at position i, and generate the input from position i+1 to the end. The product of the forward and backward probabilities, normalized so that the probabilities of the alternative labels at position i sum to 1, gives the probability of X at i, conditioned on the entire input.
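A toy implementation may help make the Viterbi and forward-backward computations concrete. The two-tag HMM below, with all its probabilities, is invented for illustration:

```python
TAGS = ("N", "V")
START = {"N": 0.7, "V": 0.3}                              # initial tag probs
TRANS = {("N", "N"): 0.3, ("N", "V"): 0.7,                # tag-to-tag probs
         ("V", "N"): 0.8, ("V", "V"): 0.2}
EMIT = {("N", "fish"): 0.5, ("V", "fish"): 0.4,           # word given tag
        ("N", "swim"): 0.1, ("V", "swim"): 0.6,
        ("N", "they"): 0.4, ("V", "they"): 0.0}

def viterbi(words):
    """Maximum-probability tag sequence, by dynamic programming."""
    best = [{t: (START[t] * EMIT[t, words[0]], [t]) for t in TAGS}]
    for w in words[1:]:
        layer = {}
        for t in TAGS:
            p, path = max((best[-1][s][0] * TRANS[s, t], best[-1][s][1])
                          for s in TAGS)
            layer[t] = (p * EMIT[t, w], path + [t])
        best.append(layer)
    return max(best[-1].values())[1]

def posteriors(words):
    """Per-position tag probabilities, via forward and backward sums."""
    n = len(words)
    fwd = [{t: START[t] * EMIT[t, words[0]] for t in TAGS}]
    for w in words[1:]:
        fwd.append({t: sum(fwd[-1][s] * TRANS[s, t] for s in TAGS)
                       * EMIT[t, w] for t in TAGS})
    bwd = [{t: 1.0 for t in TAGS}]
    for i in range(n - 2, -1, -1):
        bwd.insert(0, {t: sum(TRANS[t, s] * EMIT[s, words[i + 1]] * bwd[0][s]
                              for s in TAGS) for t in TAGS})
    out = []
    for f, b in zip(fwd, bwd):
        z = sum(f[t] * b[t] for t in TAGS)                # normalize at i
        out.append({t: f[t] * b[t] / z for t in TAGS})
    return out
```

On “they fish”, Viterbi returns the single best sequence N V, while the posteriors additionally report how confident the model is in the tag at each position.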

All learning methods referred to so far have been supervised learning methods—a corpus of correctly labeled texts was assumed to be available for inferring model parameters. But methods have been developed for unsupervised (or semi-supervised) learning as well. An important unsupervised method of discovering HMM models for sequence labeling is the forward-backward (or Baum-Welch) algorithm. A simple version of this algorithm in the case of POS tagging relies on a lexicon containing the possible tags for each word (which are easily obtained from a standard lexicon). Some initial, more or less arbitrarily chosen values of the HMM transition and emission probabilities are then iteratively refined based on a training corpus. A caricature of the iterative process would be this: We use the current guesses of the HMM parameters to tag the training corpus; then we re-estimate those parameters just as if the corpus were hand-tagged. We repeat these two steps till convergence. The actual method used is more subtle in the way it uses the current HMM parameters. (It is a special case of EM: Expectation Maximization.) Rather than re-estimating the parameters based on occurrence frequencies in the current “best” tag sequence, it uses the expected number of occurrences of particular pairs of successive states (labels), dividing this by the expected number of occurrences of the first member of the pair. These expected values are determined by the conditional probability distribution over tag sequences, given the training corpus and the current HMM parameters, and can be obtained using the forward and backward probabilities as described above (and thus, conditioned on the entire corpus). Revised emission probabilities for any X → w can be computed as the sum of probabilities of X-labels at all positions where word w occurs in the corpus, divided by the sum of probabilities of X-labels at all positions, again using (products of) forward and backward probabilities.

Unfortunately EM is not guaranteed to find a globally optimal model. Thus good results can be achieved only by starting with a “reasonable” initial HMM, for example assigning very low probabilities to certain transitions (such as determiner → determiner, determiner → verb, adjective → verb). Semi-supervised learning might start with a relatively small labeled training corpus, and use the corresponding HMM parameter estimates as a starting point for unsupervised learning from further, unlabeled texts.

A weakness of HMMs themselves is that the Markov assumption (independence of non-neighbors, given the neighbors) is violated by longer-range dependencies in text. For example, in the context of a relative clause (signaled by a noun preceding that clause), a transitive verb may well lack an NP complement (“I collected the money he threw down on the table.”), and as a result, words following the verb may be tagged incorrectly (down as a noun). A discriminative approach that overcomes this difficulty is the use of conditional random fields (CRFs). Like HMMs (which they subsume), these allow for local interdependence of hidden states, but employ features that depend not only on adjacent pairs of these states, but also on any desired properties of the entire input. Mathematically, the method is very similar to MaxEnt (as discussed above). The feature coefficients can be learned from training data either by gradient ascent or by an incremental dynamic programming method related to the Baum-Welch algorithm, called improved iterative scaling (IIS) (Della Pietra et al. 1997; Lafferty et al. 2001). CRFs have been successful in many applications other than POS tagging, such as sentence and word boundary detection (e.g., for Chinese), WSD, extracting tables from text, named entity recognition, and—outside of NLP—in gene prediction and computer vision.

Structure assignment to sentences: The use of probabilistic context-free grammars (PCFGs) was briefly discussed in section 2. Supervised learning of PCFGs can be implemented much like supervised learning of HMMs for POS tagging. The required conditional probabilities of phrase expansion are easily estimated if a large corpus annotated with phrase bracketings (a treebank) is available (though estimates of POS → word expansion probabilities are best supplemented with additional data). Once learned, a PCFG can be used to assign probabilistically weighted phrase structures to sentences using the chart parsing method mentioned in section 2—again a dynamic programming method.
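The relative-frequency estimation step can be sketched directly: count each expansion in the treebank and divide by the total count of its left-hand-side nonterminal. The tree encoding and the tiny "treebank" below are invented for illustration; real treebank estimation adds smoothing and handling of unknown words:

```python
from collections import Counter

def count_rules(tree, counts):
    """Trees are nested tuples (label, child, ...); strings are words.
    Adds one count per rule occurrence, recursing into subtrees."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            count_rules(c, counts)

def estimate_pcfg(treebank):
    """Rule probability = rule count / count of its left-hand side."""
    counts = Counter()
    for tree in treebank:
        count_rules(tree, counts)
    lhs_totals = Counter()
    for (lhs, rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

treebank = [
    ("S", ("NP", "dogs"), ("VP", "bark")),
    ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats"))),
]
pcfg = estimate_pcfg(treebank)
```

Here S → NP VP gets probability 1 (it is the only S expansion observed), while the two VP expansions each get probability 1/2.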

Also, unsupervised learning of PCFGs is possible using the EM approach. This is important, since it amounts to grammar discovery. The only assumption we start with, theoretically, is that there is some maximum number of nonterminal symbols, and each can be expanded into any two nonterminals or into any word (Chomsky normal form). Also we associate some more or less arbitrary initial expansion probabilities with these rules. The probabilities are iteratively revised using expected values of the frequency of occurrence of the possible expansions, based on the current PCFG model, conditioned on the corpus. The analogue of the forward-backward algorithm for computing these expectations is the inside-outside algorithm. Inside probabilities specify the probability that a certain proper segment of a given sentence will be derived from a specified nonterminal symbol. Outside probabilities specify the probability that all but a certain segment of the given sentence will be derived from the start (sentence) symbol, where that “missing” segment remains to be generated from a specified nonterminal symbol. The inside and outside probabilities play roles analogous to the backward and forward probabilities in HMM learning respectively. Conceptually, they require summations over exponentially many possible parse trees for a given sentence, but in fact inside probabilities can be computed efficiently by the CYK algorithm (section 2), and outside probabilities can also be computed efficiently, using a top-down recursive “divide and conquer” algorithm that makes use of previously computed inside probabilities.

Modest successes have been achieved in learning grammars in this way. The complexity is high (cubic-time in the size of the training corpus as well as in the number of nonterminals), and as noted, EM does not in general find globally optimal models. Thus it is important to place some constraints on the initial grammar, e.g., allowing nonterminals to generate either pairs of nonterminals or words, but not both, and also severely limiting the number of allowed nonterminals. A method of preferring small rule sets over large ones, without setting a fixed upper bound, is the use of a Dirichlet process that supplies a probability distribution over the probabilities of an unbounded number of rules. (This method is nonparametric, in the sense that it does not commit to any fixed number of building blocks or parameters in the modeling.) Whatever method of bounding the rules is used, the initial PCFG must be carefully chosen if a reasonably good, meaningful rule set is to be learned. One method is to start with a linguistically motivated grammar and to use “symbol splitting” (also called “state splitting”) to generate variants of nonterminals that differ in their expansion rules and probabilities. Recent spectral algorithms offer a relatively efficient, and globally optimal, alternative to EM (Cohen et al. 2013), and they can be combined with symbol splitting.

Like HMMs, PCFGs are generative models, and like them suffer from insufficient sensitivity of local choices to the larger context. CRFs can provide greater context-sensitivity (as in POS tagging and other types of sequence labeling); though they are not directly suited to structure assignment to text, they can be used to learn shallow parsers, which assign phrase types only to nonrecursive phrases (core NPs, PPs, VPs, etc.) (Sha and Pereira 2003).

In the current grammar-learning context, we should also mention connectionist models once more. Such models have shown some capacity for learning to parse from a set of training examples, but achieving full-scale parsing in this way remains a challenge. Also a controversial issue is the capacity of nonsymbolic NNs to exhibit systematicity in unsupervised learning, i.e., demonstrating a capacity to generalize from unannotated examples. This requires, for example, the ability to accept or generate sentences wherein verb arguments appear in positions different from those seen in the training set. According to Brakel and Frank (2009), systematicity can be achieved with simple recurrent networks (SRNs). However, computational demonstrations have generally been restricted to very simple, English-like artificial languages, at least when inputs were unannotated word streams.

A structure assignment task that can be viewed as a step towards semantic interpretation is semantic role labeling (Palmer et al. 2010). The goal is to assign thematic roles such as agent, theme, recipient, etc. to core phrases or phrasal heads in relation to verbs (and perhaps other complement-taking words). While this can be approached as a sequence labeling problem, experimental evidence shows that computing parse trees and using resulting structural features for role assignment (or jointly computing parse trees and roles) improves precision. A frequently used training corpus for such work is PropBank, a version of the Penn Treebank annotated with “neutral” roles arg0, arg1, arg2, etc.

Sentence transduction: The most intensively studied type of statistical sentence transduction to date has been statistical MT (SMT) (e.g., Koehn 2010; May 2012). Its successes beginning in the late 1980s and early 90s came as something of a surprise to the NLP community, which had been rather pessimistic about MT prospects ever since the report by Bar-Hillel (1960) and the ALPAC report (Pierce et al. 1966), negatively assessing the results of a major post-WW2 funding push in MT by the US government. MT came to be viewed as a large-scale engineering enterprise that would not have broad impacts until it could be adequately integrated with semantics and knowledge-based inference. The statistical approach emerged in the wake of successful application of “noisy channel” models to speech recognition in the late 1970s and during the 80s, and was propelled forward by new developments in machine learning and the increasing availability of large machine-readable linguistic corpora, including parallel texts in multiple languages.

The earliest and simplest type of translation method was word-based. This was grounded in the following sort of model of how a foreign-language sentence f (say, in French) is generated from an English sentence e (which we wish to recover, if the target language is English): First, e is generated according to some simple model of English, for instance one based on bigram frequencies. Individual words of e are then assumed to generate individual words of f with some probability, allowing for arbitrary word-order scrambling (or biased in some way). In learning such a model, the possible correspondences and word-translation probabilities can be estimated from parallel English-French corpora, whose sentences and words have been aligned by hand or by statistical techniques. Such a model can then be used for “decoding” a given French sentence f into an English sentence e by Bayesian inference—we derive e as the English sentence with highest posterior probability, given its French “encoding” as f. This is accomplished with dynamic programming algorithms, and might use an intermediate stage where the n best choices of e are computed (for some predetermined n), and subsequently re-ranked discriminatively using features of e and f ignored by the generative model.
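A schematic rendering of noisy-channel decoding may be helpful. The language-model and word-translation probabilities below are invented toy values, and the channel model simply tries all word alignments (feasible only for very short sentences); real decoders use dynamic programming rather than exhaustive search.

```python
from itertools import permutations

# Noisy-channel decoding sketch: choose the English sentence e maximizing
# P(e) * P(f | e).  All probabilities are illustrative toy values.
language_model = {"red wine": 0.01, "wine red": 0.0001}   # stand-in for P(e)
translation = {("vin", "wine"): 0.5, ("rouge", "red"): 0.6,
               ("vin", "red"): 0.01, ("rouge", "wine"): 0.01}

def channel_prob(f_words, e_words):
    """P(f | e) under a crude word-based model: each French word is generated
    by one English word, with free word-order scrambling (best alignment)."""
    best = 0.0
    for perm in permutations(e_words):
        p = 1.0
        for f, e in zip(f_words, perm):
            p *= translation.get((f, e), 1e-6)
        best = max(best, p)
    return best

def decode(f_sentence, candidates):
    f_words = f_sentence.split()
    return max(candidates,
               key=lambda e: language_model.get(e, 1e-8)
                             * channel_prob(f_words, e.split()))

best_e = decode("vin rouge", ["red wine", "wine red"])
```

Here the channel model is indifferent to English word order, so the language model is what favors “red wine” over “wine red”.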

However, the prevailing SMT systems (such as Google Translate or Yahoo! Babel Fish) are phrase-based rather than word-based. Here “phrase” refers to single words or groups of words that tend to occur adjacent to each other. The idea is that phrases are mapped to phrases, for example, the English word pair red wine to French phrases vin rouge, du vin rouge, or le vin rouge. Also, instead of assuming arbitrary word order scrambling, reordering models are used, according to which a given phrase may tend to be swapped with the left or right neighboring phrase or displaced from the neighbors, in the translation process. Furthermore, instead of relying directly on a Bayesian model, as in the word-based approach, phrase-based approaches typically use a log-linear model, allowing for incorporation of features reflecting not only the language model (such as trigram frequencies), the phrase translation model (such as phrase translation frequencies), and the reordering model, but also miscellaneous features such as the number of words created, the number of phrase translations used, and the number of phrase reorderings (with larger penalties for larger displacements).
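The log-linear scoring idea can be sketched as follows; the feature names, weights, and candidate feature values are all illustrative stand-ins for what a trained system would supply.

```python
import math

# Log-linear scoring of a candidate translation: a weighted sum of
# (log-scale) feature values.  Feature names and weights are illustrative.
weights = {"lm": 1.0, "phrase_trans": 0.8, "reordering": 0.6, "word_penalty": -0.2}

def loglinear_score(features):
    """score(e, f) = sum_i lambda_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

# Two hypothetical candidates with log-probability features and a word count:
cand_a = {"lm": math.log(0.01), "phrase_trans": math.log(0.3),
          "reordering": math.log(0.9), "word_penalty": 2}
cand_b = {"lm": math.log(0.0001), "phrase_trans": math.log(0.3),
          "reordering": math.log(0.9), "word_penalty": 2}

better = max([("a", cand_a), ("b", cand_b)],
             key=lambda kv: loglinear_score(kv[1]))[0]
```

In a real system the weights would be tuned discriminatively (e.g., to maximize a translation-quality metric on held-out data) rather than set by hand.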

While phrase-based SMT models have been quite successful, they are nonetheless prone to production of syntactically disfluent or semantically odd translations, and much recent research has sought to exploit linguistic structure and patterns of meaning to improve translation quality. Two major approaches to syntactic transfer are hierarchical phrase-based translation and tree-to-string (TTS) transduction models. Hierarchical phrase-based approaches use synchronous grammar rules, which simultaneously expand partial derivations of corresponding sentences in two languages. These are automatically induced from an aligned corpus, and the lowest hierarchical layer corresponds to phrase-to-phrase translation rules like those in ordinary phrase-based translation. While quite successful, this approach provides little assurance that “phrases” in the resulting synchronous grammars are semantically coherent units, in the linguistic sense. TTS models obtain better coherency through use of parsers trained on phrase-bracketed text corpora (treebanks). The encoding of English sentences into French (in keeping with our previously assumed language pair) is conceptualized as beginning with a parsed English sentence, which is then transformed by (learned) rules that progressively expand the original or partially transformed pattern of phrases and words until all the leaves are French words.

Apart from MT, another important type of sentence transduction is semantic parsing, in the sense of mapping sentences in some domain to logical forms usable for question answering. (Note that semantic role labeling, discussed above, can also be viewed as a step towards semantic parsing.) Several studies in this relatively recent area have employed supervised learning, based on training corpora annotated with LFs (e.g., Mooney 2007; Zettlemoyer & Collins 2007) or perhaps syntactic trees along with LFs (e.g., Ge and Mooney 2009). Typical domains have been QA about geography (where LFs are database queries), about Robocup soccer, or about travel reservations. Even unsupervised learning has been shown to be possible in restricted domains, such as QA based on medical abstracts (Poon and Domingos 2009) or the travel reservation domain (Poon 2013). Ideas used in this work include forming synonym clusters of nominal terms and verbal relations much as in Lin and Pantel's DIRT system, with creation of logical names (reflecting their word origins) for these concepts and relations; and learning (via Markov logic, a generalization of Markov networks) to annotate the nodes of dependency parse trees with database entities, types, and relations on the basis of a travel reservation dialogue corpus (where the data needed for the travel agent's answers are known to lie in the database). Whether such methods can be generalized to less restricted domains and forms of language remains to be seen. The recent creation of a general corpus annotated with an “abstract meaning representation”, AMR, is likely to foster progress in that direction (Banarescu et al. 2013).

The topics we have touched on in this section are technically complex, so that our discussion has necessarily been shallow. General references for statistical language processing are Manning and Schütze 1999 and Jurafsky and Martin 2009. Also the statistical NLP community has developed remarkably comprehensive toolkits for researchers, such as MALLET (MAchine Learning for LanguagE Toolkit), which includes brief explanations of many of the techniques.

What are the prospects for achieving human-like language learning in machines? There is a growing recognition that statistical learning will have to be linked to perceptual and conceptual modeling of the world. Recent work in the area of grounded language learning is moving in that direction. For example, Kim and Mooney (2012) describe methods of using sentences paired with graph-based descriptions of actions and contexts to hypothesize PCFG rules for parsing NL instructions into action representations, while learning rule probabilities with the inside-outside algorithm. However, they assumed a very restricted domain, and the question remains how far the modeling of perception, concept formation, and of semantic and episodic memory needs to be taken to support unrestricted language learning. As in the case of world knowledge acquisition by machines (see the preceding section), the modeling capabilities may need to achieve equivalence with those of a newborn, allowing for encoding percepts and ideas in symbolic and imagistic languages of thought, for taxonomizing entity types, recognizing animacy and intentionality, organizing and abstracting spatial relations and causal chains of events, and more. Providing such capabilities is likely to require, along with advances in our understanding of cognitive architecture, resolution of the very issues concerning the representation and use of linguistic, semantic, and world knowledge that have been the traditional focus in computational linguistics.

10. Applications

As indicated at the outset, applications of computational linguistics techniques range from those minimally dependent on linguistic structure and meaning, such as document retrieval and clustering, to those that attain some level of competence in comprehending and using language, such as dialogue agents that provide help and information in limited domains like personal scheduling, flight booking, or help desks, and intelligent tutoring systems. In the following we enumerate some of these applications. In several cases (especially machine translation) we have already provided considerable detail, but the intent here is to provide a bird's eye view of the state of the art, rather than technical elucidations.

With the advent of ubiquitous computing, it has become increasingly difficult to provide a systematic categorization of NLP applications: Keyword-based retrieval of documents (or snippets) and database access are integrated into some dialogue agents and many voice-based services; animated dialogue agents interact with users both in tutoring systems and games; chatbot techniques are incorporated into various useful or entertaining agents as back ends; and language-enabled robots, though distinctive in combining vision and action with language, are gradually being equipped with web access, QA abilities, tutorial functions, and no doubt eventually with collaborative problem solving abilities. Thus the application categories in the subsections that follow, rather than being mutually exclusive, are ever more intertwined in practice.

10.1 Machine translation (again)

One of the oldest MT systems is SYSTRAN, which was developed as a rule-based system beginning in the 1960s, and has been extensively used by US and European government agencies, and also in Yahoo! Babel Fish and (until 2007) in Google Translate. In 2010, it was hybridized with statistical MT techniques. As mentioned, Google Translate currently uses phrase-based MT, with English serving as an interlingua for the majority of language pairs. Microsoft's Bing Translator employs dependency structure analysis together with statistical MT. Other very comprehensive translation systems include Asia Online and WorldLingo. Many systems for small language groups exist as well, for instance for translating between Punjabi and Hindi (the Direct MT system), or between a few European languages (e.g., OpenLogos, IdiomaX, and GramTrans).

Translations remain error-prone, but their quality is usually sufficient for readers to grasp the general drift of the source contents. No more than that may be required in many cases, such as international web browsing (an application scarcely anticipated in decades of MT research). Also, MT applications on hand-held devices, designed to aid international travellers, can be sufficiently accurate for limited purposes such as asking directions or emergency help, interacting with transportation personnel, or making purchases or reservations. When high-quality translations are required, automatic methods can be used as an aid to human translators, but subtle issues may still absorb a large portion of a translator's time.

10.2 Document retrieval and clustering applications

Information retrieval has long been a central theme of information science, covering retrieval of both structured data such as are found in relational databases as well as unstructured text documents (e.g., Salton 1989). Retrieval criteria for the two types of data are not unrelated, since both structured and unstructured data often require content-directed retrieval. For example, while users of an employee database may wish at times to retrieve employee records by the unique name or ID of employees, at other times they may wish to retrieve all employees in a certain employment category, perhaps with further restrictions such as falling into a certain salary bracket. This is accomplished with the use of “inverted files” that essentially index entities under their attributes and values rather than their identifiers. In the same way, text documents might be retrieved via some unique label, or they might instead be retrieved in accord with their relevance to a certain query or topic header. The simplest notion of relevance is that the documents should contain the terms (words or short phrases) of the query. However, terms that are distinctive for a document should be given more weight. Therefore a standard measure of relevance, given a particular query term, is the tf–idf (term frequency–inverse document frequency) for the term, which increases (e.g., logarithmically) with the frequency of occurrences of the term in the document but is discounted to the extent that it occurs frequently in the set of documents as a whole. Summing the tf-idf's of the query terms yields a simple measure of document relevance.
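A minimal sketch of tf-idf relevance scoring (one of several common weighting variants) might look like this; the three-document “collection” is of course illustrative.

```python
import math

# tf-idf relevance scoring: a term's weight grows logarithmically with its
# in-document frequency and is discounted by its collection-wide frequency.
docs = ["the rods and cones of the eye",
        "fishing rods and reels",
        "the eye of the storm"]          # toy collection

def tfidf(term, doc, docs):
    tf = doc.split().count(term)                       # term frequency
    df = sum(1 for d in docs if term in d.split())     # document frequency
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(len(docs) / df)

def relevance(query, doc, docs):
    """Sum the tf-idf weights of the query terms in this document."""
    return sum(tfidf(t, doc, docs) for t in query.split())

scores = [relevance("rods cones eye", d, docs) for d in docs]
best_doc = docs[scores.index(max(scores))]
```

As expected, the document containing all three query terms, including the rare term “cones”, scores highest.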

Shortcomings of this method are first, that it underrates term co-occurrences if each term occurs commonly in the document collection (for instance, for the query “rods and cones of the eye”, co-occurrences of rods, cones, and eye may well characterize relevant documents, even though all three terms occur quite commonly in non-physiological contexts), and second, that relevant documents might have few occurrences of the query terms, while containing many semantically related terms. Some of the vector methods mentioned in connection with document clustering can be used to alleviate these shortcomings. We may reduce the dimensionality of the term-based vector space using LSA, obtaining a much smaller “concept space” in which many terms that tend to co-occur in documents will have been merged into the same dimensions (concepts). Thus sharing of concepts, rather than sharing of specific terms, becomes the basis for measuring relevance.
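The dimensionality-reduction step can be sketched with a truncated singular value decomposition, which is the core of LSA. Assuming NumPy is available, and using an invented four-term, three-document matrix:

```python
import numpy as np

# LSA sketch: reduce a term-document matrix to a low-rank "concept space"
# via truncated SVD, then compare documents by cosine similarity there.
# Rows = terms, columns = documents; the counts are illustrative.
terms = ["rod", "cone", "eye", "fish"]
X = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [1., 0., 2.],
              [0., 3., 0.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                      # number of retained "concepts"
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T     # documents in concept space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 2 share vision terms; document 1 is only about fish.
sim_02 = cos(doc_vecs[0], doc_vecs[2])
sim_01 = cos(doc_vecs[0], doc_vecs[1])
```

In the reduced space the two vision-related documents end up far more similar to each other than either is to the fish document, even though they do not share identical term counts.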

Document clustering is useful when large numbers of documents need to be organized for easy access to topically related items, for instance in collections of patent descriptions, medical histories or abstracts, legal precedents, or captioned images, often in hierarchical fashion. Clustering is also useful in exploratory data analysis (e.g., in exploring token occurrences in an unknown language), and indirectly supports various NLP applications because of its utility in improving language models, for instance in providing word clusters to be used for backing off from specific words in cases of data sparsity.

Clustering is widely used in other areas, such as biological and medical research and epidemiology, market research and grouping and recommendation of shopping items, educational research, social network analysis, geological analysis, and many others.

Document retrieval and clustering often serve as preliminary steps in information extraction (IE) or text mining, two overlapping areas concerned with extracting useful knowledge from documents, such as the main features of named entities (category, roles in relation to other entities, location, dates, etc.) or of particular types of events, or inferring rule-like correlations between relational terms (e.g., that purchasing of one type of product correlates with purchasing another).

We will not attempt to survey IE/text mining applications comprehensively, but the next two subsections, on summarization and sentiment analysis, are subareas of particular interest here because of their emphasis on the semantic content of texts.

10.3 Knowledge extraction and summarization

Extracting knowledge or producing summaries from unstructured text are ever more important applications, in view of the deluge of documents issuing forth from news media, organizations of every sort, and individuals. This unceasing stream of information makes it difficult to gain an overview of the items relevant to some particular purpose, such as basic data about individuals, organizations and consumer products, or the particulars of accidents, earthquakes, crimes, company take-overs, product maintenance and repair activities, medical research results, and so on.

One commonly used method in both knowledge extraction and certain types of “rote” summarization relies on the use of extraction patterns; these are designed to match the kinds of conventional linguistic patterns typically used by authors to express the information of interest. For example, text corpora or newswire might be mined for information about companies, by keying in on known company names and terms such as “Corp.”, “.com”, “headquartered at”, and “annual revenue of”, as well as parts of speech and dependency relations, and matching regular-expression patterns against local text segments containing key phrases or positioned close to them. As another example, summarization of earthquake reports might extract expected information such as the epicenter of the quake, its magnitude on the Richter scale, the time and duration of the event, affected population centers, extent of death tolls, injuries, and property damage, consequences such as fires and tsunamis, etc. Extraction patterns can usually be thought of as targeting particular attributes in predetermined attribute-value frames (e.g., a frame for company information or a frame for facts about an earthquake), and the filled-in frames may themselves be regarded as summaries, or may be used to generate natural-language summaries. Early systems of this type were FRUMP (DeJong 1982) and JASPER (Andersen et al. 1992). Among the hundreds of more modern extraction systems, a particularly successful one in competitions has been SRI's “Fastus” (Hobbs et al. 1997).
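A much-simplified sketch of pattern-based extraction into an attribute-value frame, using regular expressions keyed to conventional phrasings (the patterns and example text are illustrative):

```python
import re

# Pattern-based extraction sketch: regular expressions keyed to conventional
# phrasings fill the slots of an attribute-value frame for company facts.
text = ("Acme Corp., headquartered at 12 Elm St., Springfield, reported "
        "annual revenue of $4.2 million.")

patterns = {
    "company": re.compile(r"([A-Z][A-Za-z]+ (?:Corp|Inc|Ltd)\.)"),
    "hq":      re.compile(r"headquartered at ([^,]+, [^,]+)"),
    "revenue": re.compile(r"annual revenue of (\$[\d.]+ \w+)"),
}

frame = {}
for slot, pat in patterns.items():
    m = pat.search(text)
    if m:
        frame[slot] = m.group(1)   # fill the slot with the captured text
```

The filled frame can then serve directly as a “summary” of the passage, or feed a natural-language generator.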

Note that whether a pattern-based system is viewed as a knowledge extraction system or summarization system depends on the text it is applied to. If all the information of interest is bundled together in a single, extended text segment (as in the case of earthquake reports), then the knowledge extracted can be viewed as a summary of the segment. If instead the information is selectively extracted from miscellaneous sentences scattered through large text collections, with most of the material being ignored as irrelevant to the purposes of extraction, then we would view the activity of the system as information extraction rather than summarization.

When a document to be summarized cannot be assumed to fall into some predictable category, with the content structured and expressed in a stereotyped way, summarization is usually performed by selecting and combining “central sentences” from the document. A sentence is central to the extent that many other sentences in the document are similar to it, in terms of shared word content or some more sophisticated similarity measure such as one based on the tf-idf metric for terms, or a cosine metric in a dimensionality-reduced vector space (thus it is as if we were treating individual sentences as documents, and finding a few sentences whose “relevance” to the remaining sentences is maximal). However, simply returning a sequence of central sentences will not in general yield an adequate summary. For example, such sentences may contain unresolved pronouns or other referring expressions, whose referents may need to be sought in non-central sentences. Also, central “sentences” may actually be clauses embedded in lengthier sentences that contain unimportant supplementary information. Heuristic techniques need to be applied to identify and excise the extra material, and extracted clauses need to be fluently and coherently combined. In other cases, complex descriptions should be more simply and abstractly paraphrased. For example, an appropriate condensation of a sentence such as “The tornado carried off the roof of a local farmhouse, and reduced its walls and contents to rubble” might be “The tornado destroyed a local farmhouse.” But while some of these issues are partially addressed in current systems, human-like summarization will require much deeper understanding than is currently attainable. Another difficulty in this area (even more so than in machine translation) is the evaluation of summaries. Even human judgments differ greatly, depending, for instance, on the sensitivity of the evaluator to grammatical flaws, versus inadequacies in content.
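The centrality computation can be sketched as follows, using raw word-overlap cosine similarity rather than a tf-idf or reduced-space metric; the sentences and similarity measure are deliberately simplistic.

```python
import math
from collections import Counter

# Centrality-based extractive summarization sketch: score each sentence by
# its total cosine similarity to the others, and keep the top scorer.
sentences = [
    "the tornado destroyed a local farmhouse",
    "the tornado carried off the farmhouse roof",
    "residents described the storm as terrifying",
]

def vec(sentence):
    return Counter(sentence.split())       # bag-of-words vector

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centrality(i):
    return sum(cosine(vec(sentences[i]), vec(sentences[j]))
               for j in range(len(sentences)) if j != i)

central = max(range(len(sentences)), key=centrality)
```

Here the second sentence, which shares vocabulary with both of the others, emerges as most central; a real summarizer would then still face the reference-resolution and clause-excision problems noted above.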

10.4 Sentiment analysis

Sentiment analysis refers to the detection of positive or negative attitudes (or more specific attitudes such as belief or contempt) on the part of authors of articles or blogs towards commercial products, films, organizations, persons, ideologies, etc. This has become a very active area of applied computational linguistics, because of its potential importance for product marketing and ranking, social network analysis, political and intelligence analysis, classification of personality types or disorders based on writing samples, and other areas. The techniques used are typically based on sentiment lexicons that classify the affective polarity of vocabulary items, and on supervised machine learning applied to texts from which word and phrasal features have been extracted and that have been hand-labeled as expressing positive or negative attitudes towards some theme. Instead of manual labeling, existing data can sometimes be used to provide a priori classification information. For example, average numerical ratings of consumer products or movies produced by bloggers may be used to learn to classify unrated materials belonging to the same or similar genres. In fact, affective lexical categories and contrast relations may be learnable from such data; for example, frequent occurrences of phrases such as great movie or pretty good movie or terrible movie in blogs concerning movies with high, medium, and low average ratings may well suggest that great, pretty good, and terrible belong to a contrast spectrum ranging from a very positive to a very negative polarity. Such terminological knowledge can in turn boost the coverage of generic sentiment lexicons.
However, sentiment analysis based on lexical and phrasal features has obvious limitations, such as obliviousness to sarcasm and irony (“This is the most subtle and sensitive movie since The Texas Chainsaw Massacre”), quotation of opinions contrasting with the author's (“According to the ads, Siri is the greatest app since iTunes, but in fact …”), and lack of understanding of entailments (“You'll be much better off buying a pair of woolen undies for the winter than purchasing this item”). Thus researchers are attempting to integrate knowledge-based and semantic analysis with superficial word- and phrase-based sentiment analysis.
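A bare-bones sketch of lexicon-based polarity scoring, with a toy lexicon and a crude negation heuristic, illustrates the basic technique (and, by its very simplicity, the limitations just noted):

```python
# Lexicon-based sentiment scoring sketch: sum the polarity values of known
# affective words, with simple negation flipping.  The lexicon is illustrative.
lexicon = {"great": 2, "good": 1, "terrible": -2, "boring": -1}
negators = {"not", "never", "hardly"}

def sentiment(text):
    score, negate = 0, False
    for word in text.lower().split():
        if word in negators:
            negate = True          # flip the polarity of the next hit
            continue
        if word in lexicon:
            score += -lexicon[word] if negate else lexicon[word]
        negate = False             # negation scope: one following word
    return score

pos = sentiment("a great movie with a good script")
neg = sentiment("not good and frankly terrible")
```

Such a scorer would of course rate the ironic Texas Chainsaw Massacre example as neutral or positive, which is precisely why lexical methods are being supplemented with deeper analysis.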

10.5 Chatbots and companionable dialogue agents

Current chatbots are the descendants of Weizenbaum's ELIZA (see section 1.2), and are typically used (often with an animated “talking head” character) for entertainment, or to engage the interest of visitors to the websites of certain “dotcoms”. They may be equipped with large hand-crafted scripts (keyword-indexed input-response schemas) that enable them to answer simple inquiries about the company and its products, with some ability to respond to miscellaneous topics and to exchange greetings and pleasantries. A less benign application is the use of chatbots posing as visitors to social network sites, or interactive game sites, with the aim of soliciting private information from unwitting human participants, or recommending websites or products to them. As a result, many social networking sites have joined other bot-targeted sites in using CAPTCHAS to foil bot entry.

Companionable dialogue agents (also called relational agents) have so far relied rather heavily on chatbot techniques, i.e., authored input patterns and corresponding outputs. But the goal is to transcend these techniques, creating agents (often with talking heads or other animated characters) with personality traits and capable of showing emotion and empathy; they should have semantic and episodic memory, learning about the user over the long term and providing services to the user. Those services might include, besides companionship and support: advice in some areas of life, health and fitness, schedule maintenance, reminders, question answering, tutoring (e.g., in languages), game playing, and internet services. Yorick Wilks has suggested that ideally such characters would resemble “Victorian companions”, with such characteristics as politeness, discretion, modesty, cheerfulness, and well-informedness (Wilks 2010).

However, such goals are far from being achieved, as speech recognition, language understanding, reasoning and learning are not nearly far enough advanced. As a noteworthy example of the state of the art, we might mention the HWYD (“How Was Your Day”) system of Pulman et al. (2010), which won a best demonstration prize at an autonomous agents conference. The natural language processing in this system is relatively sophisticated. Shallow syntactic and semantic processing is used to find instantiations of some 30 “event templates”, such as ones for “argument at work between X and Y,” or “meeting with X about Y”. The interpretation process includes reference and ellipsis resolution, relying on an information state representation maintained by the dialogue manager. Goals generated by the dialogue manager lead to responses via planning, which involves instantiation and sequencing of response paradigms. The authors report the system's ability to maintain consistent dialogues extending over 20 minutes.

Systems of a rather different sort, aimed at clinically well-founded health counseling, have been under development as well. For example, the systems described in (Bickmore et al. 2011) rely on an extensive, carefully engineered formalization of clinically proven counseling strategies and knowledge, expressed within a description logic (OWL) and a goal-directed task description language. Such systems have proved to perform in a way comparable to human counselors. However, though dialogues are plan-driven, they ultimately consist of scripted system utterances paired with multiple-choice lists of responses offered to the client.

Thus companionable systems remain very constrained in the dialogue themes they can handle, their understanding of language, and their ability to bring extensive general knowledge to a conversation, let alone to use such knowledge inferentially.

10.6 Virtual worlds, games, and interactive fiction

Text-based adventure (quest) games, such as Dungeons and Dragons, Hunt the Wumpus (in its original version), and Advent began to be developed in the early and middle 1970s, and typically featured textual descriptions of the setting and challenges confronting the player, and allowed for simple command-line input from the player to select available actions (such as “open box”, “take sword” or “read note”). While the descriptions of the settings (often accompanied by pictures) could be quite elaborate, much as in adventure fiction, the input options available to the player were, and have largely remained, restricted to simple utterances of the sort that can be anticipated or collected in pre-release testing by the game programmers, and for which responses can be manually prepared. Certainly more flexible use of NL (“fend off the gremlin with the sword!”, “If I give you the gold, will you open the gate for me?”) would enliven the interaction between the player and the game world and the characters in it. In the 1980s and 90s text-based games declined in favor of games based primarily on graphics and animation, though an online interactive fiction community grew over the years that drove the evolution of effective interactive fiction development software. A highly touted program (in the year 2000) was Emily Short's ‘Galatea’, which enabled dialogue with an animated sculpture. However, this is still an elaborately scripted program, allowing only for inputs that can be heuristically mapped to one of various preprogrammed responses. Many games in this genre also make use of chatbot-like input-output response patterns in order to gain a measure of robustness for unanticipated user inputs.

The most popular PC video games in the 1990s and beyond were Robyn and Rand Miller's Myst, a first-person adventure game, and Maxis Software's The Sims, a life-simulation game. Myst, though relying on messages in books and journals, was largely nonverbal, and The Sims' chief developer, Will Wright, finessed the problem of natural language dialogue by having the game's inhabitants babble in Simlish, a nonsense language incorporating elements of Ukrainian, French and Tagalog.

Commercial adventure games and visual novels continue to rely on scripted dialogue trees—essentially branching alternative directions in which the dialogue can be expected to turn, with ELIZA-like technology supporting the alternatives. More sophisticated approaches to interaction between users and virtual characters are under development in various research laboratories, for example at the Center for Human Modeling and Simulation at the University of Pennsylvania, and the USC-affiliated Institute for Creative Technologies. While the dialogues in these scenarios are still based on carefully designed scripts, the interpretation of the user's spoken utterances exploits an array of well-founded techniques in speech recognition, dialogue management, and reasoning. Ongoing research can be tracked at venues such as IVA (Intelligent Virtual Agents), AIIDE (AI and Interactive Digital Entertainment), and AAMAS (Autonomous Agents and Multiagent Systems).

10.7 Natural language user interfaces

The topic of NL user interfaces subsumes a considerable variety of NL applications, ranging from text-based systems minimally dependent on understanding to systems with significant comprehension and inference capabilities in text- or speech-based interactions. The following subsections briefly survey a range of traditional and current application areas.

Text-based question answering

Text-based QA is practical to the extent that the types of questions being asked can be expected to have ready-made answers tucked away somewhere in the text corpora being accessed by the QA system. This has become much more feasible in this age of burgeoning internet content than a few decades ago, though questions still need to be straightforward, factual ones (e.g., “Who killed President Lincoln?”) rather than ones requiring inference (e.g., “In what century did Catherine the Great live?”, let alone “Approximately how many 8-foot 2-by-4s do I need to build a 4-foot high, 15-foot long picket fence?”).

Text-based QA begins with question classification (e.g., yes-no questions, who-questions, what-questions, when-questions, etc.), followed by information retrieval for the identified type of question, followed by narrowing of the search to paragraphs and finally sentences that may contain the answer to the question. The successive narrowing typically employs word and other feature matching, and ultimately dependency and role matching, and perhaps limited textual inference to verify answer candidates. Textual inference may, for instance, use WordNet hypernym knowledge to try to establish that a given candidate answer sentence supports the truth of the declarative version of the question. Since the chosen sentence(s) may contain irrelevant material and anaphors, it remains to extract the relevant material (which may also include supporting context) and generate a well-formed, appropriate answer. Many early text-based QA systems up to 1976 are discussed in Bourne & Hahn 2003. Later surveys (e.g., Maybury 2004) have tended to include the full spectrum of QA methods, but TREC conference proceedings (https://trec.nist.gov/) feature numerous papers on implemented systems for text-based QA.
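The classify-retrieve-narrow sequence described above can be rendered schematically. The toy corpus, the stop-word list, and the scoring by word overlap below are illustrative stand-ins for a real system's indexing, dependency matching, and feature weighting:

```python
# Schematic sketch of a text-based QA narrowing pipeline: classify the
# question by its wh-word, then rank corpus sentences by word overlap.
# The corpus and the stop-word list are toy examples, not a real system.

import re

CORPUS = [
    "John Wilkes Booth shot President Lincoln in 1865.",
    "Lincoln delivered the Gettysburg Address in 1863.",
]

STOPWORDS = {"who", "what", "when", "where", "did", "the", "in"}

def classify(question):
    """Crude question classification by leading wh-word."""
    q = question.lower()
    for wh in ("who", "what", "when", "where"):
        if q.startswith(wh):
            return wh
    return "yes-no"

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def best_answer_sentence(question, corpus=CORPUS):
    """Narrow the search to the sentence with greatest word overlap."""
    q_toks = tokens(question) - STOPWORDS
    return max(corpus, key=lambda s: len(tokens(s) & q_toks))
```

A real system would go on to verify the candidate (e.g., checking that the question type “who” matches a person-denoting constituent) and to extract just the answering phrase rather than the whole sentence.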

In open-domain QA, many questions are concerned with properties of named entities, such as birth date, birth place, occupation, and other personal attributes of well-known present and historical individuals, locations, ownership, and products of various companies, facts about consumer products, geographical facts, and so on. For answering such questions, it makes sense to pre-assemble the relevant factoids into a large knowledge base, using knowledge acquisition methods like those in section 8. Examples of systems containing an abundance of factoids about named entities are several developed at the University of Washington, storing factoids as text fragments, and various systems that map harvested factoids into RDF (Resource Description Framework) triples (see references in Other Internet Resources). Some of these systems obtain their knowledge not only from open information extraction and targeted relation extraction, but also from such sources as Wikipedia “infoboxes” and (controlled) crowdsourcing. Here we are also stretching the notion of question answering, since several of the mentioned systems require the use of key words or query patterns for retrieval of factoids.
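A factoid KB of the triple-storing kind can be illustrated with a minimal pattern-matching query function. The entities and relations below are invented examples, and real RDF stores are queried through standardized languages such as SPARQL rather than this toy interface:

```python
# Toy factoid store using subject-relation-object triples in the spirit
# of RDF. The particular entities and relations are invented examples.

TRIPLES = [
    ("Jimmy_Carter", "occupation", "politician"),
    ("Jimmy_Carter", "birth_place", "Plains,_Georgia"),
    ("Paris", "capital_of", "France"),
]

def query(subject=None, relation=None, obj=None):
    """Return all triples matching the pattern; None fields match anything."""
    return [
        (s, r, o) for (s, r, o) in TRIPLES
        if (subject is None or s == subject)
        and (relation is None or r == relation)
        and (obj is None or o == obj)
    ]
```

The need to phrase a request as a (subject, relation) pattern, rather than as free-form English, is precisely the sense in which such systems stretch the notion of question answering.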

From a general user perspective, it is unclear how much added benefit can be derived from such constructed KBs, given the remarkable ability of Google and other search engines to provide rapid answers even to such questions as “Which European countries are landlocked?” (typed without quotes—with quotes, Google finds the top answer using True Knowledge), or “How many Supreme Court justices did Kennedy appoint?” Nonetheless, both Google and Microsoft have recently launched vast “knowledge graphs” featuring thousands of relations among hundreds of millions of entities. The purpose is to provide direct answers (rather than merely retrieved web page snippets) to query terms and natural language questions, and to make inferences about the likely intent of users, such as purchasing some type of item or service.

Database front-ends

Natural-language front ends for databases have long been considered an attractive application of NLP technology, beginning with such systems as LUNAR (Woods et al. 1972) and REL (Thompson et al. 1969; Thompson & Thompson 1975). The attractiveness lies in the fact that retrieval and manipulation of information from a relational (or other uniformly structured) database can be assumed to be handled by an existing db query language and process. This feature sharply limits the kinds of natural language questions to be expected from a user, such as questions aimed at retrieving objects or tuples of objects satisfying given relational constraints, or providing summary or extremal properties (longest rivers, lowest costs, and the like) about them. It also greatly simplifies the interpretive process and question-answering, since the target logical forms—formal db queries—have a known, precise syntax and are executed automatically by the db management system, leaving only the work of displaying the computed results in some appropriate linguistic, tabular or graphical form.
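The mapping from a restricted fragment of English to a formal db query can be illustrated with a single toy pattern. The regular expression and table schema below are invented, and actual front ends such as LUNAR used full parsers and intermediate logical forms rather than surface patterns:

```python
# Toy illustration of an NL front end: one stylized English question
# pattern mapped to a formal database query. Pattern and schema are
# hypothetical; real systems parse a much richer fragment of English.

import re

# Hypothetical covered pattern: "which <noun>s are longer than <N> miles"
PATTERN = re.compile(r"which (\w+)s are longer than (\d+) miles", re.I)

def to_sql(question):
    """Return a db query for questions inside the covered fragment, else None."""
    m = PATTERN.match(question)
    if m is None:
        return None   # question falls outside the fragment the system handles
    table, length = m.group(1), int(m.group(2))
    return f"SELECT name FROM {table}s WHERE length_miles > {length};"
```

The `None` branch reflects the sharp limits noted above: anything outside the anticipated question forms simply cannot be interpreted, which is one reason such systems lacked the reliability of direct db access.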

Numerous systems have been built since then, aimed at applications such as navy data on ships and their deployment (Ladder: Hendrix et al. 1978), land-use planning (Damerau 1981), geographic QA (Chat-80: Pereira & Warren 1982), retrieval of company records and product records for insurance companies, oil companies, manufacturers, retailers, banks, etc. (Intellect: Harris 1984), compilation of statistical data concerning customers, services, assets, etc., of a company (Cercone et al. 1993), and many more (e.g., see Androutsopoulos & Ritchie 2000). However, the commercial impact of such systems has remained scant, because they have generally lacked the reliability and some of the functionalities of traditional db access.

Inferential (knowledge-based) question answering

We have noted certain limited inferential capabilities in text-based QA systems and NL front ends for databases, such as the ability to confirm entailment relations between candidate answers and questions, using simple sorts of semantic relations among the terms involved, and the ability to sort or categorize data sets from databases and compute averages or even create statistical charts.

However, such limited, specialized inference methods fall far short of the kind of general reasoning based on symbolic knowledge that has long been the goal in AI question answering. One of the earliest efforts to create a truly inferential QA system was the ENGLAW project of L. Stephen Coles (Coles 1972). ENGLAW was intended as a prototype of a kind of system that might be used by scientists and engineers to obtain information about physical laws. It featured a KB of axioms (in first-order logic) for 128 important physical laws, manually coded with the aid of a reference text. Questions (such as “In the Peltier Effect, does the heat developed depend on the direction of the electric current?”) were rendered into logic via a transformational grammar parser, and productions (aided by various Lisp functions) that map phrase patterns to logical expressions. The system was not developed to the point of practical usefulness, but its integration of reasoning and NLP technologies and its methods of selectively retrieving axioms for inferential QA were noteworthy contributions.

An example of a later larger-scale system aimed at practical goals was BBN's JANUS system (Ayuso et al. 1990). This was intended for naval battle management applications, and could answer questions about the locations, readiness, speed and other attributes of ships, allowing for change with the passage of time. It mapped English queries to a very expressive initial representation language with an “intension” operator to relate formulas to times and possible worlds, and this was in turn mapped into the NIKL description logic, which proved adequate for the majority of inferences needed for the targeted kinds of QA.

Jumping forward in time, we take note of the web-based Wolfram|Alpha (or WolframAlpha) answer engine, developed by Wolfram Research and consisting of 15 million lines of Mathematica code grounded in curated databases, models, and algorithms for thousands of different domains. (Mathematica is a mathematically oriented high-level programming language developed by the British scientist Stephen Wolfram.) The system is tilted primarily towards quantitative questions (e.g., “What is the GDP of France?”, or “What is the surface area of the Moon?”) and often provides charts and graphics along with more direct answers. The interpretation of English queries into functions applied to various known objects is accomplished with the pattern matching and symbol manipulation capabilities of Mathematica. However, the comprehension of English is not particularly robust at the time of writing. For example “How old was Lincoln when he died?”, “At what age did Lincoln die?” and other variants were not understood, though in many cases of misunderstanding, Wolfram|Alpha displays enough retrieved information to allow inference of an answer. A related shortcoming is that Wolfram|Alpha's quantitative skills are not supplemented with significant qualitative reasoning skills. For example, “Was Socrates a man?” (again, at the time of writing) prompts display of summary information about Socrates, including an image, but no direct answer to the question. Still, Wolfram|Alpha's quantitative abilities are not only interesting in stand-alone mode, but also useful as augmentations of search engines (such as Microsoft Bing) and of voice-based personal assistants such as Apple's Siri (see below).

Another QA system enjoying wide recognition because of its televised victory in the Jeopardy! quiz show is IBM's “Watson” (Ferrucci 2012; Ferrucci et al. 2010; Baker 2011). Like Wolfram|Alpha, this is in a sense a brute-force program, consisting of about a million lines of code in Java, C++, Prolog and other languages, created by a core team of 20 researchers and software engineers over the course of three years. The program runs 3000 processes in parallel on ninety IBM Power 750 servers, and has access to 200 million pages of content from sources such as WordNet, Wikipedia (and its structured derivatives YAGO and DBpedia), thesauri, newswire articles, and literary texts, amounting to several terabytes of human knowledge. (This translates into roughly 10^10 clausal chunks—a number likely to be around two orders of magnitude greater than the number of basic facts over which any one human being disposes.)

Rather than relying on any single method of linguistic or semantic analysis, or method of judging relevance of retrieved passages and textual “nuggets” therein, Watson applies multiple methods to the questions and candidate answers, including methods of question classification, focal entity detection, parsing, chunking, lexical analysis, logical form computation, referent determination, relation detection, temporal analysis, and special methods for question-answer pairs involving puns, anagrams, and other twists common in Jeopardy!. Different question analyses are used separately to retrieve relevant documents, and to derive, analyze and score potential answers from passages and sentences in those documents. In general, numerous candidate answers to a question are produced, and their analyses provide hundreds of features whose weights for obtaining ranked answers with corresponding confidence levels are learned by ML methods applied to a corpus of past Jeopardy! questions and answers (or officially, answers and questions, according to the peculiar conceit of the Jeopardy! protocol). Watson's wagers are based on the confidence levels of its potential answers and a complex regression model.
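The idea of combining many evidence features into a learned confidence score can be sketched as follows. The feature names and weights below are invented for illustration; Watson learns the weights of hundreds of features from corpora of past Jeopardy! material rather than fixing them by hand:

```python
# Highly simplified sketch of feature-weighted answer ranking: each
# candidate answer's evidence features are combined by a weighted sum
# passed through a logistic function to yield a confidence in (0, 1).
# Feature names and weights here are hypothetical, not Watson's.

import math

WEIGHTS = {"word_overlap": 1.5, "type_match": 2.0, "passage_count": 0.8}

def confidence(features, weights=WEIGHTS):
    """Logistic confidence score from a weighted sum of evidence features."""
    z = sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank(candidates):
    """candidates: {answer: feature dict}; return answers ordered by confidence."""
    return sorted(candidates, key=lambda a: confidence(candidates[a]), reverse=True)
```

In a trained system the weights would be fit (e.g., by logistic regression) so that correct past answers receive high confidence, which is also what grounds the wagering strategy mentioned above.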

How well does Watson fit under our heading of inferential, knowledge-based QA? Does it actually understand the questions and the answers it produces? Despite its impressive performance against Jeopardy! champions, Watson reasons, and understands English, in only very restricted senses. The program exploits the fact that the target of a Jeopardy! question is usually a named entity, such as Jimmy Carter, Islamabad, or Black Hole of Calcutta, though other types of phrases are occasionally targeted. Watson is likely to find multiple sentences that mention a particular entity of the desired type, and whose syntactic and semantic features are close to the features of the question, thereby making the named entity a plausible answer without real understanding of the question. For example, a “recent history” question asking for the president under whom the US gave full recognition to Communist China (Ferrucci 2012) might well zero in on such sentences as

Although he was the president who restored full diplomatic relations with China in 1978, Jimmy Carter has never visited that country… (New York Times, June 27, 1981)

or

Exchanges between the two countries' nuclear scientists had begun soon after President Jimmy Carter officially recognized China in 1978. (New York Times, Feb. 2, 2001)

While the links between such sentences and the correct answer are indirect (e.g., dependent on resolving “he” and “who” to Jimmy Carter, and associating “restored diplomatic relations” with “recognized”, and “Communist China” with “China”), correct analysis of those links is not a requirement for success—it is sufficient for the cluster of sentences favoring the answer Jimmy Carter (in virtue of their word and phrasal content and numerous other features) to provide a larger net weight to that answer than any competing clusters. This type of statistical evidence combination based on stored texts seems unlikely to provide a path to the kind of understanding that even first-graders betray in answering simple commonsense questions, such as “How do people keep from getting wet when it rains?”, or “If you eat a cookie, what happens to the cookie?” At the same time, vast data banks utilized in the manner of Watson can make up for inferential weakness in various applications, and IBM is actively redeveloping Watson as a resource for physicians, one that should be able to provide diagnostic and treatment possibilities that even specialists may not have at their fingertips. In sum, however, the goal of open-domain QA based on genuine understanding and knowledge-based reasoning remains largely unrealized.

Voice-based web services and assistants

Voice-based services, especially on mobile devices, are a rapidly expanding application area. Services range from organizers (for grocery lists, meeting schedules, reminders, contact lists, etc.), to in-car “infotainment” (routing, traffic conditions, hazard warnings, iTunes selection, finding nearby restaurants and other venues, etc.), to enabling use of other miscellaneous apps such as email dictation, dialing contacts, financial transactions, reservations and placement of orders, Wikipedia access, help-desk services, health advising, and general question answering. Some of these services (such as dialing and iTunes selection) fall into the category of hands-free controls, and such controls are becoming increasingly important in transport (including driverless or pilotless vehicles), logistics (deployment of resources), and manufacturing. Also, chatbot technology and companionable dialogue agents (as discussed in section 10.5) are serving as general backends to more specific voice-based services.

The key technology in these services is of course speech recognition, whose accuracy and adaptability have been gradually increasing. The least expensive, narrowly targeted systems (e.g., simple organizers) exploit strong expectations about user inputs to recognize, interpret and respond to those inputs; as such they resemble menu-driven systems. More versatile systems, such as car talkers that can handle routing, musical requests, searches for venues, etc., rely on more advanced dialogue management capabilities. These allow for topic switches and potentially for the attentional state of the user (e.g., delaying answering a driver's question if the driver needs to attend to a turn). The greatest current “buzz” surrounds advanced voice-based assistants, notably iPhone's Siri (followed by Android's Iris, True Knowledge's Evi, Google Now, and others). While previous voice control and dictation systems, like Android's Vlingo, featured many of the same functionalities, Siri adds personality and improved dialogue handling and service integration—users feel that they are interacting with a lively synthetic character rather than an app. Besides Nuance SR technology, Siri incorporates complex techniques that were to some extent pushed forward by the CALO (Cognitive Assistant that Learns and Organizes) project carried out by SRI International and multiple universities from 2003–2008 (Ambite et al. 2006; CALO [see Other Internet Resources]). These techniques include aspects of NLU, ML, goal-directed and uncertain inference, ontologies, planning, and service delegation. But while delegation to web services, including Wolfram|Alpha QA, or chatbot technology provides considerable robustness, and there is significant reasoning about schedules, purchasing and other targeted services, general understanding is still very shallow, as users soon discover. An anecdotal example of a serious misunderstanding is “Call me an ambulance” eliciting the response “From now on I will call you ‘an ambulance’”. However, the strong interest and demand in the user community generated by these early (somewhat) intelligent, quite versatile assistants is likely to intensify and accelerate research towards ever more life-like virtual agents, with ever more understanding and common sense.

10.8 Collaborative problem solvers and intelligent tutors

We discuss collaborative problem solving systems (also referred to as “mixed-initiative” or “task-oriented” dialogue systems) and tutorial dialogue systems (i.e., tutorial systems in which dialogue plays a pivotal role) under a common heading because both depend on rather deep representations or models of the domains they are aimed at, as well as of the mental state of the users they interact with.

However, we should immediately note that collaborative problem solving systems typically deal with much less predictable domain situations and user inputs than tutorial systems, and accordingly the former place much greater emphasis on flexible dialogue handling than the latter. For example, collaborators in emergency evacuation (Ferguson and Allen 1998, 2007) need to deal with a dynamically changing domain, at the same time handling the many dialogue states that may occur, depending on the participants' shared and private beliefs, goals, plans and intentions at any given point. By contrast, in a domain such as physics tutoring (e.g., Jordan et al. 2006; Litman and Silliman 2004), the learner can be guided through a network of learning goals with authored instructions, and corresponding to those goals, finite-state dialogue models can be designed that classify student inputs at each point in a dialogue and generate a prepared response likely to be appropriate for that input.

It is therefore not surprising that tutorial dialogue systems are closer to commercial practicality than collaborative problem solving systems for realistic applications, with demonstrated learning benefits relative to conventional instruction in various evaluations. Tutorial dialogue systems have been built for numerous domains and potential clienteles, ranging from K-12 subjects to computer literacy and novice programming, qualitative and quantitative physics, circuit analysis, operation of machinery, cardiovascular physiology, fire damage control on ships, negotiation skills, and more (e.g., see Boyer et al. 2009; Pon-Barry et al. 2006). Among the most successful tutorial systems are reading tutors (e.g., Mostow and Beck 2007; Cole et al. 2007), since the materials presented to the learner (in a “scaffolded” manner) are relatively straightforward to design in this case, and the responses of the learner, especially when they consist primarily of reading presented text aloud, are relatively easy to evaluate. For the more ambitious goal of fostering reading comprehension, the central problem is to design dialogues so as to make the learner's contributions predictable, while also making the interaction educationally effective (e.g., Aist and Mostow 2009).

Some tutoring systems, especially ones aimed at children, use animated characters to heighten the learner's sense of engagement. Such enhancements are in fact essential for systems aimed at learners with disabilities like deafness (where mouth and tongue movements of the virtual agent observed by the learner can help with articulation), autism, or aphasia (Massaro et al. 2012; Cole et al. 2007). As well, if tutoring is aimed specifically at training interpersonal skills, implementation of life-like characters (virtual humans) becomes an indispensable part of system development (e.g., Core et al. 2006; Campbell et al. 2011).

Modeling the user's state of mind in tutoring systems is primarily a matter of determining which of the targeted concepts and skills have, or have not yet, been acquired by the user, and diagnosing misunderstandings that are likely to have occurred, given the session transcript so far. Some recent experimental systems can also adapt their strategies to the user's apparent mood, such as frustration or boredom, as might be revealed by the user's inputs, tone of voice, or even facial expressions or gestures analyzed via computer vision. Other prototype systems can be viewed as striving towards more general mental modeling, by incorporating ideas and techniques from task-oriented dialogue systems concerning dialogue states, dialogue acts, and deeper language understanding (e.g., Callaway et al. 2007).

In task-oriented dialogue systems, as already noted, dialogue modeling is much more challenging, since such systems are expected not only to contribute to solving the domain problem at hand, but to understand the user's utterances, beliefs, and intentions, and to hold their own in a human-like, mixed-initiative dialogue. This requires domain models, general incremental collaborative planning methods, dialogue management that models rational communicative interaction, and thorough language understanding (especially intention recognition) in the chosen domain. Prototype systems have been successfully built for domains such as route planning, air travel planning, driver and pedestrian guidance, control and operation of external devices, emergency evacuation, and medication advising (e.g., Allen et al. 2006; Rich and Sidner 1998; Bühler and Minker 2011; Ferguson and Allen 1998, 2007), and these hold very significant practical promise. However, systems that can deal with a variety of reasonably complex problems, especially ones requiring broad commonsense knowledge about human cognition and behavior, still seem out of reach at this time.

10.9 Language-enabled robots

As noted at the beginning of section 10, robots are beginning to be equipped with web services, question answering abilities, chatbot techniques (for fall-back and entertainment), tutoring functions, and so on. The transfer of such technologies to robots has been slow, primarily because of the very difficult challenges involved in just equipping a robot with the hardware and software needed for basic visual perception, speech recognition, exploratory and goal-directed navigation (in the case of mobile robots), and object manipulation. However, the keen public interest in intelligent robots and their enormous economic potential (for household help, eldercare, medicine, education, entertainment, agriculture, industry, search and rescue, military missions, space exploration, and so on) will surely continue to energize the drive towards greater robotic intelligence and linguistic competence.

A good sense of the state of the art and of the difficulties in human-robot dialogue can be gained from Scheutz et al. (2011). Some of the dialogue examples presented there, concerning boxes and blocks, are reminiscent of Winograd's SHRDLU, but they also exhibit the challenges involved in real interaction, such as the changing scenery as the robot moves, speech recognition errors, disfluent and complex multi-clause utterances, perspective-dependent utterances (“Is the red box to the left of the blue box?”), and deixis (“Go down there”). In addition, all of this must be integrated with physical action planned so as to fulfill the instructions as understood by the robot. While the ability of recent robots to handle these difficulties to some degree is encouraging, many open problems remain, such as speech recognition in the presence of noise, broader linguistic coverage, better parsing and dialogue handling, adaptation to novel problems, mental modeling of the interlocutor and other humans in the environment, and greater general knowledge about the world and the ability to use it for inference and planning (both at the domain level and the dialogue level).

While task-oriented robot dialogues involve all these challenges, we should note that some potentially useful interactions with “talking” robots require little in the way of linguistic skills. For example, the Rubi robot described in Movellan et al. (2009) displayed objects on its screen-equipped “chest” to toddlers, asking them to touch and name the objects. This resulted in improved word learning by the toddlers, despite the simplicity of the interaction. Another example of a very successful talking robot with no real linguistic skills was the “museum tour guide” Rhino (Burgard et al. 1999). Unlike Rubi, it was able to navigate among unpredictably moving humans, and kept its audience engaged with its prerecorded messages and with a display of its current goals on a screen. In the same way, numerous humanoid robots (for example, Honda's Asimo) under past and present development across the world still understand very little language and rely mostly on scripted output. No doubt their utility and appeal will continue to grow, thanks to technologies like those mentioned above—games, companionable agent systems, voice-based apps, tutors, and so on; and these developments will also fuel progress on the deeper aspects of perception, motion, manipulation, and meaningful dialogue.

Bibliography

  • Aist, G. & J. Mostow, 2009, “Predictable and educational spoken dialogues: Pilot results,” in Proceedings of the 2009 ISCA Workshop on Speech and Language Technology in Education (SLaTE 2009), Birmingham, UK: University of Birmingham. [Aist & Mostow 2009 available online (pdf)]
  • Allen, J.F., 1995, Natural Language Understanding, Redwood City: Benjamin/Cummings.
  • Allen, J., W. de Beaumont, L. Galescu, J. Orfan, M. Swift, and C.M. Teng, 2013, “Automatically deriving event ontologies for a commonsense knowledge base,” in Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013), University of Potsdam, Germany, March 19–22. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Allen et al. 2013 available online (pdf)]
  • Allen, J., G. Ferguson, N. Blaylock, D. Byron, N. Chambers, M. Dzikovska, L. Galescu, and M. Swift, 2006, “Chester: Towards a personal medical advisor,” Biomedical Informatics, 39(5): 500–513.
  • Allen, J.F. and C.R. Perreault, 1980, “A plan-based analysis of indirect speech acts,” Computational Linguistics, 6(3–4): 167–182.
  • Ambite, J.-L., V.K. Chaudhri, R. Fikes, J. Jenkins, S. Mishra, M. Muslea, T. Uribe, and G. Yang, 2006, “Design and implementation of the CALO query manager,” 21st National Conference on Artificial Intelligence (AAAI-06), July 16–20, Boston, MA; Menlo Park, CA: AAAI Press, 1751–1758.
  • Andersen, P.M., P.J. Hayes, A.K. Huettner, L.M. Schmandt, I.B. Nirenburg, and S.P. Weinstein, 1992, “Automatic extraction of facts from press releases to generate news stories,” in Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLC '92), Trento, Italy, March 31–April 3. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 170–177. [Andersen et al. 1992 available online (pdf)]
  • Anderson, J., 1983, The Architecture of Cognition, Mahwah, NJ: Lawrence Erlbaum.
  • –––, 1993, Rules of the Mind, Hillsdale, NJ: Lawrence Erlbaum.
  • Anderson, J. & G. Bower, 1973, Human Associative Memory, Washington, DC: Winston.
  • Androutsopoulos, I. and G. Ritchie, 2000, “Database Interfaces,” in R. Dale, H. Somers, and H. Moisl (eds.), Handbook of Natural Language Processing, Chapter 9, Boca Raton, FL: CRC Press.
  • Asher, N. and A. Lascarides, 2003, Logics of Conversation (Studies in Natural Language Processing), New York: Cambridge University Press.
  • Auer, S., C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, 2007, “DBpedia: a nucleus for a web of open data,” in Proceedings of the 6th International Semantic Web Conference (ISWC 2007), Nov. 11–15, Busan, Korea. [Auer et al. 2007 available online (pdf)]
  • Austin, J.L., 1962, How to Do Things with Words: The William James Lectures Delivered at Harvard University in 1955, J.O. Urmson (ed.), Oxford: Clarendon Press.
  • Ayuso, D., M. Bates, R. Bobrow, M. Meteer, L. Ramshaw, V. Shaked, and R. Weischedel, 1990, “Research and development in natural language understanding as part of the strategic computing program,” BBN Report No. 7191, BBN Systems and Technologies, Cambridge, MA.
  • Baars, B.J., 1997, In the Theater of Consciousness: The Workspace of the Mind, New York: Oxford University Press.
  • Bach, E., 1976, “An extension of classical transformational grammar,” in Proceedings of the 1976 Conference on Linguistic Metatheory, Michigan State University, 183–224. [Bach 1976 available online]
  • Bach, E., R. Oehrle, and D. Wheeler (eds.), 1987, Categorial Grammars and Natural Language Structures, Dordrecht: D. Reidel.
  • Baker, S., 2011, Final Jeopardy, Boston: Houghton Mifflin Harcourt.
  • Banarescu, L., C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, and N. Schneider, 2013, “Abstract Meaning Representation for Sembanking,” in Proceedings of the 7th Linguistic Annotation Workshop & Interoperability with Discourse, Sofia, Bulgaria, Aug. 8–9. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Banarescu et al. 2013 available online (pdf)]
  • Banko, M., M.J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, 2007, “Open information extraction from the Web,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-07), Hyderabad, India, January 6–12. [Banko et al. 2007 available online (pdf)]
  • Bar-Hillel, Y., 1960, “The present status of automatic translation of languages,” Advances in Computers, 1: 91–163.
  • Barker, C., 2004, “Continuations in natural language,” in Hayo Thielecke (ed.), Proceedings of the 4th ACM SIGPLAN Continuations Workshop (CW'04), Venice, Italy, Jan. 17. Birmingham, UK: University of Birmingham. [Barker 2004 available online]
  • Barker, K., B. Porter, and P. Clark, 2001, “A library of generic concepts for composing knowledge bases,” in Proceedings of the 1st International Conference on Knowledge Capture, Victoria, B.C., Canada, October 21–23. New York, NY: ACM, pp. 14–21. [Barker, Porter, and Clark 2001 preprint available online]
  • Barnden, J.A., 2001, “Uncertainty and conflict handling in the ATT-Meta context-based system for metaphorical reasoning,” in Proceedings of the 3rd International Conference on Modeling and Using Context, V. Akman, P. Bouquet, R. Thomason, and R.A. Young (eds.), Lecture Notes in Artificial Intelligence, Vol. 2116, Berlin: Springer, pp. 15–29.
  • –––, 2006, “Artificial intelligence, figurative language, and cognitive linguistics,” in G. Kristiansen et al. (eds.), Cognitive Linguistics: Current Applications and Future Perspectives, Berlin: Mouton de Gruyter, 431–459.
  • Barwise, J. and R. Cooper, 1981, “Generalized quantifiers and natural language,” Linguistics and Philosophy, 4(2): 159–219.
  • Barwise, J., and J. Perry, 1983,Situations andAttitudes, Chicago: Universityof Chicago Press.
  • Bengio, Y., 2008, “Neural net language models,”Scholarpedia, 3(1): 3881. [Bengio 2008 available online]
  • Bickmore, T., D. Schulman, and C. Sidner, 2011, “Modeling the intentional structure of health behavior change dialogue,” Journal of Biomedical Informatics, 44: 183–197.
  • Bishop, C.M., 2006, Pattern Recognition and Machine Learning, New York: Springer.
  • Blutner, R., 2004, “Nonmonotonic inferences and neural networks”, Synthese, 141(2): 143–174.
  • Bobrow, D.G., 1968, “Natural language input for a computer problem-solving system,” in M. Minsky (ed.), Semantic Information Processing, Cambridge, MA: MIT Press, 146–226.
  • Boser, B.E., I.M. Guyon, and V.N. Vapnik, 1992, “A training algorithm for optimal margin classifiers,” in D. Haussler (ed.), 5th Annual ACM Workshop on COLT, Pittsburgh, PA: ACM Press, 144–152.
  • Bouaud, J., B. Bachimont, and P. Zweigenbaum, 1996, “Processing metonymy: a domain-model heuristic graph traversal approach,” in Proceedings of the 16th International Conference on Computational Linguistics (COLING'96), Center for Sprogteknologi, Copenhagen, Denmark, Aug. 5–9. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 137–142. [Bouaud, Bachimont, and Zweigenbaum 1996 available online (pdf)]
  • Bourne, C.P. and T.B. Hahn, 2003, A History of Online Information Services, 1963–1976, Cambridge, MA: MIT Press.
  • Boyer, K.E., E.Y. Ha, M.D. Wallis, R. Phillips, M.A. Vouk, and J.C. Lester, 2009, “Discovering tutorial dialogue strategies with Hidden Markov Models”, in Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED 2009), Brighton, U.K.: IOS Press, pp. 141–148.
  • Brakel, P. and S.L. Frank, 2009, “Strong systematicity in sentence processing by simple recurrent networks,” in N.A. Taatgen and H. van Rijn (eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society, July 30 – Aug. 1, VU University Amsterdam; Red Hook, NY: Curran Associates, Inc., pp. 1599–1604.
  • Brown, J.S. and R.R. Burton, 1975, “Multiple representations of knowledge for tutorial reasoning,” in D.G. Bobrow and A. Collins (eds.), Representation and Understanding, New York: Academic Press, 311–349.
  • Browne, A. and R. Sun, 2001, “Connectionist inference models”, Neural Networks, 14: 1331–1355.
  • Browne, A. and R. Sun, 1999, “Connectionist variable binding”, Expert Systems, 16(3): 189–207.
  • Bühler, D. and W. Minker, 2011, Domain-Level Reasoning for Spoken Dialogue Systems, Boston, MA: Springer.
  • Bunt, H.C., 1985, Mass Terms and Model-Theoretic Semantics, Cambridge, UK and New York: Cambridge University Press.
  • Burgard, W., A.B. Cremers, D. Fox, D. Hahnel, G. Lakemeyer, D. Schulz, W. Steiner, and S. Thrun, 1999, “Experiences with an interactive museum tour-guide robot,” Artificial Intelligence, 114(1–2): 3–55.
  • Bylander, T., 1994, “The computational complexity of propositional STRIPS planning,” Artificial Intelligence, 69: 165–204.
  • Callaway, C., M. Dzikovska, E. Farrow, M. Marques-Pita, C. Matheson, and J. Moore, 2007, “The Beetle and BeeDiff tutoring systems,” in Proceedings of the 2007 Workshop on Spoken Language Technology for Education (SLaTE), Farmington, PA, Oct. 1–3. Carnegie Mellon University and ISCA Archive. [Callaway et al. 2007 available online]
  • Campbell, J., M. Core, R. Artstein, L. Armstrong, A. Hartholt, C. Wilson, K. Georgila, F. Morbini, E. Haynes, D. Gomboc, M. Birch, J. Bobrow, H.C. Lane, J. Gerten, A. Leuski, D. Traum, M. Trimmer, R. DiNinni, M. Bosack, T. Jones, R.E. Clark, and K.A. Yates, 2011, “Developing INOTS to support interpersonal skills practice,” in Proceedings of the 32nd Annual IEEE Aerospace Conference (IEEEAC), Big Sky, MT, March 5–12, Institute of Electrical and Electronics Engineers (IEEE), 3222–3235.
  • Carbonell, J., 1980, “Metaphor—a key to extensible semantic analysis,” in N.K. Sondheimer (ed.), Proceedings of the 18th Meeting of the Association for Computational Linguistics (ACL'80), University of Pennsylvania, PA, June 19–22. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 17–21. [Carbonell 1980 available online (pdf)]
  • Carlson, G.N., 1977, Reference to Kinds in English, Doctoral Dissertation, University of Massachusetts, Amherst, MA. Also New York: Garland Publishing, 1980.
  • –––, 1982, “Generic terms and generic sentences,” Journal of Philosophical Logic, 11: 145–181.
  • –––, 2011, “Generics and habituals,” in C. Maienborn, K. von Heusinger, and P. Portner (eds.), Semantics: An International Handbook of Natural Language Meaning, Berlin: Mouton de Gruyter.
  • Carlson, G.N. and F.J. Pelletier, 1995, The Generic Book, Chicago: University of Chicago Press.
  • Carpenter, B., 1997, Type-Logical Semantics, Cambridge, MA: MIT Press.
  • Cercone, N., P. McFetridge, F. Popowich, D. Fass, C. Groeneboer, and G. Hall, 1993, “The systemX natural language interface: design, implementation, and evaluation,” Tech. Rep. CSS-IS TR 93-03, Centre for Systems Science, Simon Fraser University, Burnaby, BC, Canada.
  • Chambers, N. and D. Jurafsky, 2009, “Unsupervised learning of narrative schemas and their participants,” in Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-09), Singapore, Aug. 2–7. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Chambers and Jurafsky 2009 available online (pdf)]
  • Chater, N. and M. Oaksford (eds.), 2008, The Probabilistic Mind: Prospects for Rational Models of Cognition, Oxford: Oxford University Press.
  • Chen, P., W. Ding, C. Bowes, and D. Brown, 2009, “A fully unsupervised word sense disambiguation method using dependency knowledge,” in Proceedings of the Annual Conference of the North American Chapter of the ACL (NAACL'09), Boulder, CO, June. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 28–36. [Chen et al. 2009 available online (pdf)]
  • Chomsky, N., 1956, “Three models for the description of language,” IRE Transactions on Information Theory, 2: 113–124. [Chomsky 1956 available online (pdf)]
  • –––, 1957, Syntactic Structures, The Hague: Mouton.
  • Clark, A., C. Fox, and S. Lappin (eds.), 2010, The Handbook of Computational Linguistics and Natural Language Processing, Chichester, UK: Wiley Blackwell.
  • Clarke, Daoud, 2012, “A context-theoretic framework for compositionality in distributional semantics,” Computational Linguistics, 38(1): 41–71.
  • Cohen, A., 2002, “Genericity”, in F. Hamm and T.E. Zimmermann (eds.), Semantics, vol. 10, Hamburg: H. Buske Verlag, 59–89.
  • Cohen, S.B., K. Stratos, M. Collins, D.P. Foster, and L. Ungar, 2013, “Experiments with spectral learning of Latent-Variable PCFGs,” Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), June 9–13, Atlanta, GA. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Cohen et al. 2013 available online (pdf)]
  • Cohen, P.R. and R. Perreault, 1979, “Elements of a plan-based theory of speech acts,” Cognitive Science, 3(3): 177–212.
  • Cole, R., B. Wise, and S. van Vuuren, 2007, “How Marni teaches children to read,” Educational Technology, 47(1): 14–18.
  • Coles, L.S., 1972, “Techniques for Information Retrieval Using an Inferential Question-Answering System with Natural-Language Input”, Technical Note 74, SRI Project 8696, SRI International.
  • Conesa, J., V.C. Storey, and V. Sugumaran, 2010, “Usability of upper-level ontologies: the case of ResearchCyc,” Data & Knowledge Engineering, 69(4): 343–356.
  • Copestake, A., D. Flickinger, I. Sag, and C. Pollard, 2005, “Minimal Recursion Semantics: An introduction,” Research in Language and Computation, 3(2–3): 281–332.
  • Core, M., D. Traum, H.C. Lane, W. Swartout, S. Marsella, J. Gratch, and M. van Lent, 2006, “Teaching negotiation skills through practice and reflection with virtual humans,” Simulation: Transactions of the Society for Modeling and Simulation, 82: 685–701.
  • Cortes, C. and V.N. Vapnik, 1995, “Support-vector networks”, Machine Learning, 20(3): 273–297.
  • Cour, T., C. Jordan, E. Miltsakaki, and B. Taskar, 2008, “Movie/Script: alignment and parsing of video and text transcription,” European Conference on Computer Vision (ECCV), October, Marseille, France.
  • Crocker, M.W., 2010, “Computational Psycholinguistics”, in A. Clark, C. Fox, and S. Lappin (eds.), The Handbook of Computational Linguistics and Natural Language Processing, Chichester, UK: Wiley Blackwell.
  • Crouch, R., 1995, “Ellipsis and quantification: a substitutional approach,” in Proceedings of the European Chapter of the Association for Computational Linguistics (EACL'95), University College Dublin, March 27–31. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 229–236. [Crouch 1995 available online (pdf)]
  • Dagan, I., R. Bar-Haim, I. Szpektor, I. Greental, and E. Shnarch, 2008, “Natural language as the basis for meaning representation and inference,” in A. Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science 4919, Berlin: Springer, 151–170.
  • Dalrymple, M., S.M. Shieber, and F.C.N. Pereira, 1991, “Ellipsis and higher-order unification,” Linguistics and Philosophy, 14: 399–452.
  • Damásio, A.R., 1994, Descartes' Error: Emotion, Reason, and the Human Brain, Kirkwood, NY: Putnam Publishing.
  • Damerau, F.J., 1981, “Operating statistics for the Transformational Question Answering System,” American Journal of Computational Linguistics, 7(1): 30–42.
  • Davidson, D., 1967a, “The logical form of action sentences”, in N. Rescher (ed.), The Logic of Decision and Action, Pittsburgh, PA: University of Pittsburgh Press.
  • d'Avila Garcez, A.S., 2004, “On Gabbay's fibring methodology for Bayesian and neural networks”, in D. Gillies (ed.), Laws and Models in Science, workshop sponsored by the European Science Foundation (ESF), King's College Publications.
  • DeJong, G.F., 1982, “An overview of the FRUMP system,” in W.G. Lehnert and M.H. Ringle (eds.), Strategies for Natural Language Processing, Hillsdale, NJ: Erlbaum, 149–176.
  • Della Pietra, S., V. Della Pietra, and J. Lafferty, 1997, “Inducing features of random fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4): 380–393.
  • de Salvo Braz, R., R. Girju, V. Punyakanok, D. Roth, and M. Sammons, 2005, “An inference model for semantic entailment and question answering,” in Proceedings of the American Association for Artificial Intelligence (AAAI-05), Pittsburgh, PA, July 9–13. Menlo Park, CA: AAAI Press, pp. 1043–1049.
  • Dowty, D., 1991, “Thematic proto-roles and argument selection,” Language, 67(3): 547–619.
  • Duda, R.O. and P.E. Hart, 1973, Pattern Classification and Scene Analysis, New York: Wiley.
  • Duda, R.O., P.E. Hart, and D.G. Stork, 2001, Pattern Classification, New York: Wiley.
  • Dyer, M.G., 1983, In-Depth Understanding, Cambridge, MA: MIT Press.
  • Earley, J., 1970, “An efficient context-free parsing algorithm,” Communications of the ACM, 13(2): 94–102.
  • Faaborg, A., W. Daher, H. Lieberman, and J. Espinosa, 2005, “How to wreck a nice beach you sing calm incense,” in R. St. Amant, J. Riedl, and A. Jameson (eds.), Proceedings of the International Conference on Intelligent User Interfaces (IUI-05), San Diego, CA, January 10–13. ACM Press. [Faaborg et al. 2005 preprint available (pdf)]
  • Falkenhainer, B., K.D. Forbus, and D. Gentner, 1989, “The Structure-Mapping Engine: algorithm and examples,” Artificial Intelligence, 41: 1–63.
  • Fan, J., K. Barker, and B. Porter, 2009, “Automatic interpretation of loosely encoded input,” Artificial Intelligence, 173(2): 197–220.
  • Fass, D., 1991, “Met*: a method for discriminating metonymy and metaphor by computer,” Computational Linguistics, 17(1): 49–90.
  • Feldman, J.A., 2006, From Molecule to Metaphor: A Neural Theory of Language, Cambridge, MA: Bradford Books, MIT Press.
  • Feldman, J.A. and D.H. Ballard, 1982, “Connectionist models and their properties,” Cognitive Science, 6: 205–254.
  • Ferguson, G. and J.F. Allen, 1998, “TRIPS: An integrated intelligent problem-solving assistant,” in Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98). Menlo Park, CA: AAAI Press, pp. 567–573.
  • –––, 2007, “Mixed-initiative systems for collaborative problem solving,” AI Magazine, 28(2): 23–32.
  • Ferrucci, D.A., 2012, “This is Watson,” IBM Journal of Research and Development, 56(3–4).
  • Ferrucci, D., E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A.A. Kalyanpur, A. Lally, J.W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty, 2010, “Building Watson: An overview of the DeepQA project,” AI Magazine, 31(3): 59–79.
  • Fine, A.B., T.F. Jaeger, T.A. Farmer, and T. Qian, 2013, “Rapid expectation adaptation during syntactic comprehension,” PLoS ONE, 8(1): e77661.
  • Fleischman, M. and D. Roy, 2005, “Intentional context in situated language learning,” in 9th Conference on Computational Natural Language Learning (CoNLL-2005), Ann Arbor, MI, June 29–30. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Fleischman and Roy 2005 available online (pdf)]
  • Gärdenfors, P., 2000, Conceptual Spaces: The Geometry of Thought, Cambridge, MA: MIT Press.
  • Ge, R. and R.J. Mooney, 2009, “Learning a compositional semantic parser using an existing syntactic parser,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009), Suntec, Singapore, August. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 611–619. [Ge and Mooney 2009 available online (pdf)]
  • Geib, C.W., 2004, “Assessing the complexity of plan recognition,” in Proceedings of the 19th National Conference on Artificial Intelligence (AAAI'04), San Jose, CA, July 25–29. Menlo Park, CA: AAAI Press, pp. 507–512.
  • Glickman, O. and I. Dagan, 2005, “A probabilistic setting and lexical cooccurrence model for textual entailment,” in Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, MI, June 30. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Glickman and Dagan 2005 available online (pdf)]
  • Gluck, M.A. and D.E. Rumelhart, 1990, Neuroscience and Connectionist Theory, Hillsdale, NJ: Lawrence Erlbaum.
  • Goldberg, A.E., 2003, “Constructions: a new theoretical approach to language,” Trends in Cognitive Sciences, 7(5): 219–224.
  • Goldman, R.P. and E. Charniak, 1991, “Probabilistic text understanding,” in Proceedings of the 3rd International Workshop on AI and Statistics, Fort Lauderdale, FL. Also published in D.J. Hand (ed.), 1993, Artificial Intelligence Frontiers in Statistics: AI and Statistics III, London, UK: Chapman & Hall.
  • Gordon, J. and L.K. Schubert, 2010, “Quantificational sharpening of commonsense knowledge,” Common Sense Knowledge Symposium (CSK-10), AAAI 2010 Fall Symposium Series, November 11–13, Arlington, VA, AAAI Technical Report FS-10-02, Menlo Park, CA: AAAI Press.
  • Gregory, H. and S. Lappin, 1997, “A computational model of ellipsis resolution,” in Proceedings of the Conference on Formal Grammar (Linguistic Aspects of Logical and Computational Perspectives on Language), 9th ESSLLI, Aix-en-Provence, France: European Summer School, August 11–27.
  • Grice, H.P., 1968, “Utterer's meaning, sentence meaning and word meaning,” Foundations of Language, 4: 225–242.
  • Groenendijk, J. and M. Stokhof, 1991, “Dynamic predicate logic,” Linguistics and Philosophy, 14(1): 39–100.
  • Grosz, B.J. and C.L. Sidner, 1986, “Attention, intentions, and the structure of discourse,” Computational Linguistics, 12(3, July–September): 175–204.
  • Hadamard, J., 1945, The Psychology of Invention in the Mathematical Field, Princeton, NJ: Princeton University Press.
  • Haghighi, A. and D. Klein, 2010, “Coreference resolution in a modular, entity-centered model,” in Proceedings of the Annual Conference of the North American Chapter of the ACL (HLT-NAACL 2010), Los Angeles, June. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 385–393. [Haghighi and Klein 2010 available online (pdf)]
  • Hardt, D., 1997, “An empirical approach to VP ellipsis,” Computational Linguistics, 23(4): 525–541.
  • Harris, L.R., 1984, “Experience with INTELLECT: Artificial Intelligence technology transfer,” AI Magazine, 5(2): 43–50.
  • Havasi, C., R. Speer, and J. Alonso, 2007, “ConceptNet 3: a flexible, multilingual semantic network for common sense knowledge,” in N. Nicolov, G. Angelova, and R. Mitkov (eds.), Proceedings of Recent Advances in Natural Language Processing (RANLP-07), Borovets, Bulgaria, Sept. 27–29. Amsterdam: John Benjamins.
  • Hearst, M., 1992, “Automatic acquisition of hyponyms from large text corpora,” in Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), Nantes, France, Aug. 23–28. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 539–545. [Hearst 1992 available online (pdf)]
  • Heim, I.R., 1982, The Semantics of Definite and Indefinite Noun Phrases, Doctoral dissertation, University of Massachusetts, Amherst.
  • Henderson, J., 1994, “Connectionist syntactic parsing using temporal variable binding,” Journal of Psycholinguistic Research, 23(5): 353–379.
  • Hendrix, G.G., E.D. Sacerdoti, D. Sagalowicz, and J. Slocum, 1978, “Developing a natural language interface to complex data,” ACM Transactions on Database Systems, 3(2): 105–147.
  • Hewitt, C., 1969, “PLANNER: A language for proving theorems in robots,” in D.E. Walker and L.M. Norton (eds.), Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'69), Washington, D.C., May 7–9. Los Altos, CA: William Kaufmann, pp. 295–301.
  • Hobbs, J.R., 1979, “Coherence and coreference,” Cognitive Science, 3(1): 67–90.
  • –––, 2003, “The logical notation: Ontological promiscuity,” in J.R. Hobbs, Discourse and Inference (in progress). [Hobbs 2003 preprint available online]
  • Hobbs, J.R., D.E. Appelt, J. Bear, M. Kameyama, M.E. Stickel, and M. Tyson, 1997, “FASTUS: A cascaded finite-state transducer for extracting information from natural-language text,” in E. Roche and Y. Schabes (eds.), Finite-State Language Processing, Cambridge, MA: MIT Press, 383–406.
  • Hobbs, J.R. and A. Gordon, 2005, “Encoding knowledge of commonsense psychology,” in 7th International Symposium on Logical Formalizations of Commonsense Reasoning (Commonsense 2005), Corfu, Greece, May 22–24. [Hobbs and Gordon 2005 available online (pdf)]
  • Hobbs, J.R., M.E. Stickel, D.E. Appelt, and P. Martin, 1993, “Interpretation as abduction,” Artificial Intelligence, 63: 69–142.
  • Hoffmann, R., S. Amershi, K. Patel, F. Wu, J. Fogarty, and D. Weld, 2009, “Amplifying community content creation with mixed-initiative information extraction,” ACM Conference on Human Factors in Computing Systems (CHI 2009), Boston, MA, April 4–9. New York: ACM Press.
  • Hofstadter, D.R. and the Fluid Analogy Research Group, 1995, Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought, New York: Basic Books.
  • Hovy, E., 1988, “Planning coherent multisentential text,” in Proceedings of the 26th Annual Meeting of the ACL (ACL'88), Buffalo, NY. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 179–186. [Hovy 1988 available online (pdf)]
  • Humphrey, N., 1992, A History of the Mind: Evolution and the Birth of Consciousness, New York: Simon & Schuster.
  • Ide, N. and J. Véronis, 1994, “Knowledge extraction from machine-readable dictionaries: An evaluation”, in P. Steffens (ed.), Machine Translation and the Lexicon, Berlin: Springer-Verlag.
  • Jackendoff, R.S., 1990, Semantic Structures, Cambridge, MA: MIT Press.
  • Johnson-Laird, P.N., 1983, Mental Models: Toward a Cognitive Science of Language, Inference and Consciousness, Cambridge, MA: Harvard University Press.
  • Johnston, B. and M.-A. Williams, 2009, “Autonomous learning of commonsense simulations,” in Proceedings of Commonsense 2009, Toronto, Canada, June 1–3. Commonsense Reasoning. [Johnston and Williams 2009 available online (pdf)]
  • Jordan, P., M. Makatchev, U. Pappuswamy, K. VanLehn, and P. Albacete, 2006, “A natural language tutorial dialogue system for physics,” in G.C.J. Sutcliffe and R.G. Goebel (eds.), Proceedings of the 19th International Florida Artificial Intelligence Research Society (FLAIRS-06). Menlo Park, CA: AAAI Press.
  • Jurafsky, D. and J.H. Martin, 2009, Speech and Language Processing, 2nd edition, Upper Saddle River, NJ: Pearson Higher Education, Prentice-Hall; original edition, 2000.
  • Kamp, H., 1981, “A theory of truth and semantic representation,” in J. Groenendijk, T. Janssen, and M. Stokhof (eds.), Formal Methods in the Study of Language, Amsterdam: Mathematics Center.
  • Kasabov, N., 1996, Foundations of Neural Networks, Fuzzy Systems and Knowledge Engineering, Cambridge, MA: MIT Press.
  • Kecman, V., 2001, Learning and Soft Computing, Cambridge, MA: MIT Press.
  • Kim, J. and R.J. Mooney, 2012, “Unsupervised PCFG induction for grounded language learning with highly ambiguous supervision,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing and Natural Language Learning (EMNLP-CoNLL '12), July 12–14, Jeju, Korea. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Kim and Mooney 2012 available online (pdf)]
  • Koehn, Philipp, 2010, Statistical Machine Translation, Cambridge, UK: Cambridge University Press.
  • Kosslyn, S.M., 1994, Image and Brain: The Resolution of the Imagery Debate, Cambridge, MA: MIT Press.
  • Kuhlmann, M., 2013, “Mildly non-projective dependency grammar,” Computational Linguistics, 39(2): 355–387.
  • Lafferty, J., A. McCallum, and F. Pereira, 2001, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in C.E. Brodley and A. Pohoreckyj Danyluk (eds.), International Conference on Machine Learning (ICML), Williams College, MA, June 28–July 1. San Francisco: Morgan Kaufmann.
  • Lakoff, G. and M. Johnson, 1980, Metaphors We Live By, Chicago: University of Chicago Press.
  • Landman, F., 1991, Structures for Semantics, SLAP 45, Dordrecht: Kluwer.
  • Landman, F., 1989, “Groups I & II”, Linguistics and Philosophy, 12(5): 559–605 and 12(6): 723–744.
  • –––, 2000, Events and Plurality, Dordrecht: Kluwer.
  • Lenat, D., 1995, “CYC: A large-scale investment in knowledge infrastructure,” Communications of the ACM, 38(11): 33–38.
  • Lewis, D.K., 1970, “General semantics,” Synthese, 22: 18–67. Reprinted in D. Davidson and G. Harman (eds.), 1972, Semantics of Natural Language, Dordrecht: D. Reidel.
  • Lieberman, H., H. Liu, P. Singh, and B. Barry, 2004, “Beating some common sense into interactive applications,” AI Magazine, 25(4): 63–76.
  • Lin, D. and P. Pantel, 2001, “DIRT—Discovery of Inference Rules from Text,” in Proceedings of the 7th ACM Conference on Knowledge Discovery and Data Mining (KDD-2001), San Francisco, CA, August 26–29. New York: ACM Digital Library, pp. 323–328.
  • Lindsay, R., 1963, “Inferential memory as the basis of machines which understand natural language”, in E. Feigenbaum and J. Feldman (eds.), Computers and Thought, New York: McGraw-Hill.
  • Link, G., 1983, “The logical analysis of plurals and mass terms: a lattice-theoretical approach”, in R. Bauerle, C. Schwarze, and A. von Stechow (eds.), Meaning, Use, and Interpretations of Language, Berlin: de Gruyter.
  • Litman, D.J. and S. Silliman, 2004, “ITSPOKE: An intelligent tutoring spoken dialogue system,” in Proceedings of the Human Language Technology Conference: 4th Meeting of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL-04). Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 233–236. [Litman and Silliman 2004 available online (pdf)]
  • MacCartney, B. and C.D. Manning, 2009, “An extended model of natural logic,” in Proceedings of the 8th International Conference on Computational Semantics (IWCS-8), Tilburg University, Netherlands. Stroudsburg, PA: Association for Computational Linguistics (ACL). [MacCartney and Manning 2009 available online (pdf)]
  • Manger, R., V.L. Plantamura, and B. Soucek, 1994, “Classification with holographic neural networks,” in V.L. Plantamura, B. Soucek, and G. Visaggio (eds.), Frontier Decision Support Concepts, New York: John Wiley and Sons, 91–106.
  • Mann, W. and S.A. Thompson, 1987, “Rhetorical structure theory: description and construction of text structures,” in G. Kempen (ed.), Natural Language Generation: Recent Advances in Artificial Intelligence, Psychology, and Linguistics, Dordrecht: Kluwer Academic Publishers, 85–96.
  • Mann, W.C. and S.A. Thompson, 1988, “Rhetorical Structure Theory: toward a functional theory of text organization,” Text, 8(3): 243–281.
  • Manning, C.D. and H. Schütze, 1999, Foundations of Statistical Natural Language Processing, Cambridge, MA: MIT Press.
  • Markert, K. and M. Nissim, 2007, “SemEval-2007 Task 08: metonymy resolution at SemEval-2007,” in Proceedings of SemEval 2007, Prague. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Markert and Nissim 2007 available online (pdf)]
  • Martin, J.H., 1990, A Computational Model of Metaphor Interpretation, New York: Academic Press.
  • Massaro, D.W., M.M. Cohen, M. Tabain, J. Beskow, and R. Clark, 2012, “Animated speech: Research progress and applications,” in G. Bailly, P. Perrier, and E. Vatiokis-Bateson (eds.), Audiovisual Speech Processing, Cambridge, UK: Cambridge University Press, 309–345.
  • May, A., 2012, “Machine translation,” chapter 19 of A. Clark, C. Fox, and S. Lappin (eds.), The Handbook of Computational Linguistics and Natural Language Processing, Chichester, UK: Wiley Blackwell.
  • Mayberry, III, M.R. and R. Miikkulainen, 2008, “Incremental nonmonotonic sentence interpretation through semantic self-organization,” Technical Report AI08-12, Dept. of Computer Science, University of Texas, Austin, TX.
  • Maybury, M.T. (ed.), 2004, New Directions in Question Answering, Cambridge, MA: AAAI and MIT Press.
  • McCarthy, J., 1990, “First order theories of individual concepts and propositions,” in J. McCarthy and V. Lifschitz (eds.), Formalizing Common Sense: Papers by John McCarthy, 119–141. (An earlier version appeared in J.E. Hayes, D. Michie, and L. Mikulich (eds.), 1979, Machine Intelligence 9, Chichester/Halsted, New York: Ellis Norwood, 129–148. See also the most recent version, 2000.)
  • McClain, M. and S. Levinson, 2007, “Semantic based learning of syntax in an autonomous robot,” International Journal of Humanoid Robotics, 4(2): 321–346.
  • McCord, M., 1986, “Focalizers, the scoping problem, and semantic interpretation rules in logic grammars,” in M. Van Caneghem and D.H.D. Warren (eds.), Logic Programming and its Applications, Norwood, NJ: Ablex, 223–243.
  • McKeown, K.R., 1985, Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text, Cambridge, UK: Cambridge University Press.
  • Minsky, M., 1968, Semantic Information Processing, Cambridge, MA: MIT Press.
  • Moldovan, D.I. and V. Rus, 2001, “Logic form transformation of WordNet and its applicability to question answering,” in Proceedings of ACL 2001, Toulouse, France, July 6–11. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 402–409. [Moldovan and Rus 2001 available online (pdf)]
  • Montague, R., 1970, “English as a formal language,” in B. Visentini et al. (eds.), Linguaggi nella società e nella tecnica, Milan: Edizioni di Comunità, 189–224.
  • –––, 1973, “The proper treatment of quantification in ordinary English,” in K.J.J. Hintikka, J.M.E. Moravcsik, and P. Suppes (eds.), Approaches to Natural Language, Dordrecht: D. Reidel.
  • Mooney, R.J., 2007, “Learning for semantic parsing,” in A. Gelbukh (ed.), Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2007), Mexico City, February. Berlin: Springer, pp. 311–324.
  • Moore, J.D. and C.L. Paris, 1988, “Constructing coherent texts using rhetorical relations,” in Proceedings of the 10th Annual Conference of the Cognitive Science Society (COGSCI-88), Montreal, Canada, August 17–19. New York: Taylor & Francis, pp. 199–204.
  • –––, 1993, “Planning texts for advisory dialogues: capturing intentional and rhetorical information,” Computational Linguistics, 19(4): 651–694.
  • Mostow, J. and J. Beck, 2007, “When the rubber meets the road: Lessons from the in-school adventures of an automated reading tutor that listens,” in B. Schneider and S.-K. McDonald (eds.), Scale-Up in Education (Vol. 2), Lanham, MD: Rowman & Littlefield Publishers, 183–200.
  • Movellan, J.R., M. Eckhardt, M. Virnes, and A. Rodriguez, 2009, “Sociable robot improves toddler vocabulary skills,” in Proceedings of the International Conference on Human Robot Interaction (HRI 2009), San Diego, CA, March 11–13. New York: ACM Digital Library.
  • Narayanan, S., 1997, KARMA: Knowledge-based Action Representations for Metaphor and Aspect, Doctoral thesis, U.C. Berkeley, CA.
  • Newell, A. and H.A. Simon, 1976, “Computer science as empirical inquiry: symbols and search,” Communications of the ACM, 19(3): 113–126.
  • Nielsen, L.A., 2004, “Verb phrase ellipsis detection using automatically parsed text”, in Proceedings of the 20th International Conference on Computational Linguistics (COLING'04), Geneva, Switzerland. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Nielsen 2004 available online (pdf)]
  • Norman, D.A., D.E. Rumelhart, and the LNR Research Group, 1975, Explorations in Cognition, New York: W.H. Freeman and Company.
  • Onyshkevych, B., 1998, “Nominal metonymy processing,” COLING-ACL Workshop on the Computational Treatment of Nominals, Aug. 10–14, Quebec, Canada. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Onyshkevych 1998 available online (pdf)]
  • Paivio, A., 1986, Mental Representations: A Dual Coding Approach, Oxford, England: Oxford University Press.
  • Palmer, M., D. Gildea, and N. Xue, 2010, Semantic Role Labeling, San Rafael, CA: Morgan & Claypool.
  • Palmer-Brown, D., J.A. Tepper, and H.M. Powell, 2002, “Connectionist natural language parsing,” Trends in Cognitive Sciences, 6(October): 437–442.
  • Pantel, P., R. Bhagat, B. Coppola, T. Chklovski, and E. Hovy, 2007, “ISP: Learning Inferential Selectional Preferences,” in Proceedings of North American Association for Computational Linguistics / Human Language Technology (NAACL-HLT'07), Rochester, NY. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 564–571. [Pantel et al. 2007 available online (pdf)]
  • Parsons, T., 1990, Events in the Semantics of English: A Study in Subatomic Semantics, Cambridge, MA: MIT Press.
  • Pereira, F.C.N. and D.H.D. Warren, 1982, “An efficient easily adaptable system for interpreting natural language queries,” American Journal of Computational Linguistics, 8(3–4): 110–122.
  • Pierce, J.R., J.B. Carroll, et al., 1966, Language and Machines—Computers in Translation and Linguistics, ALPAC report, National Academy of Sciences, National Research Council, Washington, DC.
  • Pinker, S., 1994, The Language Instinct, New York: Harper.
  • –––, 2007, The Stuff of Thought, New York: Penguin Group.
  • Plate, T.A., 2003, Holographic Reduced Representation: Distributed Representation for Cognitive Structures, Stanford, CA: CSLI Publications.
  • Poincaré, H., 1913, “Mathematical creation,” in H. Poincaré (with an introduction by J. Royce), The Foundations of Science: Science and Method, Book I (Science and the Scientist), chapter III, New York: The Science Press.
  • Pon-Barry, H., K. Schultz, E. Owen Bratt, B. Clark, and S. Peters, 2006, “Responding to student uncertainty in spoken tutorial dialogue systems,” International Journal of Artificial Intelligence in Education, 16(2): 171–194.
  • Poon, H., 2013, “Grounded unsupervised semantic parsing,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL-13), Sofia, Bulgaria, Aug. 4–9. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Poon 2013 available online (pdf)]
  • Poon, H. and P. Domingos, 2009, “Unsupervised semantic parsing,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, Aug. 6–7. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Poon and Domingos 2009 available online (pdf)]
  • Pullum, G., 2011, “Remarks by Noam Chomsky in London”, Linguist List, November 14, 2011, 22.4631. [Pullum 2011 available online]
  • Pulman, S., J. Boye, M. Cavazza, C. Smith, and R. Santos de la Cámara, 2010, “How was your day?” in Proceedings of the 2010 Workshop on Companionable Dialogue Systems, Uppsala, Sweden, July. Stroudsburg, PA: Association for Computational Linguistics, 37–42.
  • Quillian, M.R., 1968, “Semantic memory,” in M. Minsky (ed.), Semantic Information Processing, Cambridge, MA: MIT Press, 227–270.
  • Ramsey, W., S.P. Stich, and J. Garon, 1991, “Connectionism, eliminativism and the future of folk psychology,” in W. Ramsey, S.P. Stich, and D.E. Rumelhart (eds.), Philosophy and Connectionist Theory, Hillsdale, NJ: Lawrence Erlbaum.
  • Rapaport, W.J., 1995, “Understanding understanding: Syntactic semantics and computational cognition,” in J.E. Tomberlin (ed.), AI, Connectionism, and Philosophical Psychology, Philosophical Perspectives, Vol. 9, Atascadero, CA: Ridgeview Publishing, 49–88.
  • Raphael, B., 1968, “SIR: A computer program for semantic information retrieval,” in M. Minsky (ed.), Semantic Information Processing, Cambridge, MA: MIT Press, 33–145.
  • Ratnaparkhi, A., 1997, “A simple introduction to maximum entropy models for natural language processing”, Technical Report 97–08, Institute for Research in Cognitive Science, University of Pennsylvania.
  • Reichenbach, H., 1947,Elements of Symbolic Logic, New York: MacMillan.
  • Rich, C. and C.L. Sidner, 1998, “COLLAGEN: A collaboration manager for software interface agents,” User Modeling and User-Adapted Interaction, 8(3–4): 315–350.
  • Robinson, D.N., 2007,Consciousness and Mental Life, New York, NY: Columbia University Press.
  • Rumelhart, D.E., P.H. Lindsay, and D.A. Norman, 1972, “A process model for long-term memory,” in E. Tulving and W. Donaldson (eds.), Organization of Memory, New York: Academic Press, 197–246.
  • Rumelhart, D.E. and J.L. McClelland, 1986, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Cambridge, MA: MIT Bradford Books.
  • Salton, G., 1989, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer, Boston, MA: Addison-Wesley.
  • Scha, R., 1981, “Distributive, collective and cumulative quantification,” in J. Groenendijk, T. Janssen, and M. Stokhof (eds.), Formal Methods in the Study of Language, Amsterdam: Mathematical Centre Tracts.
  • Schank, R.C. and R.P. Abelson, 1977, Scripts, Plans, Goals and Understanding, Hillsdale, NJ: Lawrence Erlbaum.
  • Schank, R.C. and K.M. Colby, 1973, Computer Models of Thought and Language, San Francisco: W.H. Freeman and Co.
  • Schank, R.C. and C.K. Riesbeck, 1981, Inside Computer Understanding, Hillsdale, NJ: Lawrence Erlbaum.
  • Scheutz, M., R. Cantrell, and P. Schermerhorn, 2011, “Toward human-like task-based dialogue processing for HRI,” AI Magazine, 32(4): 77–84.
  • Schoenmackers, S., J. Davis, O. Etzioni, and D.S. Weld, 2010, “Learning first-order Horn clauses from Web text,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), MIT, MA, Oct. 9–11. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Schoenmackers et al. available online (pdf)]
  • Schubert, L.K., 2000, “The situations we talk about”, in J. Minker (ed.), Logic-Based Artificial Intelligence, Dordrecht: Kluwer, 407–439.
  • –––, 2007, “Implicit Skolemization: efficient reference to dependent entities,” Research on Language and Computation, 5, April (special volume on Binding Theory, edited by A. Butler, E. Keenan, J. Mattausch and C.-C. Shan): 69–86.
  • Schubert, L.K. and F.J. Pelletier, 1982, “From English to logic: Context-free computation of ‘conventional’ logical translations”, American Journal of Computational Linguistics, 8: 27–44; reprinted in B.J. Grosz, K. Sparck Jones, and B.L. Webber (eds.), 1986, Readings in Natural Language Processing, Los Altos, CA: Morgan Kaufmann, 293–311.
  • Schubert, L.K. and M. Tong, 2003, “Extracting and evaluating general world knowledge from the Brown Corpus,” in Proceedings of the HLT-NAACL Workshop on Text Meaning, Edmonton, Alberta, May 31. Stroudsburg, PA: Association for Computational Linguistics (ACL), 7–13. [Schubert and Tong 2003 available online (pdf)]
  • Searle, J., 1969,Speech Acts, Cambridge, UK: Cambridge University Press.
  • Sebestyen, G.S., 1962, Decision-Making Processes in Pattern Recognition, New York: Macmillan.
  • Sha, F. and F. Pereira, 2003, “Shallow parsing with conditional random fields,” Human Language Technology Conference (HLT-NAACL 2003), May 27 – June 1, Edmonton, Canada. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Sha and Pereira 2003 available online (pdf)]
  • Singh, P., T. Lin, E.T. Mueller, G. Lim, T. Perkins, and W.L. Zhu, 2002, “Open Mind Common Sense: Knowledge acquisition from the general public,” in Proceedings of the 1st International Conference on Ontologies, Databases, and Applications of Semantics for Large Scale Information Systems (ODBASE 2002), Irvine, California, October 29–31. Lecture Notes in Computer Science, Volume 2519, New York: Springer, pp. 1223–1237.
  • Smith, R.W., D.R. Hipp, and A.W. Biermann, 1995, “An architecture for voice dialogue systems based on Prolog-style theorem proving,” Computational Linguistics, 21: 281–320.
  • Smolensky, P., 1988, “On the proper treatment of connectionism,”The Behavioral and Brain Sciences, 11: 1–23.
  • Smolensky, P., G. Legendre, and Y. Miyata, 1992, “Principles for an integrated connectionist and symbolic theory of higher cognition,” Technical Report CU-CS-600-92, Computer Science Department, University of Colorado, Boulder, CO.
  • Snow, R., B. O'Connor, D. Jurafsky, and A.Y. Ng, 2008, “Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), Waikiki, Honolulu, Oct. 25–27. Stroudsburg, PA: Association for Computational Linguistics (ACL), 254–263. [Snow et al. 2008 available online (pdf)]
  • Steedman, M., 2007, “Presidential Address to the 45th Annual Meeting of the ACL”, Prague, June 2007. Also printed in 2008 as “On becoming a discipline”, Computational Linguistics, 34(1): 137–144. [Steedman 2007 [2008] available online]
  • Sun, R., 2001, “Hybrid systems and connectionist implementationalism” (also listed as 2006, “Connectionist implementationalism and hybrid systems”), in L. Nadel (ed.), Encyclopedia of Cognitive Science, London, UK: Macmillan.
  • Tenenbaum, J.B., C. Kemp, T.L. Griffiths, and N.D. Goodman, 2011, “How to grow a mind: Statistics, structure, and abstraction,” Science, 331: 1279–1285.
  • Thompson, F.B., P.C. Lockermann, B.H. Dostert, and R. Deverill, 1969, “REL: A rapidly extensible language system,” in Proceedings of the 24th ACM National Conference, New York: ACM Digital Library, pp. 399–417.
  • Thompson, F.B. and B.H. Thompson, 1975, “Practical natural language processing: The REL system as prototype,” in M. Rubinoff and M.C. Yovits (eds.), Advances in Computers, vol. 13, New York: Academic Press, 109–168.
  • Turney, P.D., 2008, “The Latent Relation Mapping Engine: algorithm and experiments,” Journal of Artificial Intelligence Research, 33: 615–655.
  • Van Durme, B., P. Michalak, and L.K. Schubert, 2009, “Deriving generalized knowledge from corpora using WordNet abstraction”, 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09), Athens, Greece, Mar. 30–Apr. 3. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Van Durme, Michalak and Schubert 2009 available online (pdf)]
  • Veale, T. and Y. Hao, 2008, “A fluid knowledge representation for understanding and generating creative metaphors,” The 22nd International Conference on Computational Linguistics (COLING 2008), Manchester, UK, Aug. 18–22. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 945–952. [Veale and Hao 2008 available online (pdf)]
  • Vijay-Shanker, K. and D.J. Weir, 1994, “Parsing some constrained grammar formalisms,” Computational Linguistics, 19(4): 591–636.
  • von Ahn, L., M. Kedia, and M. Blum, 2006, “Verbosity: A game for collecting common-sense knowledge,” in R. Grinter, T. Rodden, P. Aoki, E. Cutrell, R. Jeffries, and G. Olson (eds.), Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2006), Montreal, Canada, April 22–27. New York: ACM Digital Library, pp. 75–78.
  • Weischedel, R.M. and N.K. Sondheimer, 1983, “Meta-rules as a basis for processing ill-formed input,” American Journal of Computational Linguistics, 9(3–4): 161–177.
  • Widdows, D., 2004,Geometry and Meaning, Stanford, CA: CSLI Publications.
  • Wilks, Y., 1978, “Making preferences more active,” Artificial Intelligence, 11(3): 197–223; also in N.V. Findler (ed.), 1979, Associative Networks: Representation and Use of Knowledge, Orlando, FL: Academic Press, 239–266.
  • –––, 2010, “Is a companion a distinctive kind of relationship with a machine?” in Proceedings of the 2010 Workshop on Companionable Dialogue Systems (CDS ‘10), Uppsala, Sweden. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 13–18. [Wilks 2010 available online (pdf)]
  • Winograd, T., 1972, Understanding Natural Language, New York: Academic Press.
  • Woods, W.A., R.M. Kaplan, and B.L. Nash-Webber, 1972, “The Lunar Sciences Natural Language Information System: Final Report”, BBN Report No. 2378, Bolt Beranek and Newman Inc., Cambridge, MA. (Available from NTIS as N72-28984.)
  • Yarowsky, D., 1992, “Word-sense disambiguation using statistical models of Roget's categories trained on large corpora,” in Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), Nantes, France. Stroudsburg, PA: Association for Computational Linguistics (ACL), pp. 454–60.
  • Younger, D.H., 1967, “Recognition and parsing of context-free languages in time n³,” Information and Control, 10(2): 189–208.
  • Zettlemoyer, L.S. and M. Collins, 2007, “On-line learning of relaxed CCG grammars for parsing to logical form,” in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), Prague, June 28–30. Stroudsburg, PA: Association for Computational Linguistics (ACL). [Zettlemoyer and Collins 2007 available online (pdf)]

Other Internet Resources

Acknowledgments

The author and editors would like to thank an anonymous external referee for the time he spent and the advice he gave us for improving the presentation in this entry.

Copyright © 2014 by
Lenhart Schubert <schubert@cs.rochester.edu>



The Stanford Encyclopedia of Philosophy is copyright © 2024 by The Metaphysics Research Lab, Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054

