opencog/learn
Neuro-symbolic interpretation learning (mostly just language-learning, for now)
The "big ideas" describe how this project thinks about the world.The "medium ideas" describe the basic algorithm being used.The "small ideas" describe what has been implemented and is doable(has been done) with the current technology and code base. Current focusis on language; we're looking for collaborators on image recognitionand/or movement classification.
Today, AI (AGI) theory stands divided into two camps: the neural net approaches, and the symbolic approaches. Both approaches have important results and shed light on the key problem of intelligence. In the middle, between them, is the symbol grounding problem. A common complaint about symbolic reasoning over common-sense knowledge is the problem of knowing what a symbol is: when a chatbot talks about chairs and tables, sports teams and favorite colors, how can it know what a "chair" is? A "chair" is a symbol, but what does it "mean"? Conventional neural net approaches have the opposite problem: they don't have or use symbols, and so are not amenable to leveraging the power of logical inference.
The aim of this project is to solve the symbol grounding problem. The proposed solution is that symbols are simply names for large, complex, sparse networks of hierarchical relationships. Thus, the goal is to discover the large, complex, sparse networks of hierarchical relationships in unstructured data. Symbols then become coherent networks of relationships. Thus, a "chair" is both a collection of all "facts" one might know about chairs, as well as an English word participating in the complex grammatical network of the English language. The word "chair" is not so much a "symbol" as it is a bridge between these complex networks. Reasoning can be done "sub-symbolically", on the network of relationships; the result of reasoning can be articulated verbally, adhering to the rules of grammar. Reasoning itself is not (inherently) logical; rather, reasoning is powered by a set of "valid" manipulations applied to the network, the validity of which is also learned and encoded in a complex network. Thus, the goal of this project is to not only learn what a "chair" is from environmental stimulus, and not only learn how to use the word "chair" in a grammatical sentence, but also to learn the rules of logic, and the rules of common-sense reasoning, out of thin air. The rest of this file (and other documents in this and other repos) expands in detail on how the above can be accomplished.
BTW: to be clear, the aim of this project is to create a system that can interpret vision, and sound, and language, and sensory data in general, from a single, unified theoretical framework. The proof of concept for some of these domains is underway, albeit proceeding slowly. Helpers needed.
Here's the big idea: everything is a graph -- all knowledge is linked information. For example, knowledge-bases describe relations between things - entities, ideas, concepts.
Graphs are described by local connectivity: what is connected to what, nearby. Different problem domains call these local connections by various different names: they might be called assertions, statements, rules, facts, axioms, ontologies, abstract syntax trees, directed acyclic graphs, decision trees, decision forests, Prolog statements, and so on.
Many local neighborhoods of a graph look alike, and connect in similar ways. Such similarities also have many different names: an "instance of a class", an "example of this type", a "template", a "general rule", an "axiom schema", and so on.
The way in which regions of graphs look locally similar can be described by a grammar. The way they connect is a syntax. Valid graphs have shapes that follow from the syntax/grammar. The meaning or the semantics of the graph lies in the connections themselves.
This project takes the above to be the structure of knowledge and meaning. Stronger and more detailed arguments for why this is the correct definition of "knowledge" and "meaning", together with concrete definitions, details and examples, are provided by the PDFs in the AtomSpace "sheaf" directory.
If you want to describe something, some idea, some concept, some situation or event, you have several choices: draw a picture, make a movie, write some text, dance about it, build a machine.
If you are writing text, what you are "actually doing" is taking the network of interconnected facts/ideas, and serializing them into a sequence of words. A sequential, time-like order, one word after another. You serialize the graph of ideas.
When you read, and try to understand and learn, you try to deserialize the words, and reconstruct the graph-network of ideas in your mind.
The goal of this project is to build a serializer/deserializer pair. More narrowly, to convert natural language into a graph of ideas, and conversely, to express the knowledge-graph as a sequence of sentences. Narrower still, the goal is to extract the grammar and syntax from a corpus of natural language text.
This narrow focus is the starting point. Thus, most of what follows will talk about grammar. That focus is necessary to make forward progress, although the vision above hints at a far more general, far more powerful possibility for working with knowledge. This vision should be portable to (bio-)chemical structures, the 3D shapes of physical objects, the correlation of images to text, the correlation of movement and action to sound and text. But to keep things focused, for now it's just text.
The "graph of knowledge" sketched above is assumed to be asparsegraph, or olde-school "symbolic AI". This is in contrast to neural netsand deep learning, where knowledge is encoded in a dense graph, thenetwork connectivity graph of weight matrices, vectors and the like.The deep learning industry has plenty of folks working on it. It'sa multi-billion-dollar industry. We are not competing with that behemoth.The goal of this project is to move forward on sparse-graph knowledgerepresentation. (Note however: some parts of sparse graphs are denselyconnected, and, for those parts, deep-learning type structures may beideal. This idea is explored in detail in several PDF's here andelsewhere. The heading for that discussion is "matrix factorization",or rather, "sparse matrix factorization.")
This is an ongoing project, with continuing if sometimes sporadic activity.
In a text, there is both a small-scale structure and a large-scale structure. The small-scale structure consists of grammatically correct sentences written in some natural language. The large-scale structure is determined by the type of the text: is it a short story? A dissertation? An owner's manual? The large-scale structure flows from sentence to sentence, paragraph to paragraph, as repeating ideas, characters, topics and themes recur throughout the text.
There is also an intermediate-scale structure that has been well-developed by linguists; a particularly apt one is Mel'čuk's "Meaning-Text Theory", which describes ideas like "Deep Syntax" and "Lexical Functions".
The goal of this project is to recursively extract deeper and deeper layers of structure, starting from the surface structure of a sequence of words. The author believes this can be done, so perhaps one could say the goal of this project is to prove that it can be done.
Effectively, this means that the goal is really to produce an autonomous agent. This agent observes its environment (a time-series of sensory inputs, such as sight, sound, and, to keep it simple, plain text). From these observations, it builds a world model that encapsulates what it has learned. Finally, based on its world model, it generates actions. Something gets done, some action is performed (for example, a physical movement of a robot, a change of facial expression in a robot, or, to keep things simple, generation of some text).
In the narrowest sense, this agent can be imagined to be a chatbot, perhaps answering questions based on pattern matches against its world model. More generally, the goal is that the agent should be able to conduct entertaining conversations when called upon, or to write erudite paragraphs (of grammatically correct English, in appropriate narrative form) articulating some particular portion of the World Model. As the World Model necessarily includes a self-model, this implies a certain degree of "self-awareness". In this sense, the agent is imagined to be an AGI, in the complete sense.
Obviously, no such thing has been created. However, the current methods developed here appear to be promising, and there is no apparent impediment in sight, other than perhaps the general scale of the project itself. Plainly, there's a lot of code to write, test and debug, and a lot of research and development to be performed.
The above sketch is sufficiently abstract that, while it may be hard to disagree with, it may also be hard to figure out how to write code for it. Below is a sketch of what is actually done. It involves a recurring process of generating vectors (of several different kinds), applying similarity measures to those vectors, and then forming clusters or classes (or "islands") of similar vectors. Distance information can be used to extract a graphical structure, which is decomposed into nearest neighbors, which can then be formed again into vectors. This is used to ratchet up a part-whole hierarchy, where parts are distinctly identified as "those things that are similar" and the whole-relationship is given by the local graphical structure. As icing on the cake, all of the above is done with maximum-entropy principles in mind.
The general process is as follows (a code sketch follows the list):
- Make pair-wise observations, counting frequency of occurrences.
- Interpret pairs as the rows & columns of a matrix: each row and column is a vector. These vectors are extremely high-dimensional and sparse.
- Obtain pair-wise mutual information (MI). Any distance measure will do, but the information-theoretic ground of maximum-entropy principles seems best.
- Re-observe data, this time using the pair-wise distances to form a spanning tree or spanning graph. The only edges allowed in this graph are those with high MI (i.e. small distance, those that are close to one another).
- Factor the graph into vertices and their nearest neighbors. This factorization resembles an N-gram or skip-gram: the list of neighbors is that skip-gram.
- Just as skip-grams can be interpreted as vectors, the same is possible here. The pair (vertex, nearest-neighbors) can be taken as the row and column addresses of a matrix, and the numeric value of that matrix entry is just the number of times that (vertex, neighbor) combination has been observed.
- Apply similarity measures to the row-vectors of this matrix, and cluster together or agglomerate similar vectors.
- These clusters are then the desired outcome: these are the classes of similar or identical observations. The next time some sequence of events is observed, it can be classed into one of these classes.
- Because the classes have (vertex, neighbor) information, they encode a graph. That is, they explicitly encode part-whole relationships. Any given vertex is a "part"; its neighbors determine how it's connected to the "whole".
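The first two steps can be sketched in a few lines of ordinary Python. This is an illustration only, not the project's actual Scheme/AtomSpace code; the function names and the toy corpus are invented for the example.

```python
# Illustrative sketch of pair counting and pair-wise MI; not the repo's code.
from collections import Counter
from math import log2

def count_pairs(tokens, window=2):
    """Count co-occurring (left, right) token pairs within a small window."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            pairs[(left, right)] += 1
    return pairs

def pair_mi(pairs):
    """Pointwise MI per pair: log2 [ P(l,r) / (P(l,*) P(*,r)) ]."""
    total = sum(pairs.values())
    left_marg, right_marg = Counter(), Counter()
    for (l, r), n in pairs.items():
        left_marg[l] += n
        right_marg[r] += n
    return {(l, r): log2(n * total / (left_marg[l] * right_marg[r]))
            for (l, r), n in pairs.items()}

tokens = "the cat sat on the mat the dog sat on the rug".split()
mi = pair_mi(count_pairs(tokens))
print(sorted(mi.items(), key=lambda kv: -kv[1])[:5])
```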
To obtain hierarchical relationships, one notes that the dimensional reduction obtained through clustering can then be used to tackle combinatoric explosions and sparse data "at the next level". Thus, for example, N-grams/skip-grams are famously high-dimensional (e.g. for a vocabulary of V = 10K words, there are (10K)^3 = a trillion possible 3-grams). By the graphical agglomeration procedure given above, this can be reduced to P = 100 different "parts of speech" (grammatical classes: nouns, adjectives, verbs, transitive verbs, etc.) with graphical structure: the encoded graph is much, much smaller. Now that the data size is manageable, the learning process can be restarted, this time looking at correlations between sentences and between paragraphs, as opposed to correlations between words. So, for example, if a paragraph contains N = 100 words, it becomes practical to treat that paragraph as a single N-gram with N = 100, since the (graphical) structure of the smaller units in that paragraph is now available.
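For a sense of scale (using the numbers just quoted), the reduction from word-level 3-grams to class-level 3-grams is

$$
V^3 = (10^4)^3 = 10^{12}
\qquad\longrightarrow\qquad
P^3 = (10^2)^3 = 10^6
$$

i.e. a factor of a million fewer combinations to track.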
A key difference between the code here and conventional neural net techniques is that all training here is "unsupervised". There are no "training sets"; the learning is done entirely by extracting statistical regularity from the input stream. In the parlance, it is not "turtles all the way down", but rather "the buck stops here".
This refers to the problem of training in neural nets: the training sets are created by expert human markup, usually by grad students. In effect, the neural net is only as smart as the grad student who provided the training data. Politically charged examples of this include facial recognition suites that mistreat people of color, as opposed to white people. This effect has been termed "turtles all the way down", in reference to the idea that the Earth sits on the back of a turtle: but what does the turtle sit on? The neural net sits on the back of a grad student.
The current code base implements more-or-less traditional high-dimensional agglomerative clustering. It does NOT use 3rd-party systems or libraries, for three reasons:
- Other systems do not handle extremely sparse data efficiently.
- The high-dimensional data produced here is already in the form of a graph, and is stored in a graph format in the AtomSpace. Copying this data out, and back in again, is wasteful of CPU. It also requires complicated import/export code. The complexity of import/export is about equal to that of do-it-yourself clustering, so nothing is gained by using 3rd-party tools.
- The formation of clusters here is not quite as simple as throwing similar labels into buckets. The vectors have to be merged, basis-element by basis-element, and those vectors encode graphical structure. For example, when one merges (vertex-A, neighbors-of-A) with (vertex-B, neighbors-of-B), it is not enough to merge just A and B; one must also merge (vertex-X, neighbors-of-X) whenever a neighbor-of-X includes A or B. This would require a callback from the clustering code to perform that graphical merge. 3rd-party tools are incapable of such callbacks. (A sketch of this merge follows the list.)
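Here is a minimal Python sketch of why that merge is more than bucketing: merging vertices A and B also forces a rewrite of every other row whose neighbor list mentions A or B. The names and the toy data are invented for illustration; this is not the project's merge code.

```python
# Illustrative merge of two vertices in a (vertex -> neighbor-counts) table.
from collections import Counter

def merge_vertices(rows, a, b, merged):
    """Merge vertices a and b into one class, rewriting both the merged row
    and every row that mentions a or b as a neighbor."""
    new_rows = {}
    for vertex, neighbors in rows.items():
        # Rename the vertex itself if it is one of the two being merged.
        key = merged if vertex in (a, b) else vertex
        target = new_rows.setdefault(key, Counter())
        for neighbor, count in neighbors.items():
            # Rename the neighbor too -- the step a generic clustering
            # library cannot do without a callback into our graph.
            target[merged if neighbor in (a, b) else neighbor] += count
    return new_rows

rows = {"cat": Counter({"sat": 3, "the": 5}),
        "dog": Counter({"sat": 2, "the": 4}),
        "sat": Counter({"cat": 3, "dog": 2})}
print(merge_vertices(rows, "cat", "dog", merged="animal"))
```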
There are two key differences between the vectors being formed here and the vectors that are commonplace in neural-net learning systems. First, the vectors here are extremely high-dimensional and sparse. Second, vector similarity is NOT judged by cosine angles, but by information-theoretic measures.
A short tirade: the neural-net people are probably making a mistake by using cosine angles. The cosine is great when a vector lives in a Euclidean space, because Euclidean spaces have rotational symmetry. But there is no reason to expect neural-net vectors to live in Euclidean space, and it is rather silly to force a rotational symmetry onto a problem that has none. The problem here is the word "vector".
In conventional mathematics, a "vector" is a collection of (number, basis-element) pairs, and exists "naturally" in Euclidean space. But for neural nets, and for the current code base, this collection should be interpreted as a collection of (number-between-zero-and-one, basis-element) pairs, that is, as a frequency or probability: one should write P(A) as the number (probability) at basis element A (event A). The "vectors" are actually points in a simplex, and have the structure of a sigma-algebra. Whenever a vector has the structure of a sigma-algebra, it should be interpreted as a probability, and the similarity of two vectors is then given by (take your pick) the conditional probability, the Kullback-Leibler divergence, the conditional or mutual entropy, the mutual information, etc. It is NOT given by the dot-product or the cosine angle! Note, BTW, that the mutual information does include a dot-product inside of it; it just has different normalization factors that break Euclidean rotational symmetry, and replace it with maximum-entropy principles.
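The contrast can be made concrete with a toy comparison: the same dot-product, normalized two different ways. The PMI-flavored score below only illustrates the idea (normalize by marginal probabilities, not Euclidean norms); it is not necessarily the exact measure used in this repo.

```python
# Cosine vs. a PMI-flavored similarity on sparse count vectors (dicts).
from math import log2, sqrt

def cosine(u, v):
    """Euclidean cosine: dot-product over the product of Euclidean norms."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def pmi_overlap(u, v, total):
    """Same dot-product, but normalized by marginal counts (probabilities)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    if dot == 0:
        return float("-inf")
    return log2(dot * total / (sum(u.values()) * sum(v.values())))

cat = {"sat": 3, "the": 5, "purred": 2}   # toy (word -> context-count) rows
dog = {"sat": 2, "the": 4, "barked": 3}
print(cosine(cat, dog), pmi_overlap(cat, dog, total=100))
```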
Another major difference between the code here and conventional neural net theory is that the explicit graphical structure is extracted from the data, in the form of graphical nearest-neighbors. Thus, a (word, neighboring-words) pair is the conventional definition of an N-gram. If some neighbors are excluded, this is a skip-gram. The present case is similar, but with several key distinctions.
First: neighbors are determined not by the physical proximity of one word to another (how close they are in a sentence), but rather by the mutual information between the words. (Here, a "word" can be literally a "word", or more generically "some time-series event observed in nature".)
Second: it is not the raw MI-proximity that counts, but rather the spanning tree or spanning graph that connects vertices (words). That is, in formulating this "graphical skip-gram", one does not just grab the nearest neighbors as measured by the distance function. One instead attempts to minimize/maximize the distance measure (the entropy) for the entire local graph. This has the effect of discarding (not forming) some edges despite those edges being short: it is more important to minimize/maximize the local, collective graph, the local connectivity of all vertices, and not just pairs of them.
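One way to picture this is a Kruskal-style maximum spanning tree over MI-weighted word pairs: edges are accepted in order of decreasing MI, but only if they join two previously unconnected components, so a short (high-MI) edge can still be discarded. The sketch below is an illustration under that assumption (distinct tokens, MI keyed on sentence-order word pairs), not the repo's actual spanning-tree parser.

```python
# Illustrative Kruskal-style maximum spanning tree over MI-weighted word pairs.
def mst_parse(words, mi):
    """Keep the highest-MI edges that connect all the words into a tree."""
    parent = {w: w for w in words}          # union-find forest

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]   # path compression
            w = parent[w]
        return w

    edges = sorted(((mi[(a, b)], a, b)
                    for i, a in enumerate(words)
                    for b in words[i + 1:] if (a, b) in mi),
                   reverse=True)
    tree = []
    for weight, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:                        # accept only edges that join components
            parent[ra] = rb
            tree.append((a, b, weight))
    return tree

words = ["the", "cat", "sat"]
mi = {("the", "cat"): 1.5, ("cat", "sat"): 2.0, ("the", "sat"): 0.1}
print(mst_parse(words, mi))   # [('cat', 'sat', 2.0), ('the', 'cat', 1.5)]
```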
Third: the use of "graphical skip-grams", as described here, provides a natural bridge to the concept of syntax and grammar. That is, a "graphical skip-gram" is a syntactic element, in a very real and formal sense of languages and grammars. This is a foundational, key observation, the details of which can be found in the 1991 Link Grammar papers: Sleator, Temperley, "Parsing English with a Link Grammar". The earliest mention of these concepts seems to be in a book: "Algebraic Linguistics; Analytical Models", from 1967, by Solomon Marcus (published by Elsevier).
Fourth: the result of the agglomerative clustering on feature vectors can be interpreted as "types", in the formal type-theoretical sense. The Link Grammar "link types" really are types. They are short-hand for ideas like S\NP and S/VP found in pregroup grammars, or more generally in phrase-structure grammars. The "learning" being performed here does not "just" glom together identical feature vectors; it does more: it explains exactly how those feature vectors should be interpreted as graphs, and how that graph structure has regularities defined by a grammar.
Fifth: meaning. There are frequent arguments made that neural nets extract "meaning", or that weight vectors can be interpreted as "meaning". These arguments carry over into the present case. They are strengthened, because the feature vectors are no longer "just vectors"; they are also components of a graph, components of a grammar.
There are many -- many dozens -- of different NLP projects and frameworks, written in Python, Java, Perl, and another half-dozen programming ecosystems. The code here uses none of these. Why?
The answer is illustrated by a simple, practical problem encountered when one tries to perform scientific research within these frameworks. It is illustrated in the README-Architecture file.
A simple sketch of how the above can be applied to vision (2D images) is presented here. No code has been written. Very little theoretical exploration has been done.
(See also: the Language learning wiki for an alternate overview.)
In order to make forward progress, the current actual scope of the project is limited to what is reasonably achievable with a relatively small amount of work. The current primary focus is on unsupervised language learning.
The goal of this project is to create a system that is capable of learning the grammar and some of the semantics of natural language. The fundamental goal is to do this in an unsupervised fashion, with no training data beyond that of un-annotated raw text.
An early draft of how this can be done is presented in the paper "Language Learning", B. Goertzel and L. Vepstas (2014), on ArXiv; see ArXiv abs/1401.3372. Some of this is repeated on the language learning wiki. Updates, describing how to move past state-of-the-art results, are presented in the "Sheaf Theory" paper and in the "Neural-Net vs. Symbolic Machine Learning" paper. Other ideas and details are sketched in "Stitching Together Vector Spaces" and "Messaging".
The key process is a sequence of steps that can be used to extract the structure of the "language graph", with each step being noisy and error-prone, but the subsequent step averaging over this structure in such a way as to cancel the noise, and allow the actual graphical structure to emerge. Central to this process is extracting the actual structure of the graph, as opposed to the extraction of vector spaces, as is currently done with neural-net techniques or other styles of clustering.
The core insight is that although vector spaces can be layered onto language data (thus explaining the success of modern-day neural net techniques), there is also a way of "stitching together" these vector spaces in such a way that they extract the underlying graphical structure. The appearance of vector spaces is not an accident: the axioms of algebraic linguistics (such as those found in categorial grammar, pregroup grammar, link grammar) are similar to the axioms that define a vector space. The similarity accounts for why techniques such as Word2Vec are successful. The dissimilarity is why these same techniques are incomplete. The road-block can be overcome by viewing the language graph (and graphs in general) as being constructed out of connected parts, with explicitly-named connectors. This connector-based approach to the definition of a graph just so happens to obey the axioms of a sheaf (as commonly understood in sheaf theory). This connector-based approach also indicates how vector spaces can be assembled, connected together to form the graph from which they are being sampled. The code here is an exploration and implementation of this insight.
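A toy rendering of the "connected parts with explicitly-named connectors" idea, in the spirit of Link Grammar: each word carries a disjunct of named connectors, and a link forms when a + connector mates with a - connector of the same name. The encoding below is invented for illustration and is not the repo's data format.

```python
# Toy connector/disjunct encoding, illustrative only.
# A disjunct lists the connectors a word must mate with:
#   "D-" means "link leftward to a word offering D+", and so on.
lexicon = {
    "the": [("D+",)],
    "cat": [("D-", "S+")],
    "sat": [("S-",)],
}

def can_link(plus, minus):
    """Two connectors mate when they share a name and point at each other."""
    return plus[:-1] == minus[:-1] and plus.endswith("+") and minus.endswith("-")

# "the cat sat": D+ on "the" mates with D- on "cat"; S+ on "cat" with S- on "sat".
print(can_link("D+", "D-"), can_link("S+", "S-"), can_link("D+", "S-"))
```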
A very early and yet pretty presentation of algebraic linguistics, as seen axiomatically, is given by Solomon Marcus, "Algebraic Linguistics; Analytical Models", (1967), Elsevier.
A feels-right explanation of how one extracts semantics from disjuncts is given by E. A. Nida, "The Molecular Level of Lexical Semantics", (1997), International Journal of Lexicography, 10(4): 265–274. What Nida is saying can, in fact, be measured, by correlating, for example, disjuncts with WordNet word-senses. The correlation is real and is measurable (and has been measured). The goal of this project is to move beyond this.
In 2019 we realized that training on English corpora does not offer sufficient control to measure the quality of the learning algorithm. Thus, we've devised a new approach: create a random grammar, create a corpus of sentences from that random grammar, learn a grammar from that corpus, and validate that the learned grammar accurately matches the input grammar. Doing this will allow the learning algorithm to be correctly calibrated for grammars of different sizes and complexities, and for corpora of different sizes. We will be able to measure how accuracy scales as a function of training time, how well different training algorithms perform, how large a corpus is needed to get good results, and other related questions.
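The shape of that calibration loop can be sketched in a few lines. Everything below is a crude, self-contained toy: the random "grammar" is just word classes plus allowed class orderings, and the "learner" just groups words by sentence position. None of this is the actual tooling in the fake/ or run/ directories.

```python
# Toy calibration loop: random grammar -> corpus -> crude learner -> compare.
import random

def make_random_grammar(rng, num_classes=3, words_per_class=4):
    """A toy grammar: word classes plus a few allowed class orderings."""
    classes = {c: [f"w{c}_{i}" for i in range(words_per_class)]
               for c in range(num_classes)}
    templates = [tuple(rng.sample(range(num_classes), k=num_classes))
                 for _ in range(3)]
    return classes, templates

def generate_corpus(grammar, rng, n_sentences=1000):
    classes, templates = grammar
    return [[rng.choice(classes[c]) for c in rng.choice(templates)]
            for _ in range(n_sentences)]

def learn_classes(corpus):
    """A stand-in learner: group words that appear in the same position."""
    by_position = {}
    for sentence in corpus:
        for pos, word in enumerate(sentence):
            by_position.setdefault(pos, set()).add(word)
    return by_position

rng = random.Random(42)
grammar = make_random_grammar(rng)
learned = learn_classes(generate_corpus(grammar, rng))
print("reference classes:", grammar[0])
print("learned position-classes:", learned)
```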
As of May 2021, a basic infrastructure for doing the above has been set up and mostly automated. Some initial calibration runs have been performed. A multitude of difficult theoretical questions were promptly exposed; see README-Calibration for these. The primary issue is that it seems very easy to generate complex and chaotic grammars which produce ergodic corpora. There is no structure to something that is ergodic, so attempting to extract a grammar is like trying to find structure in white noise. These artificial grammars do not resemble those of natural human languages, unless the parameters are very carefully chosen. Judging this and evaluating results remains challenging.
Please contact us via email or the OpenCog Discord chat for details.
It seems like the above concepts (from the "medium idea" section) should also work "just fine" for images (pixels in images). That's because a 2D image is "just like" a 1-D stream of words, in that there are obvious neighbors, and nearby pixels are correlated. The pixel-pair MI is a stand-in for an edge detector; the minimum spanning tree just picks out all of the nearby, inter-related image parts (areas of the image that frequently co-occur), and the grammar provides the part-whole hierarchy for the image. Thus, it "seems like it should work".
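For a sense of what "pixel-pair MI" could mean, here is a minimal sketch assuming binarized pixels and horizontally adjacent pairs only. It is purely illustrative; as the text says, no such code has actually been written for this repo.

```python
# Illustrative pixel-pair MI over horizontally adjacent, binarized pixels.
from collections import Counter
from math import log2

def pixel_pair_mi(image):
    """image: 2D list of 0/1 pixels; returns PMI for each (left, right) value pair."""
    pairs = Counter()
    for row in image:
        for left, right in zip(row, row[1:]):
            pairs[(left, right)] += 1
    total = sum(pairs.values())
    left_m, right_m = Counter(), Counter()
    for (l, r), n in pairs.items():
        left_m[l] += n
        right_m[r] += n
    return {(l, r): log2(n * total / (left_m[l] * right_m[r]))
            for (l, r), n in pairs.items()}

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [1, 1, 0, 0]]
print(pixel_pair_mi(image))
```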
The existing code would choke and be unbearably slow; thus, lots of brand-new code is needed to handle pixels. Interested collaborators need to raise their hands! This task is suitable for programmers/coders familiar with image processing techniques; the more abstract parts of the theory should not interfere with the more mundane tasks. Note, however: this is a risky endeavor: it seems like it should work, but there is no guarantee. Competing with conventional neural-net technology is hard.
A different problem that this framework should be able to tackle is the understanding of coordinated motion. Consider, for example, the movement of football players on a field: based on their movements, one can predict what kind of play might be made. Similarly, the movements of a ballerina -- the hand, elbow, hip, knee movements -- are correlated and can be classified into "dance moves". This problem is unlike image recognition, because one is given the positions of the players on the field, or the positions of the dancer's limbs, rather than being given a 2D pixelated image. Despite these differences, the overall algorithm "should still work" on this data. (It seems like it should work! But what do we know?)
The basic rules of mechanical engineering -- inclined planes, levers, screws, rope and pulleys -- assemble in the form of a grammar. Can these be discovered from observation?
Consider children's toy blocks: blocks and cylinders, columns, pyramids and arches. These have rules for stacking; for example, cylinders roll, and cannot be stacked unless upright. These can be considered to be grammatical rules, describing the range of allowable combinations that give stable structures. The "corpus" is the collection of all allowed, stable structures. Can the grammar of block stacking, the grammar of mechanical engineering, be learned from the corpus of stable structures?
Can it be learned incrementally? Can it be learned by playing? How might this be done? Is the current theory suitable for this, and, if not, why not? Can the current theory be amended so that it is suitable for learning mechanical engineering?
See the README-Natural file for a description of the "open-loop" (un-calibrated) processing system. It describes the processing steps in detail. Getting good results requires tuning a variety of parameters, and so careful monitoring is required.
See the README-Natural-v2 file for the new "integrated", "continuous learning" pipeline. Under development.
The README-Calibration file proposed a technique for calibrating the learning pipeline, by generating artificial languages with known, bounded statistical properties, then learning them, and then measuring the accuracy of the learned language vs. the generated artificial language. This approach is currently abandoned. It's not a bad idea; it's just that the naive conception won't work.
A quick overview:
download - code for downloading sample corpora off the intertubes.
fake - code for generating artificial grammars.
learn-lang-diary - diary and notes and papers describing results and theory.
run - scripts for running the learning pipeline.
run-config - configuration parameters for the learning pipeline.
run-common - generic scripts used in multiple different steps.
scm - the code that actually does all the work.
tests - unit tests. These should pass. Run `make test`.
attic - old datasets and old experimental runs.
scm/attic - old source code and instruments.
All the "heavy lifting" is done in the OpenCogAtomSpace. The AtomSpace is agraph database for working with typed (hyper-)graphs. (Typed hypergraphsare ideal for storing very abstract kinds of knowledge.) The AtomSpacecan be manipulated through Atomese, Python, C++, Haskell and Schemebindings. This project glues all of the parts together with Scheme(guile). If you are a Python fan,sorry! The goal here is rapid prototyping, easy experimentation, rapidreconfiguration/redesign. For that, Scheme is just simpler, better,faster. (At least, for me.)
Long-term, the best and finest algorithms will probably be re-written in C++ (for speed), and exported with Atomese, Python, Haskell, Scheme (pick your favorite) bindings. This migration process should happen as we gain understanding of what the problem is, and what reasonable solutions look like.
This is a science project. The goal is to determine how things work, run experiments, create and refine new algorithms. Thus, the code in this repo is organized like a science lab: stuff lying around in a bit of a jumble, sometimes connected and working, and sometimes not. Sometimes with instructions and an operating manual, and sometimes not. If you're a scientist, you're used to this. If you're a software engineer, you might find all this to be a vast incomprehensible mess of cryptic code. Sorry about that! This is a permanent construction site, evolving and changing regularly.
Yes, the code "works". It works the same way any science experimentworks: you turn it on, things happen, and its up to you to figure outwhat that means. What to do with the results. If you were expectingan intelligent AI that you can crack jokes with, sorry, we don't havethat here.
THE END.