The history of natural language processing describes the advances of natural language processing. There is some overlap with the history of machine translation, the history of speech recognition, and the history of artificial intelligence.
The history of machine translation dates back to the seventeenth century, when philosophers such as Leibniz and Descartes put forward proposals for codes which would relate words between languages. All of these proposals remained theoretical, and none resulted in the development of an actual machine.
The first patents for "translating machines" were applied for in the mid-1930s. One proposal, by Georges Artsrouni, was simply an automatic bilingual dictionary using paper tape. The other proposal, by Peter Troyanskii, a Russian, was more detailed. Troyanskii’s proposal included both the bilingual dictionary and a method for dealing with grammatical roles between languages, based on Esperanto.[1][2]
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human.
In 1957, Noam Chomsky’s Syntactic Structures revolutionized linguistics with 'universal grammar', a rule-based system of syntactic structures.[3]
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[4] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.
One notably successful NLP system developed in the 1960s was SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies.
In 1969, Roger Schank introduced the conceptual dependency theory for natural language understanding.[5] This model, partially influenced by the work of Sydney Lamb, was extensively used by Schank's students at Yale University, such as Robert Wilensky, Wendy Lehnert, and Janet Kolodner.
In 1970, William A. Woods introduced the augmented transition network (ATN) to represent natural language input.[6] Instead of phrase structure rules, ATNs used an equivalent set of finite-state automata that were called recursively. ATNs and their more general format called "generalized ATNs" continued to be used for a number of years. During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.
Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's law and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[7] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.
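The contrast between hard if-then rules and soft, weighted decisions can be illustrated with a minimal sketch. The task (deciding whether a period ends a sentence), the feature names, and the weight values below are all assumptions chosen for illustration. They are not taken from any historical system, and in a real statistical model the weights would be estimated from a corpus rather than set by hand.

```python
import math

# Hypothetical binary features for deciding whether "." ends a sentence.
# Feature names and weights are illustrative assumptions, not learned values.
WEIGHTS = {
    "next_is_capitalized": 2.0,
    "prev_is_abbreviation": -3.0,
    "next_is_digit": -1.5,
}
BIAS = 0.5


def hard_rule(features: dict) -> bool:
    """Hand-written if-then rule: one exception decides the answer outright."""
    if features.get("prev_is_abbreviation", False):
        return False
    return features.get("next_is_capitalized", False)


def soft_probability(features: dict) -> float:
    """Logistic model: add the weights of active features, squash into (0, 1)."""
    score = BIAS + sum(
        weight for name, weight in WEIGHTS.items() if features.get(name, False)
    )
    return 1.0 / (1.0 + math.exp(-score))


# Conflicting evidence: an abbreviation followed by a capitalized word.
example = {"next_is_capitalized": True, "prev_is_abbreviation": True}
print(hard_rule(example))         # a flat yes/no answer
print(soft_probability(example))  # a graded probability between 0 and 1
```

With conflicting evidence, the rule returns a flat answer, while the weighted model returns a probability that a larger system could combine with other sources of evidence.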
The emergence of statistical approaches was aided by both the increase in computing power and the availability of large datasets. At that time, large multilingual corpora were starting to emerge. Notably, some were produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government.
Many of the notable early successes occurred in the field of machine translation. In 1993, the IBM alignment models were used for statistical machine translation.[8] Compared to previous machine translation systems, which were symbolic systems manually coded by computational linguists, these systems were statistical, which allowed them to automatically learn from large textual corpora. However, these systems do not work well when only small corpora are available, so data-efficient methods continue to be an area of research and development.
In 2001, a one-billion-word text corpus scraped from the Internet, referred to as "very very large" at the time, was used for word disambiguation.[9]
To take advantage of large, unlabelled datasets, algorithms were developed for unsupervised and self-supervised learning. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.

Neural language models were developed in the 1990s. In 1990, the Elman network, using a recurrent neural network, encoded each word in a training set as a vector, called a word embedding, and the whole vocabulary as a vector database, allowing it to perform such tasks as sequence prediction that are beyond the power of a simple multilayer perceptron. A shortcoming of the static embeddings was that they did not differentiate between multiple meanings of homonyms.[10]
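The homonym limitation of static embeddings can be illustrated with a toy lookup table. This is a minimal sketch: the three-dimensional vectors are made-up values, and the `embed` helper is a hypothetical name introduced only for this example. It does not reproduce the Elman network itself.

```python
# Toy static embedding table; the vectors are made-up values for illustration.
EMBEDDINGS = {
    "river": [0.9, 0.1, 0.0],
    "money": [0.0, 0.2, 0.9],
    "bank": [0.5, 0.1, 0.5],
}


def embed(sentence: str) -> list:
    """Look up one fixed vector per known word, ignoring surrounding context."""
    return [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]


# "bank" as a riverside versus "bank" as a financial institution.
river_bank = embed("river bank")[-1]
money_bank = embed("money bank")[-1]
print(river_bank == money_bank)  # True: the same vector in both contexts
```

Because the lookup depends only on the word itself, both senses of "bank" receive an identical vector, whatever the neighbouring words are.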
Yoshua Bengio developed the first neural probabilistic language model in 2000.[11] Novel algorithms, the availability of larger datasets, and higher processing power made it possible to train ever larger language models.
The attention mechanism was introduced by Bahdanau et al. in 2014.[12] This work laid the foundations for the famous "Attention is All You Need" paper[13] that introduced the Transformer architecture in 2017. The concept of the large language model (LLM) emerged in the late 2010s. An LLM is a language model trained with self-supervised learning on vast amounts of text. The earliest public LLMs had hundreds of millions of parameters,[14] but this number quickly rose to billions and even trillions.[15]
In recent years, advancements in deep learning and large language models have significantly enhanced the capabilities of natural language processing, leading to widespread applications in areas such as healthcare, customer service, and content generation.[16]
| Software | Year | Creator | Description | Ref. |
|---|---|---|---|---|
| Georgetown experiment | 1954 | Georgetown University and IBM | involved fully automatic translation of more than sixty Russian sentences into English. | |
| STUDENT | 1964 | Daniel Bobrow | could solve high school algebra word problems.[17] | |
| ELIZA | 1964 | Joseph Weizenbaum | a simulation of a Rogerian psychotherapist that rephrased the user's input using a few grammar rules.[18] | |
| SHRDLU | 1970 | Terry Winograd | a natural language system working in restricted "blocks worlds" with restricted vocabularies; it worked extremely well | |
| PARRY | 1972 | Kenneth Colby | A chatterbot | |
| KL-ONE | 1974 | Sondheimer et al. | a knowledge representation system in the tradition of semantic networks and frames; it is a frame language. | |
| MARGIE | 1975 | Roger Schank | | |
| TaleSpin (software) | 1976 | Meehan | | |
| QUALM | 1977 | Lehnert | | |
| LIFER/LADDER | 1978 | Hendrix | a natural language interface to a database of information about US Navy ships. | |
| SAM (software) | 1978 | Cullingford | | |
| PAM (software) | 1978 | Robert Wilensky | | |
| Politics (software) | 1979 | Carbonell | | |
| Plot Units (software) | 1981 | Lehnert | | |
| Jabberwacky | 1982 | Rollo Carpenter | chatterbot with stated aim to "simulate natural human chat in an interesting, entertaining and humorous manner". | |
| MUMBLE (software) | 1982 | McDonald | | |
| Racter | 1983 | William Chamberlain and Thomas Etter | chatterbot that generated English language prose at random. | |
| MOPTRANS[19] | 1984 | Lytinen | | |
| KODIAK (software) | 1986 | Wilensky | | |
| Absity (software) | 1987 | Hirst | | |
| Dr. Sbaitso | 1991 | Creative Labs | | |
| IBM Watson | 2006 | IBM | A question answering system that won the Jeopardy! contest, defeating the best human players in February 2011. | |
| Siri | 2011 | Apple | A virtual assistant developed by Apple. | |
| Cortana | 2014 | Microsoft | A virtual assistant developed by Microsoft. | |
| Amazon Alexa | 2014 | Amazon | A virtual assistant developed by Amazon. | |
| Google Assistant | 2016 | Google | A virtual assistant developed by Google. | |
| ChatGPT | 2022 | OpenAI | Generative chatbot. | |