In 1980, statistical approaches were explored and found to be more useful for many purposes than rule-based formal grammars. Discrete representations like word n-gram language models, with probabilities for discrete combinations of words, made significant advances.
In the 2000s, continuous representations for words, such as word embeddings, began to replace discrete representations.[11] Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning, and common relationships between pairs of words, such as plurality or gender, are reflected in the geometry of the space.
In 1980, the first significant statistical language model was proposed, and during the decade IBM performed 'Shannon-style' experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.[12]
A word n-gram language model is a statistical model of language which calculates the probability of the next word in a sequence from a fixed-size window of previous words. If one previous word is considered, it is a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.[13]
Special tokens are introduced to denote the start and end of a sentence, conventionally written ⟨s⟩ and ⟨/s⟩. To prevent a zero probability being assigned to unseen words, the probability of each seen word is slightly lowered to make room for the unseen words in a given corpus. To achieve this, various smoothing methods are used, from simple "add-one" smoothing (assigning a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated techniques, such as Good–Turing discounting or back-off models.
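As a rough illustration of the bigram case with add-one smoothing, the following Python sketch counts word pairs in a tiny made-up corpus and estimates conditional probabilities. The function names `train_bigram` and `prob`, the token spellings `<s>` and `</s>`, and the two example sentences are assumptions chosen for illustration; they are not taken from any particular library or from the cited sources.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def prob(word, prev, unigrams, bigrams):
    """Add-one (Laplace) estimate of P(word | prev).

    Every bigram count is incremented by 1, so unseen bigrams get a small
    non-zero probability and seen bigrams are discounted slightly.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus (hypothetical): two short sentences.
corpus = ["the rain in Spain", "the plain in Spain"]
unigrams, bigrams = train_bigram(corpus)

p_seen = prob("rain", "the", unigrams, bigrams)     # bigram seen once
p_unseen = prob("Spain", "the", unigrams, bigrams)  # bigram never seen
total = sum(prob(w, "the", unigrams, bigrams) for w in unigrams)
```

Here the seen bigram "the rain" receives a higher probability than the unseen "the Spain", yet the unseen one is no longer zero, and the estimates for a given history still sum to one over the vocabulary.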
Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The equation is

$$P(w_m \mid w_1,\ldots,w_{m-1}) = \frac{1}{Z(w_1,\ldots,w_{m-1})} \exp\left(a^{\mathsf{T}} f(w_1,\ldots,w_m)\right)$$

where $Z(w_1,\ldots,w_{m-1})$ is the partition function, $a$ is the parameter vector, and $f(w_1,\ldots,w_m)$ is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on $a$ or some form of regularization.
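A minimal Python sketch of this exponential form, assuming indicator features and hand-set weights, may help make the roles of the partition function, the parameter vector, and the feature function concrete. The names `maxent_prob` and `indicator_features`, the three-word vocabulary, and the weight values are all hypothetical; in practice the weights would be learned from data.

```python
import math

def indicator_features(history, word):
    """Names of the active indicator features: one unigram, one bigram."""
    return [("uni", word), ("bi", history[-1], word)]

def maxent_prob(history, vocab, weights, features):
    """Return P(w | history) for every w in vocab.

    With indicator features, the dot product of the parameter vector and the
    feature vector is the sum of the weights of the active features.
    """
    scores = {
        w: math.exp(sum(weights.get(name, 0.0) for name in features(history, w)))
        for w in vocab
    }
    z = sum(scores.values())  # partition function for this history
    return {w: s / z for w, s in scores.items()}

# Hypothetical vocabulary and weights, chosen by hand for illustration.
vocab = ["Spain", "rain", "plain"]
weights = {("bi", "in", "Spain"): 2.0, ("uni", "rain"): 0.5}

dist = maxent_prob(("the", "rain", "in"), vocab, weights, indicator_features)
best = max(dist, key=dist.get)
```

Dividing by the partition function is what turns the unnormalized exponential scores into a probability distribution over the vocabulary.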
The log-bilinear model is another example of an exponential language model.
[Figure: 1-skip-2-grams for the text "the rain in Spain falls mainly on the plain"]
The skip-gram language model is an attempt at overcoming the data sparsity problem that the preceding model (i.e., the word n-gram language model) faced. Words represented in an embedding vector were not necessarily consecutive anymore, but could leave gaps that are skipped over (thus the name "skip-gram").[15]
Formally, a k-skip-n-gram is a length-n subsequence where the components occur at distance at most k from each other.
For example, in the input text:
the rain in Spain falls mainly on the plain
the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences
the in, rain Spain, in falls, Spain mainly, falls on, mainly the, and on plain.
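The example above can be reproduced with a short Python sketch. It assumes that "distance at most k" means at most k tokens are skipped between each pair of adjacent components; other formulations bound the total number of skips instead, and the two readings coincide for n = 2. The function name `skip_grams` is an illustrative choice.

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """Return all k-skip-n-grams of a token list.

    Assumption: each pair of adjacent components may have at most k tokens
    skipped between them. Enumerating all index combinations is simple but
    only practical for short inputs.
    """
    grams = []
    for idx in combinations(range(len(tokens)), n):
        if all(b - a - 1 <= k for a, b in zip(idx, idx[1:])):
            grams.append(tuple(tokens[i] for i in idx))
    return grams

tokens = "the rain in Spain falls mainly on the plain".split()
grams = skip_grams(tokens, n=2, k=1)
bigrams = skip_grams(tokens, n=2, k=0)
skipped_only = [g for g in grams if g not in bigrams]
```

With these inputs, `skipped_only` holds the seven additional subsequences listed above, and `grams` contains them together with the eight ordinary bigrams.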
In the skip-gram model, semantic relations between words are represented by linear combinations, capturing a form of compositionality. For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then

$$v(\mathrm{king}) - v(\mathrm{male}) + v(\mathrm{female}) \approx v(\mathrm{queen})$$

where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.[16][17]
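The nearest-neighbor reading of ≈ can be sketched in Python with a handful of hand-made vectors. The two-dimensional embeddings below are invented for illustration (roughly "royalty" and "gender" axes) and do not come from a trained model; the helper names `nearest` and `analogy`, and the choice to exclude the query words from the candidates, are likewise assumptions.

```python
import math

# Hypothetical 2-d embeddings, written by hand; not from a trained model.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "prince": (0.8, 1.0),
    "male": (0.0, 1.0),
    "female": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def nearest(target, vectors, exclude=()):
    """Return the word whose vector is closest (Euclidean) to target."""
    candidates = (w for w in vectors if w not in exclude)
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

def analogy(a, b, c, vectors):
    """Return the word nearest to v(a) - v(b) + v(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    return nearest(target, vectors, exclude={a, b, c})

result = analogy("king", "male", "female", vectors)
```

With these toy vectors the combination lands exactly on the vector for "queen"; with learned embeddings the match is only approximate, which is why a nearest-neighbor search is needed.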
Continuous representations or embeddings of words are produced in recurrent neural network-based language models (known also as continuous space language models).[18] Such continuous space embeddings help to alleviate the curse of dimensionality, which is the consequence of the number of possible sequences of words increasing exponentially with the size of the vocabulary, further causing a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net.[19]
Large language models (LLMs) consist of billions to trillions of parameters and operate as general-purpose sequence models, generating, summarizing, translating, and reasoning over text. LLMs represent a significant new technology in their ability to generalize across tasks with minimal task-specific supervision, enabling capabilities like conversational agents, code generation, knowledge retrieval, and automated reasoning that previously required bespoke systems.[25]
Reinforcement learning, particularly policy gradient algorithms, has been adapted to fine-tune LLMs for desired behaviors beyond raw next-token prediction.[28] Reinforcement learning from human feedback (RLHF) applies these methods to optimize a policy, the LLM's output distribution, against reward signals derived from human or automated preference judgments.[29] This has been critical for aligning model outputs with user expectations, improving factuality, reducing harmful responses, and enhancing task performance.
Benchmark evaluations for LLMs have evolved from narrow linguistic assessments toward comprehensive, multi-task evaluations measuring reasoning, factual accuracy, alignment, and safety.[30][31] Hill climbing, iteratively optimizing models against benchmarks, has emerged as a dominant strategy, producing rapid incremental performance gains but raising concerns of overfitting to benchmarks rather than achieving genuine generalization or robust capability improvements.[32]
Although language models sometimes match human performance, it is not clear whether they are plausible cognitive models. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not, but fail to learn patterns that humans typically do.[33]
Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks derived from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from data they see, some proposed models investigate the rate of learning, e.g., through inspection of learning curves.[34]
Various data sets have been developed for use in evaluating language processing systems.[35] These include:
^ Andreas, Jacob; Vlachos, Andreas; Clark, Stephen (2013). "Semantic parsing as machine translation". Archived 15 August 2020 at the Wayback Machine. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
^ Ponte, Jay M.; Croft, W. Bruce (1998). A language modeling approach to information retrieval. Proceedings of the 21st ACM SIGIR Conference. Melbourne, Australia: ACM. pp. 275–281. doi:10.1145/290941.291008.
^ Hiemstra, Djoerd (1998). A linguistically motivated probabilistic model of information retrieval. Proceedings of the 2nd European conference on Research and Advanced Technology for Digital Libraries. LNCS, Springer. pp. 569–584. doi:10.1007/3-540-49653-X_34.
^ Jurafsky, Dan; Martin, James H. (7 January 2023). "N-gram Language Models". Speech and Language Processing (PDF) (3rd edition draft ed.). Retrieved 24 May 2022.
^ Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Janvin, Christian (1 March 2003). "A neural probabilistic language model". The Journal of Machine Learning Research. 3: 1137–1155 – via ACM Digital Library.
^ Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Matthew; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma (2021). "On the Opportunities and Risks of Foundation Models". arXiv:2108.07258.
^ Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361 [cs.LG].
^ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need". arXiv:1706.03762 [cs.CL].
^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
^ Christiano, Paul; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep Reinforcement Learning from Human Preferences". arXiv:1706.03741 [stat.ML].
^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
^ Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv:1804.07461 [cs.CL].
^ Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2025). "Expressing stigma and inappropriate responses prevents LLMS from safely replacing mental health providers". Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. pp. 599–627. arXiv:2009.03300. doi:10.1145/3715275.3732039. ISBN 979-8-4007-1482-5.
^ Karlgren, Jussi; Schutze, Hinrich (2015), "Evaluating Learning Language Representations", International Conference of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science, Springer International Publishing, pp. 254–260, doi:10.1007/978-3-319-64206-2_8, ISBN 978-3-319-64205-5
^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (10 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
^ Aghaebrahimian, Ahmad (2017), "Quora Question Answer Dataset", Text, Speech, and Dialogue, Lecture Notes in Computer Science, vol. 10415, Springer International Publishing, pp. 66–73, doi:10.1007/978-3-319-64206-2_8, ISBN 978-3-319-64205-5
^ Sammons, Mark; Vydiswaran, V.G. Vinod; Roth, Dan. "Recognizing Textual Entailment" (PDF). Archived from the original (PDF) on 9 August 2017. Retrieved 24 February 2019.
Jay M. Ponte; W. Bruce Croft (1998). "A Language Modeling Approach to Information Retrieval". Research and Development in Information Retrieval. pp. 275–281. CiteSeerX 10.1.1.117.4237. doi:10.1145/290941.291008.
Fei Song; W. Bruce Croft (1999). "A General Language Model for Information Retrieval". Research and Development in Information Retrieval. pp. 279–280. CiteSeerX 10.1.1.21.6467. doi:10.1145/319950.320022.