Language model

From Wikipedia, the free encyclopedia

Statistical model of language

A language model is a model of the human brain's ability to produce natural language.^[1]^[2] Language models are useful for a variety of tasks, including speech recognition,^[3] machine translation,^[4] natural language generation (generating more human-like text), optical character recognition, route optimization,^[5] handwriting recognition,^[6] grammar induction,^[7] and information retrieval.^[8]^[9]

Large language models (LLMs), currently their most advanced form, are predominantly based on transformers trained on larger datasets (frequently using texts scraped from the public internet). They have superseded recurrent neural network-based models, which had previously superseded purely statistical models such as the word n-gram language model.

History

Noam Chomsky did pioneering work on language models in the 1950s by developing a theory of formal grammars.^[10]

In 1980, statistical approaches were explored and found to be more useful for many purposes than rule-based formal grammars. Discrete representations like word n-gram language models, with probabilities for discrete combinations of words, made significant advances.

In the 2000s, continuous representations for words, such as word embeddings, began to replace discrete representations.^[11] Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning, and that common relationships between pairs of words, such as plurality or gender, are also encoded.

Pure statistical models

In 1980, the first significant statistical language model was proposed, and during the decade IBM performed 'Shannon-style' experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.^[12]

Models based on word n-grams

This section is an excerpt from Word n-gram language model.

A word n-gram language model is a statistical model of language which calculates the probability of the next word in a sequence from a fixed-size window of previous words. If one previous word is considered, it is a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.^[13]
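Written out, the fixed-window assumption replaces the full history in the chain rule with the previous n − 1 words. The notation below ($w_i$ for the i-th word, $m$ for the sentence length) is borrowed from the exponential-model formula later in this article and is an assumption here; positions before the start of the sentence are taken to be padding tokens.

$P(w_{1},\ldots ,w_{m})=\prod _{i=1}^{m}P(w_{i}\mid w_{1},\ldots ,w_{i-1})\approx \prod _{i=1}^{m}P(w_{i}\mid w_{i-(n-1)},\ldots ,w_{i-1})$

For a bigram model (n = 2), each factor reduces to $P(w_{i}\mid w_{i-1})$.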

Special tokens, $\langle s\rangle$ and $\langle /s\rangle$, are introduced to denote the start and end of a sentence. To prevent a zero probability being assigned to unseen words, the probability of each seen word is slightly lowered to make room for the unseen words in a given corpus. To achieve this, various smoothing methods are used, from simple "add-one" smoothing (assigning a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated techniques, such as Good–Turing discounting or back-off models.
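A minimal sketch of a bigram model with add-one smoothing may make the idea concrete. The toy corpus, the function names `train_bigram` and `add_one_prob`, and the whitespace tokenisation are all assumptions made for illustration; the vocabulary is simply taken to be every token seen in training, including the two sentence markers.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def add_one_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing: every bigram count is raised by 1."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus, for illustration only.
corpus = ["the rain in Spain", "the plain in Spain"]
unigrams, bigrams = train_bigram(corpus)

seen = add_one_prob("the", "rain", unigrams, bigrams)     # bigram observed in the corpus
unseen = add_one_prob("the", "Spain", unigrams, bigrams)  # bigram never observed
print(seen, unseen)
```

The unseen bigram receives a small non-zero probability, and the seen bigram receives somewhat less than its raw relative frequency, which is the "making room" described above.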

Word n-gram models have largely been superseded by recurrent neural network–based models, which in turn have been superseded by Transformer-based models often referred to as large language models.^[14]

Exponential

Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The equation is

$P(w_{m}\mid w_{1},\ldots ,w_{m-1})={\frac {1}{Z(w_{1},\ldots ,w_{m-1})}}\exp(a^{T}f(w_{1},\ldots ,w_{m}))$

where $Z(w_{1},\ldots ,w_{m-1})$ is the partition function, $a$ is the parameter vector, and $f(w_{1},\ldots ,w_{m})$ is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on $a$ or some form of regularization.
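The formula can be sketched in a few lines of code for the simplest case, where each feature is an indicator for one n-gram. The feature list, the weight values and the five-word vocabulary below are made up for illustration, and the partition function is computed by summing over that small vocabulary.

```python
import math

# Hypothetical indicator features: each fires when the history ends with the
# given context and the candidate word equals the target.
FEATURES = [
    (("the",), "rain"),
    (("the",), "plain"),
    (("in",), "Spain"),
]
WEIGHTS = [1.2, 0.4, 2.0]  # parameter vector a (made-up values)
VOCAB = ["rain", "plain", "Spain", "in", "the"]

def feature_vector(history, word):
    """f(w_1..w_m): 1.0 where the n-gram (context + word) is present, else 0.0."""
    return [
        1.0 if tuple(history[-len(ctx):]) == ctx and word == target else 0.0
        for ctx, target in FEATURES
    ]

def score(history, word):
    """Unnormalised score exp(a^T f)."""
    f = feature_vector(history, word)
    return math.exp(sum(a_i * f_i for a_i, f_i in zip(WEIGHTS, f)))

def maxent_prob(history, word):
    """P(word | history): the score divided by the partition function Z(history)."""
    z = sum(score(history, w) for w in VOCAB)
    return score(history, word) / z

probs = {w: maxent_prob(["the"], w) for w in VOCAB}
print(probs)
```

Words whose features fire with larger weights receive more probability mass; words with no active feature share the remainder equally.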

The log-bilinear model is another example of an exponential language model.

Skip-gram model

This section is an excerpt from Word n-gram language model § Skip-gram language model.

1-skip-2-grams for the text "the rain in Spain falls mainly on the plain"

The skip-gram language model is an attempt at overcoming the data sparsity problem that the preceding model (i.e. the word n-gram language model) faced. The words represented in an embedding vector need no longer be consecutive, but can leave gaps that are skipped over (thus the name "skip-gram").^[15]

Formally, a k-skip-n-gram is a length-n subsequence where the components occur at distance at most k from each other.

For example, in the input text:

the rain in Spain falls mainly on the plain

the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences

the in, rain Spain, in falls, Spain mainly, falls on, mainly the, and on plain.
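The enumeration above can be reproduced with a short sketch. The function name `skip_grams` is hypothetical, and the distance condition is read here as "at most k tokens skipped in total across the subsequence", which is an assumption; for 2-grams this coincides with any per-gap reading.

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """Return length-n subsequences that skip at most k tokens in total."""
    grams = []
    for idx in combinations(range(len(tokens)), n):
        # Tokens skipped between the first and last chosen positions.
        skipped = idx[-1] - idx[0] - (n - 1)
        if skipped <= k:
            grams.append(tuple(tokens[i] for i in idx))
    return grams

tokens = "the rain in Spain falls mainly on the plain".split()
bigrams = skip_grams(tokens, n=2, k=0)          # ordinary bigrams
one_skip_bigrams = skip_grams(tokens, n=2, k=1)  # bigrams plus one-gap pairs
print(sorted(set(one_skip_bigrams) - set(bigrams)))
```

Enumerating all index combinations is only practical for short inputs; it is used here because it mirrors the definition directly.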

In the skip-gram model, semantic relations between words are represented by linear combinations, capturing a form of compositionality. For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then

$v(\mathrm {king} )-v(\mathrm {male} )+v(\mathrm {female} )\approx v(\mathrm {queen} )$

where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.^[16]^[17]
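The nearest-neighbor reading of ≈ can be illustrated with a toy example. The three-dimensional vectors below are hand-made for illustration and are not learned embeddings; the use of cosine similarity and the exclusion of the three query words from the candidates are likewise assumptions of this sketch.

```python
import math

# Hand-made 3-d vectors for illustration only (not learned embeddings).
# Dimensions, loosely: royalty, maleness, femaleness.
VECTORS = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "male":   [0.0, 1.0, 0.0],
    "female": [0.0, 0.0, 1.0],
    "apple":  [0.1, 0.1, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Nearest neighbor of v(a) - v(b) + v(c), excluding the three query words."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = (w for w in VECTORS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

result = analogy("king", "male", "female")
print(result)
```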

Neural models

Recurrent neural network

Continuous representations or embeddings of words are produced in recurrent neural network-based language models (known also as continuous space language models).^[18] Such continuous space embeddings help to alleviate the curse of dimensionality: the number of possible word sequences grows exponentially with sequence length (a vocabulary of V words admits V^n sequences of n words), which in turn causes a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net.^[19]
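A quick calculation shows the scale of the sparsity problem: with a vocabulary of V words there are V^n distinct sequences of n words, so a count-based model has far more possible n-grams than any corpus can cover. The vocabulary size below is a hypothetical round number chosen for illustration.

```python
vocab_size = 10_000  # hypothetical vocabulary size

# Number of distinct word sequences of each length n.
sequence_counts = {n: vocab_size ** n for n in range(1, 5)}

for n, count in sequence_counts.items():
    print(f"{n}-word sequences: {count:.0e}")
```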

Large language models

This section is an excerpt from Large language model.

A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation.^[20]^[21] The largest and most capable LLMs are generative pre-trained transformers (GPTs) and provide the core capabilities of chatbots such as ChatGPT, Gemini, Perplexity and Claude. LLMs can be fine-tuned for specific tasks or guided by prompt engineering.^[22] These models acquire predictive power regarding syntax, semantics, and ontologies^[23] inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on.^[24]

They consist of billions to trillions of parameters and operate as general-purpose sequence models, generating, summarizing, translating, and reasoning over text. LLMs represent a significant new technology in their ability to generalize across tasks with minimal task-specific supervision, enabling capabilities like conversational agents, code generation, knowledge retrieval, and automated reasoning that previously required bespoke systems.^[25]

LLMs evolved from earlier statistical and recurrent neural network approaches to language modeling. The transformer architecture, introduced in 2017, replaced recurrence with self-attention, allowing efficient parallelization, longer context handling, and scalable training on unprecedented data volumes.^[26] This innovation enabled models like GPT, BERT, and their successors, which demonstrated emergent behaviors at scale such as few-shot learning and compositional reasoning.^[27]
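As a rough illustration of why self-attention parallelizes well, the sketch below computes scaled dot-product self-attention for a short sequence. It is a simplification under stated assumptions: the queries, keys and values are all taken to be the raw input vectors (a real transformer applies learned projections first and uses multiple heads), and the input values are arbitrary.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this position's query to every position's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # arbitrary 3-token, 2-d input
y = self_attention(x)
print(y)
```

Each output position depends only on the inputs, not on the outputs of earlier positions, so the loop iterations are independent of one another, unlike the step-by-step hidden-state updates of a recurrent network.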

Reinforcement learning, particularly policy gradient algorithms, has been adapted to fine-tune LLMs for desired behaviors beyond raw next-token prediction.^[28] Reinforcement learning from human feedback (RLHF) applies these methods to optimize a policy, the LLM's output distribution, against reward signals derived from human or automated preference judgments.^[29] This has been critical for aligning model outputs with user expectations, improving factuality, reducing harmful responses, and enhancing task performance.

Benchmark evaluations for LLMs have evolved from narrow linguistic assessments toward comprehensive, multi-task evaluations measuring reasoning, factual accuracy, alignment, and safety.^[30]^[31] Hill climbing, iteratively optimizing models against benchmarks, has emerged as a dominant strategy, producing rapid incremental performance gains but raising concerns of overfitting to benchmarks rather than achieving genuine generalization or robust capability improvements.^[32]

Although language models sometimes match human performance, it is not clear whether they are plausible cognitive models. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not, but fail to learn patterns that humans typically do.^[33]

Evaluation and benchmarks

Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks derived from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from data they see, some proposed models investigate the rate of learning, e.g., through inspection of learning curves.^[34]

Various data sets have been developed for use in evaluating language processing systems.^[35] These include:

Massive Multitask Language Understanding (MMLU)^[36]
Corpus of Linguistic Acceptability^[37]
GLUE benchmark^[38]
Microsoft Research Paraphrase Corpus^[39]
Multi-Genre Natural Language Inference
Question Natural Language Inference
Quora Question Pairs^[40]
Recognizing Textual Entailment^[41]
Semantic Textual Similarity Benchmark
SQuAD question answering Test^[42]
Stanford Sentiment Treebank^[43]
Winograd NLI
BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs^[44]

References

[1] Blank, Idan A. (November 2023). "What are large language models supposed to model?". Trends in Cognitive Sciences. 27 (11): 987–989. doi:10.1016/j.tics.2023.08.006. PMID 37659920. "LLMs are supposed to model how utterances behave."
[2] Jurafsky, Dan; Martin, James H. (2021). "N-gram Language Models" (PDF). Speech and Language Processing (3rd ed.). Archived from the original on 22 May 2022. Retrieved 24 May 2022.
[3] Kuhn, Roland; De Mori, Renato (1990). "A cache-based natural language model for speech recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 12 (6): 570–583.
[4] Andreas, Jacob; Vlachos, Andreas; Clark, Stephen (2013). "Semantic parsing as machine translation". Archived 15 August 2020 at the Wayback Machine. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
[5] Liu, Yang; Wu, Fanyou; Liu, Zhiyuan; Wang, Kai; Wang, Feiyue; Qu, Xiaobo (2023). "Can language models be used for real-world urban-delivery route optimization?". The Innovation. 4 (6) 100520. Bibcode:2023Innov...400520L. doi:10.1016/j.xinn.2023.100520. PMC 10587631. PMID 37869471.
[6] Pham, Vu; et al. (2014). "Dropout improves recurrent neural networks for handwriting recognition". Archived 11 November 2020 at the Wayback Machine. 14th International Conference on Frontiers in Handwriting Recognition. IEEE.
[7] Htut, Phu Mon; Cho, Kyunghyun; Bowman, Samuel R. (2018). "Grammar induction with neural language models: An unusual replication". Archived 14 August 2022 at the Wayback Machine. arXiv:1808.10000.
[8] Ponte, Jay M.; Croft, W. Bruce (1998). A language modeling approach to information retrieval. Proceedings of the 21st ACM SIGIR Conference. Melbourne, Australia: ACM. pp. 275–281. doi:10.1145/290941.291008.
[9] Hiemstra, Djoerd (1998). A linguistically motivated probabilistic model of information retrieval. Proceedings of the 2nd European Conference on Research and Advanced Technology for Digital Libraries. LNCS, Springer. pp. 569–584. doi:10.1007/3-540-49653-X_34.
[10] Chomsky, N. (September 1956). "Three models for the description of language". IRE Transactions on Information Theory. 2 (3): 113–124. Bibcode:1956IRTIT...2..113C. doi:10.1109/TIT.1956.1056813. ISSN 2168-2712.
[11] "The Nature Of Life, The Nature Of Thinking: Looking Back On Eugene Charniak's Work And Life". 22 February 2022. Archived from the original on 3 November 2024. Retrieved 5 February 2025.
[12] Rosenfeld, Ronald (2000). "Two decades of statistical language modeling: Where do we go from here?". Proceedings of the IEEE. 88 (8): 1270–1278. Bibcode:2000IEEEP..88.1270R. doi:10.1109/5.880083. S2CID 10959945.
[13] Jurafsky, Dan; Martin, James H. (7 January 2023). "N-gram Language Models". Speech and Language Processing (PDF) (3rd edition draft ed.). Retrieved 24 May 2022.
[14] Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Janvin, Christian (1 March 2003). "A neural probabilistic language model". The Journal of Machine Learning Research. 3: 1137–1155 – via ACM Digital Library.
[15] Guthrie, David; et al. (2006). "A Closer Look at Skip-gram Modelling" (PDF). Archived from the original (PDF) on 17 May 2017. Retrieved 27 April 2014.
[16] Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Efficient estimation of word representations in vector space". arXiv:1301.3781 [cs.CL].
[17] Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed Representations of Words and Phrases and their Compositionality (PDF). Advances in Neural Information Processing Systems. pp. 3111–3119. Archived (PDF) from the original on 29 October 2020. Retrieved 22 June 2015.
[18] Karpathy, Andrej. "The Unreasonable Effectiveness of Recurrent Neural Networks". Archived from the original on 1 November 2020. Retrieved 27 January 2019.
[19] Bengio, Yoshua (2008). "Neural net language models". Scholarpedia. Vol. 3. p. 3881. Bibcode:2008SchpJ...3.3881B. doi:10.4249/scholarpedia.3881. Archived from the original on 26 October 2020. Retrieved 28 August 2015.
[20] Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Matthew; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma (2021). "On the Opportunities and Risks of Foundation Models". arXiv:2108.07258.
[21] Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda (2020). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs.CL].
[22] Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (December 2020). Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.F.; Lin, H. (eds.). "Language Models are Few-Shot Learners" (PDF). Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 1877–1901. arXiv:2005.14165. doi:10.1145/3582269.3615599. Archived (PDF) from the original on 17 November 2023. Retrieved 14 March 2023.
[23] Fathallah, Nadeen; Das, Arunav; De Giorgis, Stefano; Poltronieri, Andrea; Haase, Peter; Kovriguina, Liubov (26 May 2024). NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning (PDF). Extended Semantic Web Conference 2024. Hersonissos, Greece.
[24] Manning, Christopher D. (2022). "Human Language Understanding & Reasoning". Daedalus. 151 (2): 127–138. doi:10.1162/daed_a_01905. S2CID 248377870. Archived from the original on 17 November 2023. Retrieved 9 March 2023.
[25] Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361 [cs.LG].
[26] Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need". arXiv:1706.03762 [cs.CL].
[27] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
[28] Christiano, Paul; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep Reinforcement Learning from Human Preferences". arXiv:1706.03741 [stat.ML].
[29] Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
[30] Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv:1804.07461 [cs.CL].
[31] Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2025). "Expressing stigma and inappropriate responses prevents LLMS from safely replacing mental health providers". Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. pp. 599–627. arXiv:2009.03300. doi:10.1145/3715275.3732039. ISBN 979-8-4007-1482-5.
[32] Recht, Benjamin; Roelofs, Rebecca; Schmidt, Ludwig; Shankar, Vaishaal (2019). "Do ImageNet Classifiers Generalize to ImageNet?". arXiv:1902.10811 [cs.CV].
[33] Hornstein, Norbert; Lasnik, Howard; Patel-Grosz, Pritty; Yang, Charles (9 January 2018). Syntactic Structures after 60 Years: The Impact of the Chomskyan Revolution in Linguistics. Walter de Gruyter GmbH & Co KG. ISBN 978-1-5015-0692-5. Archived from the original on 16 April 2023. Retrieved 11 December 2021.
[34] Karlgren, Jussi; Schutze, Hinrich (2015). "Evaluating Learning Language Representations". International Conference of the Cross-Language Evaluation Forum. Lecture Notes in Computer Science. Springer International Publishing. pp. 254–260. doi:10.1007/978-3-319-64206-2_8. ISBN 978-3-319-64205-5.
[35] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (10 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
[36] Hendrycks, Dan (14 March 2023). Measuring Massive Multitask Language Understanding. Archived from the original on 15 March 2023. Retrieved 15 March 2023.
[37] "The Corpus of Linguistic Acceptability (CoLA)". nyu-mll.github.io. Archived from the original on 7 December 2020. Retrieved 25 February 2019.
[38] "GLUE Benchmark". gluebenchmark.com. Archived from the original on 4 November 2020. Retrieved 25 February 2019.
[39] "Microsoft Research Paraphrase Corpus". Microsoft Download Center. Archived from the original on 25 October 2020. Retrieved 25 February 2019.
[40] Aghaebrahimian, Ahmad (2017). "Quora Question Answer Dataset". Text, Speech, and Dialogue. Lecture Notes in Computer Science. Vol. 10415. Springer International Publishing. pp. 66–73. doi:10.1007/978-3-319-64206-2_8. ISBN 978-3-319-64205-5.
[41] Sammons, Mark; Vydiswaran, V.G. Vinod; Roth, Dan. "Recognizing Textual Entailment" (PDF). Archived from the original (PDF) on 9 August 2017. Retrieved 24 February 2019.
[42] "The Stanford Question Answering Dataset". rajpurkar.github.io. Archived from the original on 30 October 2020. Retrieved 25 February 2019.
[43] "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". nlp.stanford.edu. Archived from the original on 27 October 2020. Retrieved 25 February 2019.
[44] "llama/MODEL_CARD.md at main · meta-llama/llama". GitHub. Retrieved 28 December 2024.
