
A Boltzmann machine (also called Sherrington–Kirkpatrick model with external field or stochastic Ising model), named after Ludwig Boltzmann, is a spin-glass model with an external field, i.e., a Sherrington–Kirkpatrick model,[1] that is a stochastic Ising model. It is a statistical physics technique applied in the context of cognitive science.[2] It is also classified as a Markov random field.[3]
Boltzmann machines are theoretically intriguing because of the locality and Hebbian nature of their training algorithm (being trained by Hebb's rule), and because of their parallelism and the resemblance of their dynamics to simple physical processes. Boltzmann machines with unconstrained connectivity have not been proven useful for practical problems in machine learning or inference, but if the connectivity is properly constrained, the learning can be made efficient enough to be useful for practical problems.[4]
They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function. They were heavily popularized and promoted by Geoffrey Hinton, Terry Sejnowski and Yann LeCun in cognitive sciences communities, particularly in machine learning,[2] as part of "energy-based models" (EBM), because Hamiltonians of spin glasses as energy are used as a starting point to define the learning task.[5]

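A Boltzmann machine, like a Sherrington–Kirkpatrick model, is a network of units with a total "energy" (Hamiltonian) defined for the overall network. Its units produce binary results. Boltzmann machine units are stochastic. The global energy $E$ in a Boltzmann machine is identical in form to that of Hopfield networks and Ising models:
$$E = -\left(\sum_{i<j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i\right)$$
Where:
$w_{ij}$ is the connection strength between unit $j$ and unit $i$.
$s_i$ is the state, $s_i \in \{0, 1\}$, of unit $i$.
$\theta_i$ is the bias of unit $i$ in the global energy function ($-\theta_i$ is the activation threshold for the unit).
Often the weights $w_{ij}$ are represented as a symmetric matrix $W = [w_{ij}]$ with zeros along the diagonal.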
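As an illustration, the global energy can be evaluated directly for a small network. The following is a minimal sketch, assuming the standard form in which the energy is the negative of the sum of pairwise weight terms (each pair counted once) and bias terms; the function name `energy` and the example values of `W` and `theta` are hypothetical.

```python
def energy(weights, biases, state):
    """Global energy E = -(sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i)."""
    n = len(state)
    # Each unordered pair (i, j) is counted once, so only j > i is visited.
    pair_term = sum(
        weights[i][j] * state[i] * state[j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    bias_term = sum(biases[i] * state[i] for i in range(n))
    return -(pair_term + bias_term)


# Hypothetical 3-unit network: symmetric weights with a zero diagonal.
W = [[0.0, 1.0, -2.0],
     [1.0, 0.0, 0.5],
     [-2.0, 0.5, 0.0]]
theta = [0.1, -0.3, 0.2]

print(energy(W, theta, [1, 1, 0]))
```

For the state `[1, 1, 0]` only the pair of units 0 and 1 is active, so the result is approximately -0.8, up to floating-point rounding.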
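The difference in the global energy that results from a single unit $i$ equaling 0 (off) versus 1 (on), written $\Delta E_i$, assuming a symmetric matrix of weights, is given by:
$$\Delta E_i = \sum_{j>i} w_{ij}\, s_j + \sum_{j<i} w_{ji}\, s_j + \theta_i$$
This can be expressed as the difference of energies of two states:
$$\Delta E_i = E_{i=\text{off}} - E_{i=\text{on}}$$
Substituting the energy of each state with its relative probability according to the Boltzmann factor (the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state) yields:
$$\Delta E_i = -k_B T \ln(p_{i=\text{off}}) - \left(-k_B T \ln(p_{i=\text{on}})\right)$$
where $k_B$ is the Boltzmann constant and is absorbed into the artificial notion of temperature $T$. Noting that the probabilities of the unit being on or off sum to 1 allows for the simplification:
$$\frac{\Delta E_i}{T} = \ln(p_{i=\text{on}}) - \ln(1 - p_{i=\text{on}}) = \ln\left(\frac{p_{i=\text{on}}}{1 - p_{i=\text{on}}}\right)$$
$$\exp\left(-\frac{\Delta E_i}{T}\right) = \frac{1}{p_{i=\text{on}}} - 1$$
whence the probability that the $i$-th unit is on is given by
$$p_{i=\text{on}} = \frac{1}{1 + \exp\left(-\frac{\Delta E_i}{T}\right)}$$
where the scalar $T$ is referred to as the temperature of the system. This relation is the source of the logistic function found in probability expressions in variants of the Boltzmann machine.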
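The stochastic update of a single unit can be sketched as follows. This is an illustrative sketch, assuming the logistic form in which the probability of a unit being on is 1 / (1 + exp(-ΔE_i / T)), with ΔE_i equal to the weighted input from the other units plus the unit's bias. The function names, the example network `W` and `theta`, and the numerically stable form of the logistic function are choices made for this example.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def energy_gap(weights, biases, state, i):
    """Delta E_i = E(unit i off) - E(unit i on), for symmetric weights."""
    n = len(state)
    return biases[i] + sum(weights[i][j] * state[j] for j in range(n) if j != i)


def prob_on(delta_e, temperature=1.0):
    """Probability that a unit with energy gap delta_e is on at temperature T."""
    return sigmoid(delta_e / temperature)


def update_unit(weights, biases, state, i, temperature=1.0, rng=random):
    """Resample unit i in place and return the probability that was used."""
    p = prob_on(energy_gap(weights, biases, state, i), temperature)
    state[i] = 1 if rng.random() < p else 0
    return p


# Hypothetical 3-unit network: symmetric weights with a zero diagonal.
W = [[0.0, 1.0, -2.0],
     [1.0, 0.0, 0.5],
     [-2.0, 0.5, 0.0]]
theta = [0.1, -0.3, 0.2]

state = [0, 1, 1]
p = update_unit(W, theta, state, 0, temperature=1.0, rng=random.Random(0))
print(p, state)
```

A higher temperature flattens the logistic curve, so the unit's state becomes closer to a fair coin flip; a lower temperature makes the unit follow the sign of its energy gap more closely.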
The network runs by repeatedly choosing a unit and resetting its state. After running for long enough at a certain temperature, the probability of a global state of the network depends only upon that global state's energy, according to a Boltzmann distribution, and not on the initial state from which the process was started. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. If the network is started at a high temperature and the temperature is gradually decreased until thermal equilibrium is reached at a lower temperature, the network may converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.
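The procedure of running the network while lowering the temperature can be sketched as below. This is a minimal sketch under stated assumptions: units are visited in a random order within each sweep, the unit update uses the logistic probability with the energy gap divided by the temperature, and the geometric cooling schedule is one common choice that the text does not prescribe. All function names and parameters are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def geometric_schedule(t_start, t_end, steps):
    """Temperatures from t_start down to t_end (steps >= 2); an assumed schedule."""
    ratio = (t_end / t_start) ** (1.0 / (steps - 1))
    return [t_start * ratio ** k for k in range(steps)]


def anneal(weights, biases, temperatures, sweeps_per_temperature=20, seed=0):
    """Run the network at each temperature in turn and return the final state."""
    rng = random.Random(seed)
    n = len(biases)
    state = [rng.randint(0, 1) for _ in range(n)]
    order = list(range(n))
    for temperature in temperatures:
        for _ in range(sweeps_per_temperature):
            rng.shuffle(order)  # visit every unit once per sweep, in random order
            for i in order:
                gap = biases[i] + sum(
                    weights[i][j] * state[j] for j in range(n) if j != i
                )
                state[i] = 1 if rng.random() < sigmoid(gap / temperature) else 0
    return state


# Hypothetical 3-unit network: symmetric weights with a zero diagonal.
W = [[0.0, 1.0, -2.0],
     [1.0, 0.0, 0.5],
     [-2.0, 0.5, 0.0]]
theta = [0.1, -0.3, 0.2]

schedule = geometric_schedule(10.0, 0.1, 5)
print(anneal(W, theta, schedule))
```

The returned state is a single sample, not a guaranteed minimum: at a low but nonzero final temperature the network still fluctuates among low-energy states.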
For the network to converge to global states with probabilities that follow an external distribution over these states, the weights must be set so that the global states with the highest probabilities get the lowest energies. This is done by training.
The units in the Boltzmann machine are divided into 'visible' units, V, and 'hidden' units, H. The visible units are those that receive information from the 'environment', i.e. the training set is a set of binary vectors over the set V. The distribution over the training set is denoted $P^{+}(V)$.
The distribution over global states converges as the Boltzmann machine reaches thermal equilibrium. We denote this distribution, after we marginalize it over the hidden units, as $P^{-}(V)$.
Our goal is to approximate the "real" distribution $P^{+}(V)$ using the $P^{-}(V)$ produced by the machine. The similarity of the two distributions is measured by the Kullback–Leibler divergence, $G$:
$$G = \sum_{v} P^{+}(v) \ln\left(\frac{P^{+}(v)}{P^{-}(v)}\right)$$
where the sum is over all the possible states of $V$. $G$ is a function of the weights, since they determine the energy of a state, and the energy determines $P^{-}(v)$, as promised by the Boltzmann distribution. A gradient descent algorithm over $G$ changes a given weight, $w_{ij}$, by subtracting the partial derivative of $G$ with respect to the weight.
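For a machine small enough to enumerate, both the marginal distribution over the visible units and the Kullback–Leibler divergence can be computed exactly. The following sketch assumes temperature 1, the energy form with pairwise weight terms and bias terms, and that the first `n_visible` units are the visible ones; the function names and the example distribution `data` are hypothetical.

```python
import itertools
import math


def energy(weights, biases, state):
    """Global energy E = -(sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i)."""
    n = len(state)
    pair_term = sum(
        weights[i][j] * state[i] * state[j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    return -(pair_term + sum(biases[i] * state[i] for i in range(n)))


def visible_marginal(weights, biases, n_visible):
    """Equilibrium distribution over visible units, summed over hidden units."""
    n = len(biases)
    states = list(itertools.product([0, 1], repeat=n))
    unnormalized = [math.exp(-energy(weights, biases, s)) for s in states]
    z = sum(unnormalized)  # partition function
    marginal = {}
    for s, u in zip(states, unnormalized):
        v = s[:n_visible]
        marginal[v] = marginal.get(v, 0.0) + u / z
    return marginal


def kl_divergence(p_data, p_model):
    """G = sum_v P+(v) ln(P+(v) / P-(v)); terms with P+(v) = 0 contribute 0."""
    return sum(
        p * math.log(p / p_model[v]) for v, p in p_data.items() if p > 0.0
    )


# Hypothetical machine: 2 visible units and 1 hidden unit, all parameters zero.
W = [[0.0] * 3 for _ in range(3)]
theta = [0.0, 0.0, 0.0]
data = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.7}

model = visible_marginal(W, theta, n_visible=2)
print(model)
print(kl_divergence(data, model))
```

With all parameters at zero every global state has the same energy, so the model distribution over visible vectors is uniform and the divergence from the non-uniform `data` is strictly positive. The enumeration grows as 2 to the power of the number of units, so this approach is only feasible for very small machines.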
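Boltzmann machine training involves two alternating phases. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to $P^{+}$). The other is the "negative" phase where the network is allowed to run freely, i.e. only the input nodes have their state determined by external data, but the output nodes are allowed to float. The gradient with respect to a given weight, $w_{ij}$, is given by the equation:[2]
$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{R}\left[p_{ij}^{+} - p_{ij}^{-}\right]$$
where:
$p_{ij}^{+}$ is the probability that units $i$ and $j$ are both on when the machine is at equilibrium in the positive phase.
$p_{ij}^{-}$ is the probability that units $i$ and $j$ are both on when the machine is at equilibrium in the negative phase.
$R$ denotes the learning rate.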
This result follows from the fact that at thermal equilibrium the probability of any global state when the network is free-running is given by the Boltzmann distribution.
This learning rule is biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (synapse, biologically) does not need information about anything other than the two neurons it connects. This is more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.
The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL-divergence is equivalent to maximizing the log-likelihood of the data. Therefore, the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete data likelihood during the M-step.
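Training the biases is similar, but uses only single node activity:
$$\frac{\partial G}{\partial \theta_{i}} = -\frac{1}{R}\left[p_{i}^{+} - p_{i}^{-}\right]$$
where $p_{i}^{+}$ and $p_{i}^{-}$ are the probabilities that unit $i$ is on when the machine is at equilibrium in the positive and negative phases, respectively, and $R$ denotes the learning rate.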
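The two-phase learning rule for weights and biases can be illustrated on a machine small enough to enumerate exactly. The sketch below is not the sampling-based procedure used in practice: it assumes temperature 1 and replaces equilibrium sampling with exact enumeration of all states, so that the positive-phase and negative-phase statistics are computed without noise. Each weight moves in proportion to the difference between the positive-phase and negative-phase co-activation probabilities, and each bias in proportion to the difference in activation probabilities. The function names, the learning rate `lr`, and the training distribution `data` are hypothetical.

```python
import itertools
import math


def energy(w, theta, s):
    n = len(s)
    pair = sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
    return -(pair + sum(theta[i] * s[i] for i in range(n)))


def boltzmann(w, theta, states):
    """Normalized Boltzmann probabilities (T = 1) over the given states."""
    unnormalized = [math.exp(-energy(w, theta, s)) for s in states]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def phase_distribution(w, theta, n_visible, data=None):
    """List of (global state, probability) pairs.

    data=None: negative phase, the network runs freely.
    data=dict: positive phase, visible units clamped to v with probability
               data[v] while hidden units follow their conditional distribution.
    """
    n = len(theta)
    if data is None:
        states = list(itertools.product([0, 1], repeat=n))
        return list(zip(states, boltzmann(w, theta, states)))
    hidden = list(itertools.product([0, 1], repeat=n - n_visible))
    joint = []
    for v, p_v in data.items():
        states = [tuple(v) + h for h in hidden]
        joint += [(s, p_v * p) for s, p in zip(states, boltzmann(w, theta, states))]
    return joint


def kl_to_data(w, theta, n_visible, data):
    model = {}
    for s, p in phase_distribution(w, theta, n_visible):
        model[s[:n_visible]] = model.get(s[:n_visible], 0.0) + p
    return sum(p * math.log(p / model[v]) for v, p in data.items() if p > 0.0)


def train_step(w, theta, n_visible, data, lr=0.1):
    """One gradient-descent step on G, updating w and theta in place."""
    pos = phase_distribution(w, theta, n_visible, data)
    neg = phase_distribution(w, theta, n_visible)
    n = len(theta)
    for i in range(n):
        # Bias rule: difference of single-unit activation probabilities.
        theta[i] += lr * (sum(p * s[i] for s, p in pos) - sum(p * s[i] for s, p in neg))
        for j in range(i + 1, n):
            # Weight rule: difference of co-activation probabilities.
            p_plus = sum(p * s[i] * s[j] for s, p in pos)
            p_minus = sum(p * s[i] * s[j] for s, p in neg)
            w[i][j] += lr * (p_plus - p_minus)
            w[j][i] = w[i][j]  # keep the weight matrix symmetric


# Hypothetical setup: 2 visible units, 1 hidden unit, parameters start at zero.
W = [[0.0] * 3 for _ in range(3)]
theta = [0.0, 0.0, 0.0]
data = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.7}

kl_before = kl_to_data(W, theta, 2, data)
for _ in range(200):
    train_step(W, theta, 2, data)
kl_after = kl_to_data(W, theta, 2, data)
print(kl_before, kl_after)
```

Note that each update to `w[i][j]` uses only statistics of units `i` and `j`, which is the locality property discussed above. In a realistic machine the two sets of statistics would be estimated by sampling at equilibrium rather than by enumeration.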
Theoretically the Boltzmann machine is a rather general computational medium. For instance, if trained on photographs, the machine would theoretically model the distribution of photographs, and could use that model to, for example, complete a partial photograph.
Unfortunately, Boltzmann machines experience a serious practical problem, namely that they seem to stop learning correctly when the machine is scaled up to anything larger than a trivial size.[citation needed] This is due to important effects, specifically:

Although learning is impractical in general Boltzmann machines, it can be made quite efficient in a restricted Boltzmann machine (RBM), which does not allow intralayer connections, i.e. there are no visible-to-visible or hidden-to-hidden connections. After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently and is one of the most common deep learning strategies. As each new layer is added the generative model improves.
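One consequence of removing intralayer connections is that, given the visible units, each hidden unit depends only on the visible layer, so all hidden units can be computed at once. The following sketch illustrates this and the idea of passing hidden activities upward through a stack. It assumes temperature 1, a weight layout in which `weights[i][j]` links visible unit `i` to hidden unit `j`, and the use of sampled binary hidden states as the data for the next layer, which is one possible choice. It does not show how the RBM parameters are trained; all names and parameter values are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_probabilities(visible, weights, hidden_biases):
    """Probability that each hidden unit is on, given the visible vector."""
    return [
        sigmoid(
            hidden_biases[j]
            + sum(visible[i] * weights[i][j] for i in range(len(visible)))
        )
        for j in range(len(hidden_biases))
    ]


def sample_binary(probabilities, rng):
    """Draw one independent binary state per probability."""
    return [1 if rng.random() < p else 0 for p in probabilities]


def propagate_up(visible, layers, rng):
    """Pass a visible vector up a stack of RBMs given as (weights, biases) pairs."""
    activity = list(visible)
    for weights, hidden_biases in layers:
        activity = sample_binary(
            hidden_probabilities(activity, weights, hidden_biases), rng
        )
    return activity


# Hypothetical stack: 3 visible units -> 2 hidden units -> 1 hidden unit.
layer1 = ([[0.5, -1.0], [1.5, 0.2], [-0.7, 0.9]], [0.0, 0.1])
layer2 = ([[1.0], [-1.0]], [0.0])

rng = random.Random(0)
print(hidden_probabilities([1, 0, 1], *layer1))
print(propagate_up([1, 0, 1], [layer1, layer2], rng))
```

Because the hidden units are conditionally independent given the visible layer, no iterative settling is needed to sample them, which is a large part of why learning in an RBM is more tractable than in a fully connected machine.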
An extension to the restricted Boltzmann machine allows using real valued data rather than binary data.[6]
One example of a practical RBM application is in speech recognition.[7]
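A deep Boltzmann machine (DBM) is a type of binary pairwise Markov random field (undirected probabilistic graphical model) with multiple layers of hidden random variables. It is a network of symmetrically coupled stochastic binary units. It comprises a set of visible units $\boldsymbol{\nu} \in \{0,1\}^{D}$ and layers of hidden units $\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \ldots$. No connection links units of the same layer (like an RBM). For the DBM, shown here with three hidden layers, the probability assigned to vector $\boldsymbol{\nu}$ is
$$p(\boldsymbol{\nu}) = \frac{1}{Z} \sum_{h} \exp\left(\sum_{ij} W_{ij}^{(1)} \nu_i h_j^{(1)} + \sum_{jl} W_{jl}^{(2)} h_j^{(1)} h_l^{(2)} + \sum_{lm} W_{lm}^{(3)} h_l^{(2)} h_m^{(3)}\right),$$
where $\boldsymbol{h} = \{\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \boldsymbol{h}^{(3)}\}$ are the set of hidden units, $Z$ is the normalizing constant, and $\theta = \{\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)}\}$ are the model parameters, representing visible-hidden and hidden-hidden interactions.[8] In a DBN only the top two layers form a restricted Boltzmann machine (which is an undirected graphical model), while lower layers form a directed generative model. In a DBM all layers are symmetric and undirected.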
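For a very small DBM the probability of a visible vector can be computed by brute force, which makes the layered structure concrete. The sketch below assumes a model without bias terms in which only adjacent layers interact, through one weight matrix per pair of adjacent layers, and it works for any number of hidden layers. The function names and the example weights are hypothetical, and the enumeration is exponential in the number of units, so it is only a illustration of the definition and not a practical inference method.

```python
import itertools
import math


def bilinear(x, w, y):
    """Sum over i, j of x_i * w[i][j] * y_j."""
    return sum(x[i] * w[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def joint_score(v, hidden_layers, weight_matrices):
    """Exponent of the unnormalized joint probability of all layers."""
    layers = [v] + list(hidden_layers)
    return sum(
        bilinear(layers[k], weight_matrices[k], layers[k + 1])
        for k in range(len(weight_matrices))
    )


def visible_probability(v, weight_matrices):
    """Probability of visible vector v, by enumerating every hidden state."""
    n_visible = len(weight_matrices[0])
    sizes = [len(w[0]) for w in weight_matrices]  # units per hidden layer

    def unnormalized(vis):
        total = 0.0
        for flat in itertools.product([0, 1], repeat=sum(sizes)):
            hidden_layers, start = [], 0
            for size in sizes:  # split the flat tuple into one tuple per layer
                hidden_layers.append(flat[start:start + size])
                start += size
            total += math.exp(joint_score(vis, hidden_layers, weight_matrices))
        return total

    # Normalizing constant: sum over all visible and hidden configurations.
    z = sum(
        unnormalized(vis) for vis in itertools.product([0, 1], repeat=n_visible)
    )
    return unnormalized(tuple(v)) / z


# Hypothetical DBM: 2 visible units, then hidden layers of 2 and 1 units.
W1 = [[0.5, -0.2], [0.1, 0.3]]
W2 = [[0.4], [-0.6]]

for vis in itertools.product([0, 1], repeat=2):
    print(vis, visible_probability(vis, [W1, W2]))
```

The four printed probabilities sum to 1, since the normalizing constant is computed over the same set of configurations.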
Like DBNs, DBMs can learn complex and abstract internal representations of the input in tasks such as object or speech recognition, using limited, labeled data to fine-tune the representations built using a large set of unlabeled sensory input data. However, unlike DBNs and deep convolutional neural networks, they pursue the inference and training procedure in both directions, bottom-up and top-down, which allows the DBM to better unveil the representations of the input structures.[9][10][11]
However, the slow speed of DBMs limits their performance and functionality. Because exact maximum likelihood learning is intractable for DBMs, only approximate maximum likelihood learning is possible. Another option is to use mean-field inference to estimate data-dependent expectations and approximate the expected sufficient statistics by using Markov chain Monte Carlo (MCMC).[8] This approximate inference, which must be done for each test input, is about 25 to 50 times slower than a single bottom-up pass in DBMs. This makes joint optimization impractical for large data sets, and restricts the use of DBMs for tasks such as feature representation.
The need for deep learning with real-valued inputs, as in Gaussian RBMs, led to the spike-and-slab RBM (ssRBM), which models continuous-valued inputs with binary latent variables.[12] Similar to basic RBMs and their variants, a spike-and-slab RBM is a bipartite graph, while like GRBMs, the visible units (input) are real-valued. The difference is in the hidden layer, where each hidden unit has a binary spike variable and a real-valued slab variable. A spike is a discrete probability mass at zero, while a slab is a density over a continuous domain;[13] their mixture forms a prior.[14]
An extension of ssRBM called μ-ssRBM provides extra modeling capacity using additional terms in the energy function. One of these terms enables the model to form a conditional distribution of the spike variables by marginalizing out the slab variables given an observation.
In a more general mathematical setting, the Boltzmann distribution is also known as the Gibbs measure. In statistics and machine learning it is called a log-linear model. In deep learning the Boltzmann distribution is used in the sampling distribution of stochastic neural networks such as the Boltzmann machine.
The Boltzmann machine is based on the Sherrington–Kirkpatrick spin glass model by David Sherrington and Scott Kirkpatrick.[15] The seminal publication by John Hopfield (1982) applied methods of statistical mechanics, mainly the recently developed (1970s) theory of spin glasses, to study associative memory (later named the "Hopfield network").[16]
The original contribution in applying such energy-based models in cognitive science appeared in papers by Geoffrey Hinton and Terry Sejnowski.[17][18][19] In a 1995 interview, Hinton stated that in February or March 1983 he was going to give a talk on simulated annealing in Hopfield networks, so he had to design a learning algorithm for the talk, resulting in the Boltzmann machine learning algorithm.[20]
The idea of applying the Ising model with annealed Gibbs sampling was used in Douglas Hofstadter's Copycat project (1984).[21][22]
The explicit analogy drawn with statistical mechanics in the Boltzmann machine formulation led to the use of terminology borrowed from physics (e.g., "energy"), which became standard in the field. The widespread adoption of this terminology may have been encouraged by the fact that its use led to the adoption of a variety of concepts and methods from statistical mechanics. The various proposals to use simulated annealing for inference were apparently independent.
Similar ideas (with a change of sign in the energy function) are found in Paul Smolensky's "Harmony Theory".[23] Ising models can be generalized to Markov random fields, which find widespread application in linguistics, robotics, computer vision and artificial intelligence.
In 2024, Hopfield and Hinton were awarded the Nobel Prize in Physics for their foundational contributions to machine learning, such as the Boltzmann machine.[24]