Probabilities of a random variable's observations lie in the range $[0,1]$, whereas log probabilities transform them to the log scale. What is the corresponding range of log probabilities? That is, what does a probability of 0 become, and is it the minimum of the range? What does a probability of 1 become, and is that the maximum of the log probability range? And what is the intuition behind this being of any practical use compared to $[0,1]$?
I know that log probabilities allow for numerically stable computations such as summation. But besides arithmetic, how does this transformation improve applications compared with using raw probabilities? A comparative example for a continuous random variable, before and after logging, would be helpful.
- 5 $\begingroup$ Log of probability zero is just log of zero as usual, and so indeterminate. That's not fatal to other uses. With entropy we say $p \log p \equiv 0$ whenever $p = 0$, which can be justified more rigorously. Logarithms can be useful to the extent that probabilities multiply, and for other reasons. The logarithm of a probability density can be useful too: the logarithm of a Gaussian density is a quadratic, and other distributions may helpfully be compared on a log scale, as that stretches smaller densities. Log of cumulative probability or survival probability can be useful too. $\endgroup$ – Nick Cox, Aug 20, 2020 at 14:34
- 7 $\begingroup$ New questions deserve ... new questions. $\endgroup$ – Nick Cox, Aug 20, 2020 at 14:38
- 6 $\begingroup$ No; as said, comments are not for new questions. It's easy enough to find discussions of that. Also, while entropy (Shannon sense) is based on probabilities and log probabilities, it is neither a probability nor a log probability. $\endgroup$ – Nick Cox, Aug 20, 2020 at 14:45
- 3 $\begingroup$ @NickCox "Indeterminate" isn't quite the right word for $\log (0)$. In this context, we can identify the symbol $\log(0)$ with the limit $\lim_{x\to0^+} \log(x) = -\infty$ in a determinate way, meaning that $\log(0) = -\infty$ "makes sense" in a way that "indeterminate" expressions like $0/0$ don't. For example, we can't say that $1/2 = \lim_{x\to0}\frac{x}{2x} = 0/0 = \lim_{x\to0}\frac{x}{x} = 1$ because $0/0$ is indeterminate. The expression $\log(0)$ does not have this ambiguity with the use of the $=$ sign, making it determinate, despite not being a real number. $\endgroup$ – Eric Perkerson, Aug 21, 2020 at 22:17
- 2 $\begingroup$ Hint: a typical problem is to compare Binomial probabilities of $x$ successes in a sample of $n$ when the chances of each success are either $p$ or $q.$ With, say, $x=500,$ $n=1000,$ and $(p,q)=(0.05,0.051)$ (not unusual values), what are these probabilities? (Answers: $0.9\times10^{-835}$ and $0.21\times10^{-825}.$) What is their ratio? (Evidently around $10^{10},$ computed using logarithms.) In the usual IEEE double precision representation, what are these probabilities? Answers: $0$ and $0,$ because they are too tiny! What is their ratio? It can't be computed in double precision. $\endgroup$ – Jun 21, 2024 at 14:51
5 Answers
The log of $1$ is just $0$, and the limit as $x$ approaches $0$ (from the positive side) of $\log x$ is $-\infty$. So the range of values for log probabilities is $(-\infty, 0]$.
The real advantage is in the arithmetic. Log probabilities are not as easy to understand as probabilities (for most people), but every time you multiply together two probabilities (other than $1 \times 1 = 1$), you end up with a value closer to $0$. Dealing with numbers very close to $0$ can become unstable with finite-precision approximations, so working with logs makes things much more stable, and in some cases quicker and easier. Why do you need any more justification than that?
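To see the instability concretely, here is a minimal Python sketch (the number of events and their probability are arbitrary example values):

```python
import math

# 1000 independent events, each with probability 0.1
probs = [0.1] * 1000

# Multiplying directly underflows: the true value 1e-1000 is far below
# the smallest representable double (about 5e-324)
direct = 1.0
for p in probs:
    direct *= p
assert direct == 0.0

# Summing logs keeps all the information
log_prob = sum(math.log(p) for p in probs)
# log_prob is about -2302.6, i.e. log(10 ** -1000)
```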
- $\begingroup$ Is there more behind the intuition explaining how a measure that is negative and unbounded came to be preferable to one that is already bounded in $[0,1]$? $\endgroup$ – develarist, Aug 20, 2020 at 14:34
- 3 $\begingroup$ Log scale is not preferable and I don't think @Greg Snow is saying that either. It is just useful, as explained. $\endgroup$ – Nick Cox, Aug 20, 2020 at 14:37
- 16 $\begingroup$ @develarist As the answer already mentions, if you want to represent a very, very, very small probability, then in the commonly used digital representations (e.g. IEEE 754 floating point numbers) it's preferable to store it as a huge negative number in the logarithmic representation instead of a very small positive number close to 0 in the direct representation, since in the latter case you'll have larger numerical errors in every calculation, caused by the difference between the true value and the closest value that can be represented with the finite precision used in that encoding. $\endgroup$ – Peteris, Aug 20, 2020 at 23:21
- 5 $\begingroup$ @develarist Since $\log(xy)=\log(x)+\log(y)$, doing calculations of joint probabilities in "log-space" is trivial and numerically accurate. $\endgroup$ – Peteris, Aug 21, 2020 at 15:55
- 2 $\begingroup$ @develarist Of course, some representation error is unavoidable; however, there's a quantitative difference in how much these errors accumulate over all the calculations you are going to make, since you get some extra representation error in each intermediate value, and so the scale of these (many) intermediate values and the resulting errors matters more than the scale and representation error of the initial input. Often there's limited need to convert between log-probabilities and actual probabilities; e.g. if I need argmax or beam search, it can be done directly in log-space. $\endgroup$ – Peteris, Aug 24, 2020 at 20:25
I would add that taking the log of a probability or probability density often simplifies certain computations, such as calculating the gradient of the density with respect to some of its parameters. This is particularly true when the density belongs to the exponential family, whose log often involves fewer special-function calls than the density itself. This makes taking the derivative by hand simpler (product rules become simpler sum rules) and can also lead to more stable numerical derivative calculations, such as finite differencing.
As an illustration, let's take the Poisson with probability function $$f_x=e^{-\lambda}\frac{\lambda^{x}}{x!}.$$ While $x$ is a discrete variable, the function is smooth with respect to $\lambda$. Applying a logarithmic transformation, the function becomes $$\log f_x= -\lambda + x\log(\lambda) - \log(x!),$$ and its derivative with respect to $\lambda$ is $$\frac{\partial \log f_x}{\partial \lambda} = -1 + \frac{x}{\lambda},$$ an expression involving just two simple operations. Now, contrast this with the partial derivative of $f_x$ itself: $$\frac{\partial f_x}{\partial \lambda} = \frac{e^{-\lambda } (x-\lambda) \lambda ^{x-1}}{x!}:$$ the calculation involves natural exponentiation, real exponentiation, computation of a factorial, and, worst of all, division by a factorial. This costs more computation time and numerical stability, even in this simple example. The effect is compounded for more complex probability functions, as well as when observing an i.i.d. sample of random variables, since these are added in log space but multiplied in probability space (again complicating derivative calculation, as well as introducing more of the floating-point error mentioned in the other answer).
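A quick numerical check of the two gradient routes (Python; the values $\lambda = 3$ and $x = 5$ are arbitrary):

```python
import math

lam, x = 3.0, 5

def f(lam):
    # Poisson pmf: exp(-lam) * lam**x / x!
    return math.exp(-lam) * lam ** x / math.factorial(x)

def log_f(lam):
    # Logged pmf: -lam + x*log(lam) - log(x!)
    return -lam + x * math.log(lam) - math.lgamma(x + 1)

# Analytic derivative of the log pmf: -1 + x/lam
analytic = -1 + x / lam

# Central finite differences agree, whether we difference log f directly
# or difference f and divide by f (chain rule)
h = 1e-6
numeric_logged = (log_f(lam + h) - log_f(lam - h)) / (2 * h)
numeric_unlogged = (f(lam + h) - f(lam - h)) / (2 * h) / f(lam)
assert abs(analytic - numeric_logged) < 1e-5
assert abs(analytic - numeric_unlogged) < 1e-5
```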
These gradient expressions are used in both analytic and numerical computation of maximum a posteriori ($\ell_0$ Bayes) and maximum likelihood estimators. They are also used in the numerical solution of the method of moments estimating equations, often via Newton's method, which involves Hessian computations, i.e. second derivatives. Here, the difference between logged and unlogged complexity can be huge. And finally, the log-likelihood is used to show the equivalence between least squares and maximum likelihood under a Gaussian error structure.
- $\begingroup$ The $\log(x!)$ term should be subtracted, not added, when taking the logarithm of the Poisson probability function. $\endgroup$ – Ruben Garcia, Aug 27, 2020 at 1:32
As an example of the process mentioned in Greg Snow's answer: I quite often use high-level programming languages (Octave, Maxima[*], Gnuplot, Perl,...) to compute ratios between marginal likelihoods for Bayesian model comparison. If one tries to compute the ratio of marginal likelihoods directly, intermediate steps in the calculation (and sometimes the final result too) very frequently go beyond the capabilities of the floating-point number implementation in the interpreter/compiler, producing numbers so small that the computer can't tell them apart from zero, when all the important information is in the fact that those numbers are actually not quite zero. If, on the other hand, one works in log probabilities throughout, and takes the difference between the logarithms of the marginal likelihoods at the end, this problem is much less likely to occur.
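A sketch of the failure mode in Python (the per-datum log-likelihood contributions are made up, but the magnitudes are realistic for marginal likelihoods):

```python
import math

# Hypothetical per-datum log-likelihood contributions for two models
log_lik_A = [-46.0] * 50
log_lik_B = [-46.2] * 50

log_mA = sum(log_lik_A)   # log marginal likelihood of model A: -2300
log_mB = sum(log_lik_B)   # log marginal likelihood of model B: -2310

# Direct computation underflows: exp(-2300) is 0.0 in double precision,
# so the ratio of marginal likelihoods is an indeterminate 0/0
assert math.exp(log_mA) == 0.0

# Working in logs throughout, the Bayes factor is recovered at the end
log_bayes_factor = log_mA - log_mB   # about 10
bayes_factor = math.exp(log_bayes_factor)   # about 2.2e4
```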
[*] Sometimes, Maxima evades the problem by using rational-number arithmetic instead of floating-point arithmetic, but one can't necessarily rely on this.
This might not be what you are interested in, but log probabilities in statistical physics are closely related to the concepts of energy and entropy. For a physical system in equilibrium at temperature $T$ (in kelvin), the difference in energy between two microstates A and B is related to the logarithm of the probabilities that the system is in state A or state B:
$$E_\mathrm{A} - E_\mathrm{B} =-k_\mathrm{B}T \left[ \ln(P_\mathrm{A}) - \ln( P_\mathrm{B}) \right]$$
So, statistical physicists often work with log probabilities (or scaled versions of them), because they are physically meaningful. For example, the potential energy of a gas molecule in an atmosphere at a fixed temperature under a uniform gravitational field (a good approximation near the surface of the Earth) is $mgh$, where $m$ is the mass of the gas molecule, $g$ is the acceleration of gravity, and $h$ is the height of the molecule above the surface. The probability of finding a gas molecule on the top floor of a building versus the bottom floor (assuming the floors have the same volume and the floor-to-ceiling height is small) is given by:
$$mg (h_\mathrm{top} - h_\mathrm{bottom}) \approx -k_\mathrm{B} T \left[ \ln (P_\mathrm{top}) - \ln(P_\mathrm{bottom}) \right]$$
This probability ratio is directly related to the ratio of gas concentrations on the two floors: higher floors have a lower concentration, and the concentration of heavier molecules decays more quickly with height.
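As a rough numerical sketch (Python; the physical constants are standard, while 300 K and a 100 m height difference are arbitrary choices):

```python
import math

k_B = 1.380649e-23    # Boltzmann constant, J/K
N_A = 6.02214076e23   # Avogadro's number, 1/mol
T = 300.0             # temperature, K (arbitrary choice)
g = 9.81              # gravitational acceleration, m/s^2
dh = 100.0            # height difference, m (arbitrary choice)

# Masses of single N2 (28 g/mol) and O2 (32 g/mol) molecules, in kg
m_N2 = 0.028 / N_A
m_O2 = 0.032 / N_A

# Boltzmann factor P_top / P_bottom = exp(-m*g*dh / (k_B*T))
ratio_N2 = math.exp(-m_N2 * g * dh / (k_B * T))
ratio_O2 = math.exp(-m_O2 * g * dh / (k_B * T))
# Both ratios are just under 1, and the heavier O2 falls off faster
assert 0 < ratio_O2 < ratio_N2 < 1
```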
In statistical physics, it is often useful to switch back and forth between quantities proportional to log probabilities (energy, entropy, enthalpy, free energy) and quantities proportional to probability (number of microstates, partition function, density of states).
When you are reading a book or paper, normally you are dealing with converting one symbolic representation to another, and the details of IEEE 754 and numerical issues should not be a concern. The better justification is that the log of a probability is actually the quantity we are interested in. Here are three justifications, although I think the third is overstated.
Information content is logarithmic
For a probability $p_1$, the negative log probability $-\log(p_1)$ is the Shannon information content of the corresponding event. Shannon chose the logarithm so that if two independent events with probabilities $p_1$ and $p_2$ coincide, the information of the joint event with probability $p_3 = p_1 p_2$ is the sum of the information of the separate events, $\operatorname{information}(p_3) = \operatorname{information}(p_1) + \operatorname{information}(p_2)$. The logarithm is the only operation that satisfies this (up to a multiplicative constant).
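In Shannon's convention the information content is $-\log p$, so that it is nonnegative; a one-line check of the additivity property in Python:

```python
import math

def information(p):
    # Shannon information content in nats: -log(p)
    return -math.log(p)

p1, p2 = 0.5, 0.25
p3 = p1 * p2  # probability that both independent events occur

# Additivity: information of the joint event is the sum of the parts
assert math.isclose(information(p3), information(p1) + information(p2))
```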
Expected log-likelihood arises from weighted probabilities
In some cases, you have probabilities which are themselves weighted by a different distribution, and you wish to maximize likelihood. This is the case in the expectation-maximization algorithm, which you can also find at the basis of training variational autoencoders. At some point, you have obtained and fixed a probability for each of a set of samples, $P(x_1) = p_1, P(x_2) = p_2, \ldots, P(x_N) = p_N$, and you want to tweak the parameters that determine $P$ to maximize likelihood, but in a weighted fashion: you fix $p_1$ and use it to weight your new prediction $f_1$, doing the same for $(p_2, f_2)$ and so on. Now, instead of trying to maximize $\prod_{i=1}^{N}p_i$, we wish to maximize the weighted generalization, which in our case is $\prod_{i=1}^{N} f_i^{p_i}$. This is the correct target to maximize given that the probabilities are themselves weighted. What is interesting is that taking the log yields $$\log\left( \prod_{i=1}^{N} f_i^{p_i} \right) = \sum_{i=1}^{N}p_i \log(f_i) = \mathbb{E}_{f_i \sim P}[ \log(f_i)],$$ which I think is reason enough to be interested in the log-likelihood.
Instead of fixed probabilities, you might be doing this weighted expectation in a Monte Carlo fashion just by drawing samples according to the data distribution. From this perspective, the log of the likelihood is not some numerical trick, but truly a quantity whose average value you wish to maximize in order to maximize the weighted likelihood.
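As a tiny numerical check (Python, with made-up values for the $p_i$ and $f_i$), the weighted likelihood and the exponentiated expected log-likelihood agree:

```python
import math

# Hypothetical fixed weights p_i and new predictions f_i
p = [0.2, 0.5, 0.3]
f = [0.1, 0.6, 0.4]

# Weighted likelihood: prod f_i ** p_i ...
weighted = math.prod(fi ** pi for fi, pi in zip(f, p))

# ... equals exp of the weighted sum of logs, i.e. the expected
# log-likelihood under the weights p_i
expected_log_lik = sum(pi * math.log(fi) for fi, pi in zip(f, p))
assert math.isclose(weighted, math.exp(expected_log_lik))
```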
Mini-batch gradient descent (A poor reason?)
Say you have $N$ samples and feed a model batches of size, say, $B = 4$. Based on whatever loss you are using, you find yourself wanting to maximize the likelihood of your $N$ samples by tweaking the parameters of the model. You want to update the model after each batch, but you cannot determine the relative contribution a batch $(p_1, p_2, p_3, p_4)$ makes to the full likelihood $\prod_{i=1}^{N}p_i$ when you only know $p_1, p_2, p_3, p_4$. You can know the contributions, however, if instead you are trying to maximize $\log\left(\prod_{i=1}^{N}p_i\right)$, which equals $\sum_{i=1}^{N}\log(p_i)$.
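The additivity that makes per-batch contributions well defined is easy to demonstrate (Python sketch with arbitrary probabilities and $B = 4$):

```python
import math

probs = [0.2, 0.7, 0.1, 0.9, 0.3, 0.5, 0.8, 0.4]
log_probs = [math.log(p) for p in probs]

# Full-sample log-likelihood
full = sum(log_probs)

# Each batch of size B = 4 contributes a term that is known without
# seeing the other batches
B = 4
per_batch = [sum(log_probs[i:i + B]) for i in range(0, len(log_probs), B)]
assert math.isclose(full, sum(per_batch))
```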
The reason I think this argument is overstated is that the derivative, used for the loss gradient, separates nicely without any need for a log.
$$\frac{d}{d\theta} p_{\theta}(x_1) p_{\theta}(x_2) \cdots p_{\theta}(x_N)$$ after applying the product rule becomes $$Z \sum_{i=1}^{N} \frac{1}{p_{\theta}(x_i)} \left(\frac{d}{d\theta} p_{\theta}(x_i) \right),$$ where $Z = p_1 p_2 \cdots p_N$. This doesn't need any log operation in order to separate into a sum that can be calculated per batch.
- $\begingroup$ Related: stats.stackexchange.com/questions/87182/… $\endgroup$ – Jul 1, 2025 at 15:08