Overfitting: L2 regularization

  • L2 regularization is a technique used to reduce model complexity and prevent overfitting by penalizing large weights.

  • A regularization rate (lambda) controls the strength of regularization, with higher values leading to simpler models and lower values increasing the risk of overfitting.

  • Early stopping is an alternative regularization method that involves ending training before the model fully converges to prevent overfitting.

  • Finding the right balance between learning rate and regularization rate is crucial for optimal model performance, as they influence weights in opposite directions.

L2 regularization is a popular regularization metric, which uses the following formula:

$$L_2\text{ regularization} = w_1^2 + w_2^2 + \dots + w_n^2$$

For example, the following table shows the calculation of L2 regularization for a model with six weights:

Weight    Value    Squared value
w1         0.2      0.04
w2        -0.5      0.25
w3         5.0     25.0
w4        -1.2      1.44
w5         0.3      0.09
w6        -0.1      0.01
                   26.83 = total
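
The same calculation can be reproduced in a few lines of Python. The snippet below is only an illustrative sketch of the formula above, not code from any particular library:

```python
# Weights from the preceding table.
weights = [0.2, -0.5, 5.0, -1.2, 0.3, -0.1]

# L2 regularization: the sum of the squared weights.
l2_penalty = sum(w ** 2 for w in weights)
print(f"L2 penalty: {l2_penalty:.2f}")  # 26.83

# Share of the total penalty contributed by each weight.
for i, w in enumerate(weights, start=1):
    print(f"w{i}: {w ** 2 / l2_penalty:.1%}")
```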

Notice that weights close to zero don't affect L2 regularization much, but large weights can have a huge impact. For example, in the preceding calculation:

  • A single weight (w3) contributes about 93% of the total complexity.
  • The other five weights collectively contribute only about 7% of the total complexity.

L2 regularization encourages weights toward 0, but never pushes weights all the way to zero.

Exercises: Check your understanding

Question 1. If you use L2 regularization while training a model, what will typically happen to the overall complexity of the model?

  • The overall complexity of the model will probably drop.
    Correct. Since L2 regularization encourages weights towards 0, the overall complexity will probably drop.
  • The overall complexity of the model will probably stay constant.
    This is very unlikely.
  • The overall complexity of the model will probably increase.
    This is unlikely. Remember that L2 regularization encourages weights towards 0.

Question 2. True or false: If you use L2 regularization while training a model, some features will be removed from the model.

  • True
    Incorrect. Although L2 regularization may make some weights very small, it will never push any weights all the way to zero. Consequently, all features will still contribute something to the model.
  • False
    Correct. L2 regularization never pushes weights all the way to zero.

Regularization rate (lambda)

As noted, training attempts to minimize some combination of loss and complexity:

$$\text{minimize(loss} + \text{ complexity)}$$

Model developers tune the overall impact of complexity on model training by multiplying its value by a scalar called the regularization rate. The Greek character lambda typically symbolizes the regularization rate.

That is, model developers aim to do the following:

$$\text{minimize(loss} + \lambda \text{ complexity)}$$
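
For concreteness, here is a minimal Python sketch of that objective, assuming a squared-error loss; the function and parameter names are illustrative rather than taken from any particular library:

```python
import numpy as np

def regularized_loss(y_true, y_pred, weights, regularization_rate):
    """Illustrative objective: data loss plus lambda times the L2 complexity term."""
    loss = np.mean((y_true - y_pred) ** 2)       # data loss (squared error)
    complexity = np.sum(np.square(weights))      # L2 regularization term
    return loss + regularization_rate * complexity

# Example: with zero data loss, the penalty scales with the regularization rate.
w = np.array([0.2, -0.5, 5.0])
regularized_loss(np.zeros(3), np.zeros(3), w, regularization_rate=0.1)  # 2.529
```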

A high regularization rate:

  • Strengthens the influence of regularization, thereby reducing the chances of overfitting.
  • Tends to produce a histogram of model weights having the following characteristics:
    • a normal distribution
    • a mean weight of 0

A low regularization rate:

  • Lowers the influence of regularization, thereby increasing the chances of overfitting.
  • Tends to produce a histogram of model weights with a flat distribution.

For example, the histogram of model weights for a high regularization rate might look as shown in Figure 18.

Figure 18. Weight histogram for a high regularization rate: a normal distribution with a mean of zero.

In contrast, a low regularization rate tends to yield a flatter histogram, as shown in Figure 19.

Figure 19. Weight histogram for a low regularization rate: a flatter distribution, somewhere between a flat distribution and a normal distribution. The mean may or may not be zero.

Note: Setting the regularization rate to zero removes regularization completely. In this case, training focuses exclusively on minimizing loss, which poses the highest possible overfitting risk.

Picking the regularization rate

The ideal regularization rate produces a model that generalizes well to new, previously unseen data. Unfortunately, that ideal value is data-dependent, so you must do some tuning.
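
One common approach is to train at several candidate rates and keep whichever one generalizes best on a validation set. The sketch below uses scikit-learn's Ridge regression, whose alpha parameter plays the role of lambda; the data is purely synthetic and only meant to illustrate the loop:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, purely for illustration.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 6))
true_weights = np.array([0.2, -0.5, 5.0, -1.2, 0.3, -0.1])
y = X @ true_weights + rng.normal(scale=0.5, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Try several candidate regularization rates and keep the one with the
# lowest validation loss.
best_lambda, best_loss = None, float("inf")
for lam in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    val_loss = mean_squared_error(y_val, model.predict(X_val))
    if val_loss < best_loss:
        best_lambda, best_loss = lam, val_loss

print(f"Best regularization rate: {best_lambda}")
```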

Early stopping: an alternative to complexity-based regularization

Early stopping is a regularization method that doesn't involve a calculation of complexity. Instead, early stopping simply means ending training before the model fully converges. For example, you end training when the loss curve for the validation set starts to increase (slope becomes positive).

Although early stopping usually increases training loss, it can decrease test loss.

Early stopping is a quick, but rarely optimal, form of regularization. The resulting model is very unlikely to be as good as a model trained thoroughly on the ideal regularization rate.
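
As an illustration, the following Keras sketch monitors validation loss and stops training once it stops improving; the architecture and synthetic data are arbitrary placeholders:

```python
import numpy as np
import tensorflow as tf

# Synthetic regression data, purely for illustration.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 10)).astype("float32")
y = (X @ rng.normal(size=10) + rng.normal(scale=0.1, size=1000)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop when the validation loss hasn't improved for 5 epochs, and roll the
# weights back to the best epoch seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)

model.fit(X, y, validation_split=0.2, epochs=200,
          callbacks=[early_stopping], verbose=0)
```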

Finding equilibrium between learning rate and regularization rate

Learning rate and regularization rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero; a high regularization rate pulls weights towards zero.

If the regularization rate is high with respect to the learning rate, the weak weights tend to produce a model that makes poor predictions. Conversely, if the learning rate is high with respect to the regularization rate, the strong weights tend to produce an overfit model.

Your goal is to find the equilibrium between learning rate and regularization rate. This can be challenging. Worst of all, once you find that elusive balance, you may have to ultimately change the learning rate. And, when you change the learning rate, you'll again have to find the ideal regularization rate.

