Linear regression: Loss

  • Loss is a numerical value indicating the difference between a model's predictions and the actual values.

  • The goal of model training is to minimize loss, bringing it as close to zero as possible.

  • Two common methods for calculating loss are Mean Absolute Error (MAE) and Mean Squared Error (MSE), which differ in their sensitivity to outliers.

  • Choosing between MAE and MSE depends on the dataset and how you want the model to handle outliers, with MSE penalizing them more heavily.

Loss is a numerical metric that describes how wrong a model's predictions are. Loss measures the distance between the model's predictions and the actual labels. The goal of training a model is to minimize the loss, reducing it to its lowest possible value.

In the following image, you can visualize loss as arrows drawn from the data points to the model. The arrows show how far the model's predictions are from the actual values.

Figure 8. Loss lines connect the data points to the model.

Figure 8. Loss is measured from the actual value to the predicted value.

Distance of loss

In statistics and machine learning, loss measures the difference between the predicted and actual values. Loss focuses on the distance between the values, not the direction. For example, if a model predicts 2, but the actual value is 5, we don't care that the difference is negative (2 - 5 = -3). Instead, we care that the distance between the values is 3. Thus, all methods for calculating loss remove the sign.

The two most common methods to remove the sign are the following:

  • Take the absolute value of the difference between the actual value and the prediction.
  • Square the difference between the actual value and the prediction.
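
As a minimal sketch of these two approaches, the following Python snippet uses the prediction of 2 and the actual value of 5 from the example above. The variable names are illustrative and don't come from the course material.

```python
# Prediction and actual value from the example above.
prediction = 2
actual = 5

# The raw difference keeps its sign, which depends on the subtraction order.
difference = prediction - actual

# Two ways to remove the sign.
absolute_error = abs(difference)   # distance between the values
squared_error = difference ** 2    # squared distance between the values

print(difference, absolute_error, squared_error)
```

Either result is the same if you subtract in the opposite order (actual minus prediction), which is why the order doesn't matter once the sign is removed.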

Types of loss

In linear regression, there are five main types of loss, which are outlined in the following table.

Loss type | Definition | Equation
--- | --- | ---
L1 loss | The sum of the absolute values of the difference between the predicted values and the actual values. | $ \sum \lvert actual\ value - predicted\ value \rvert $
Mean absolute error (MAE) | The average of L1 losses across a set of N examples. | $ \frac{1}{N} \sum \lvert actual\ value - predicted\ value \rvert $
L2 loss | The sum of the squared difference between the predicted values and the actual values. | $ \sum (actual\ value - predicted\ value)^2 $
Mean squared error (MSE) | The average of L2 losses across a set of N examples. | $ \frac{1}{N} \sum (actual\ value - predicted\ value)^2 $
Root mean squared error (RMSE) | The square root of the mean squared error (MSE). | $ \sqrt{\frac{1}{N} \sum (actual\ value - predicted\ value)^2} $
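
The five definitions in the table can be sketched as plain Python functions. This is an illustrative sketch, not code from the course: the function names and the sample values are assumptions chosen for the example.

```python
import math

def l1_loss(actual, predicted):
    """Sum of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))

def mae(actual, predicted):
    """L1 loss averaged over the N examples."""
    return l1_loss(actual, predicted) / len(actual)

def l2_loss(actual, predicted):
    """Sum of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def mse(actual, predicted):
    """L2 loss averaged over the N examples."""
    return l2_loss(actual, predicted) / len(actual)

def rmse(actual, predicted):
    """Square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

# Made-up values for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]

for fn in (l1_loss, mae, l2_loss, mse, rmse):
    print(fn.__name__, fn(actual, predicted))
```

Each averaged metric is built on the corresponding summed loss, which mirrors how the table defines MAE in terms of L1 loss and MSE in terms of L2 loss.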

The functional difference between L1 loss and L2 loss (or between MAE/RMSE and MSE) is squaring. When the difference between the prediction and label is large, squaring makes the loss even larger. When the difference is small (less than 1), squaring makes the loss even smaller.
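
To see the effect of squaring on small and large differences, here is a short sketch with made-up difference values (they are not taken from any dataset in this course):

```python
# Illustrative differences between a prediction and its label (made-up values).
differences = [0.1, 0.5, 1.0, 2.0, 10.0]

# Squaring shrinks differences below 1 and enlarges differences above 1.
squared = [d ** 2 for d in differences]

for d, s in zip(differences, squared):
    print(f"difference = {d:>5}  squared = {s:.2f}")
```

A difference of exactly 1 is the crossover point: squaring leaves it unchanged.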

Loss metrics like MAE and RMSE may be preferable to L2 loss or MSE in some use cases because they tend to be more human-interpretable, as they measure error using the same scale as the model's predicted value.

Note: MAE and RMSE can differ quite widely. MAE represents the average prediction error, whereas RMSE represents the "spread" of the errors, and is more skewed by larger errors.
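
The following sketch illustrates the note above with two made-up sets of errors that share the same MAE but have very different RMSE values. The error values and helper names are assumptions chosen for the example.

```python
import math

def mae_of(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse_of(errors):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Made-up errors: both sets have the same total absolute error.
even_errors = [4, 4, 4, 4, 4]      # every prediction is off by the same amount
skewed_errors = [1, 1, 1, 1, 16]   # one prediction is off by a lot

print("even:  ", mae_of(even_errors), rmse_of(even_errors))
print("skewed:", mae_of(skewed_errors), rmse_of(skewed_errors))
```

When every error has the same size, MAE and RMSE agree. The single large error in the second set leaves MAE unchanged but pushes RMSE well above it.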

When processing multiple examples at once, we recommend averaging the losses across all the examples, whether using MAE, MSE, or RMSE.

Calculating loss example

In the previous section, we created the following model to predict fuel efficiency based on car heaviness:

  • Model: $ y' = 34 + (-4.6)(x_1) $
    • Weight: $ -4.6 $
    • Bias: $ 34 $

If the model predicts that a 2,370-pound car gets 23.1 miles per gallon, but it actually gets 24 miles per gallon, we would calculate the L2 loss as follows:

Note: The formula uses 2.37 because the graphs are scaled to 1000s of pounds.

Value | Equation | Result
--- | --- | ---
Prediction | $\small{bias + (weight * feature\ value)}$ = $\small{34 + (-4.6*2.37)}$ | $\small{23.1}$
Actual value | $ \small{ label } $ | $ \small{ 24 } $
L2 loss | $ \small{ (actual\ value - predicted\ value)^2 }$ = $\small{ (24 - 23.1)^2 }$ | $\small{0.81}$

In this example, the L2 loss for that single data point is 0.81.
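
The same calculation can be reproduced in a few lines of Python. This is a sketch with illustrative variable names. Note that the worked example rounds the prediction to 23.1 before computing the loss, so the sketch shows both the rounded and the unrounded result.

```python
# Model parameters from the example above.
bias = 34
weight = -4.6

feature_value = 2.37   # 2,370 pounds, expressed in thousands of pounds
label = 24             # actual miles per gallon

# Prediction: bias + (weight * feature value), roughly 23.098.
prediction = bias + weight * feature_value

# L2 loss using the prediction rounded to one decimal place, as in the table.
l2_rounded = (label - round(prediction, 1)) ** 2

# L2 loss using the unrounded prediction, which is slightly larger.
l2_unrounded = (label - prediction) ** 2

print(f"prediction = {prediction:.3f}")
print(f"L2 loss (rounded prediction)   = {l2_rounded:.2f}")
print(f"L2 loss (unrounded prediction) = {l2_unrounded:.4f}")
```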

Choosing a loss

Deciding whether to use MAE or MSE can depend on the dataset and the way you want to handle certain predictions. Most feature values in a dataset typically fall within a distinct range. For example, cars are normally between 2,000 and 5,000 pounds and get between 8 and 50 miles per gallon. An 8,000-pound car, or a car that gets 100 miles per gallon, is outside the typical range and would be considered an outlier.

An outlier can also refer to how far off a model's predictions are from the real values. For instance, 3,000 pounds is within the typical car-weight range, and 40 miles per gallon is within the typical fuel-efficiency range. However, a 3,000-pound car that gets 40 miles per gallon would be an outlier in terms of the model's prediction because the model would predict that a 3,000-pound car would get around 20 miles per gallon.
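
As a quick check of the "around 20 miles per gallon" figure, the sketch below plugs a 3,000-pound car into the model from the previous section (bias 34, weight -4.6). The variable names are illustrative.

```python
# Model parameters from the earlier fuel-efficiency example.
bias = 34
weight = -4.6

car_weight_thousands = 3.0   # a 3,000-pound car, in thousands of pounds
actual_mpg = 40              # the hypothetical car's real fuel efficiency

# The model's prediction for this car, roughly 20.2 miles per gallon.
predicted_mpg = bias + weight * car_weight_thousands

# How far the prediction is from the actual value, roughly 19.8.
absolute_error = abs(actual_mpg - predicted_mpg)

print(f"predicted = {predicted_mpg:.1f} mpg, off by {absolute_error:.1f} mpg")
```

Both the weight and the fuel efficiency are ordinary on their own. It is the gap of nearly 20 miles per gallon between prediction and label that makes this example an outlier for the model.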

When choosing the best loss function, consider how you want the model to treat outliers. For instance, MSE moves the model more toward the outliers, while MAE doesn't. L2 loss incurs a much higher penalty for an outlier than L1 loss. For example, the following images show a model trained using MAE and a model trained using MSE. The red line represents a fully trained model that will be used to make predictions. The outliers are closer to the model trained with MSE than to the model trained with MAE.

Figure 9. The model is tilted more toward the outliers.

Figure 9. MSE loss moves the model closer to the outliers.

Figure 10. The model is tilted further away from the outliers.

Figure 10. MAE loss keeps the model farther from the outliers.

Note the relationship between the model and the data:

  • MSE. The model is closer to the outliers but further away from most of the other data points.

  • MAE. The model is further away from the outliers but closer to most of the other data points.
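
The trade-off above can be sketched numerically by scoring two candidate lines on a tiny made-up dataset with one outlier. This is an illustration under stated assumptions, not the training procedure used for the figures: the data, the helper names, and the choice of candidate lines are all hypothetical. One line follows the four typical points and ignores the outlier. The other is the ordinary least-squares line, which by definition minimizes MSE on the data.

```python
def predict(x, slope, intercept):
    return slope * x + intercept

def mae(xs, ys, slope, intercept):
    """Mean absolute error of a line on the data."""
    return sum(abs(y - predict(x, slope, intercept)) for x, y in zip(xs, ys)) / len(xs)

def mse(xs, ys, slope, intercept):
    """Mean squared error of a line on the data."""
    return sum((y - predict(x, slope, intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def least_squares_line(xs, ys):
    """Closed-form slope and intercept that minimize MSE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data: the first four points follow y = 2x; the last is an outlier.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 20.0]

trend_line = (2.0, 0.0)                 # passes through the four typical points
mse_line = least_squares_line(xs, ys)   # tilts toward the outlier

for name, (slope, intercept) in [("trend line", trend_line), ("least-squares line", mse_line)]:
    print(f"{name}: MAE = {mae(xs, ys, slope, intercept):.2f}, "
          f"MSE = {mse(xs, ys, slope, intercept):.2f}")
```

On this data, MAE scores the line through the typical points as the better of the two, while MSE scores the line tilted toward the outlier as better, which matches the pattern described in the list above.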


Check Your Understanding

Consider the following two plots of a linear model fit to a dataset:

Left plot: A plot of 10 points. A line runs through 6 of the points. 2 points are 1 unit above the line; 2 other points are 1 unit below the line.

Right plot: A plot of 10 points. A line runs through 8 of the points. 1 point is 2 units above the line; 1 other point is 2 units below the line.

Which of the two linear models shown in the preceding plots has the higher Mean Squared Error (MSE) when evaluated on the plotted data points?

  • The model on the left. The six examples on the line incur a total loss of 0. The four examples not on the line are not very far off the line, so even squaring their offset still yields a low value: $MSE = \frac{0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 0^2} {10} = 0.4$

  • The model on the right (the correct answer). The eight examples on the line incur a total loss of 0. However, although only two points lie off the line, both of those points are twice as far off the line as the outlier points in the left figure. Squared loss amplifies those differences, so an offset of two incurs a loss four times as great as an offset of one: $MSE = \frac{0^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} {10} = 0.8$
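
The two MSE calculations can be checked with a short sketch. The offset lists below encode each point's distance from the line as described for the two plots. The helper name is illustrative.

```python
def mse_from_offsets(offsets):
    """Mean of the squared offsets between each point and the line."""
    return sum(o ** 2 for o in offsets) / len(offsets)

# Distance of each of the 10 points from the line in each plot.
left_offsets = [0, 1, 0, 1, 0, 1, 0, 1, 0, 0]    # four points, 1 unit off
right_offsets = [0, 0, 0, 2, 0, 0, 0, 2, 0, 0]   # two points, 2 units off

left_mse = mse_from_offsets(left_offsets)
right_mse = mse_from_offsets(right_offsets)

print(f"left MSE = {left_mse}, right MSE = {right_mse}")
```

Both plots have the same total absolute offset (4 units), so the difference in MSE comes entirely from squaring the larger offsets in the right plot.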

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-01-05 UTC.