Linear regression: Hyperparameters
Page Summary
- Hyperparameters, such as learning rate, batch size, and epochs, are external configurations that influence the training process of a machine learning model.
- The learning rate determines the step size during model training, impacting the speed and stability of convergence.
- Batch size dictates the number of training examples processed before updating model parameters, influencing training speed and noise.
- Epochs represent the number of times the entire training dataset is used during training, affecting model performance and training time.
- Choosing appropriate hyperparameters is crucial for optimizing model training and achieving desired results.
Hyperparameters are variables that control different aspects of training. Three common hyperparameters are:
- Learning rate
- Batch size
- Epochs
In contrast, parameters are the variables, like the weights and bias, that are part of the model itself. In other words, hyperparameters are values that you control; parameters are values that the model calculates during training.
Learning rate
Learning rate is a floating point number you set that influences how quickly the model converges. If the learning rate is too low, the model can take a long time to converge. However, if the learning rate is too high, the model never converges, but instead bounces around the weights and bias that minimize the loss. The goal is to pick a learning rate that's neither too high nor too low so that the model converges quickly.
The learning rate determines the magnitude of the changes to make to the weights and bias during each step of the gradient descent process. The model multiplies the gradient by the learning rate to determine the model's parameters (weight and bias values) for the next iteration. In the third step of gradient descent, the "small amount" to move in the direction of negative slope refers to the learning rate.
The difference between the old model parameters and the new model parameters is proportional to the slope of the loss function. If the slope is large, the model takes a large step; if the slope is small, it takes a small step. For example, if the gradient's magnitude is 2.5 and the learning rate is 0.01, then the model changes the parameter by 0.025.
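To make that update rule concrete, here is a minimal sketch of a single gradient descent step in Python. The parameter and gradient values are hypothetical placeholders; the point is that each parameter moves against its gradient by an amount scaled by the learning rate:

```python
learning_rate = 0.01

# Current model parameters (hypothetical values).
weight = 0.5
bias = 0.1

# Gradients of the loss with respect to each parameter (hypothetical values).
grad_weight = 2.5
grad_bias = 1.2

# Each parameter moves in the direction of negative slope,
# scaled by the learning rate.
weight = weight - learning_rate * grad_weight  # 0.5 - 0.01 * 2.5 = 0.475
bias = bias - learning_rate * grad_bias        # 0.1 - 0.01 * 1.2 = 0.088
```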
The ideal learning rate helps the model to converge within a reasonable number of iterations. In Figure 20, the loss curve shows the model significantly improving during the first 20 iterations before beginning to converge:

Figure 20. Loss graph showing a model trained with a learning rate that converges quickly.
In contrast, a learning rate that's too small can take too many iterations to converge. In Figure 21, the loss curve shows the model making only minor improvements after each iteration:

Figure 21. Loss graph showing a model trained with a small learning rate.
A learning rate that's too large never converges because each iteration either causes the loss to bounce around or continually increase. In Figure 22, the loss curve shows the loss decreasing and then increasing after each iteration, and in Figure 23 the loss increases at later iterations:

Figure 22. Loss graph showing a model trained with a learning rate that's too big, where the loss curve fluctuates wildly, going up and down as the iterations increase.

Figure 23. Loss graph showing a model trained with a learning rate that's too big, where the loss curve drastically increases in later iterations.
Batch size
Batch size is a hyperparameter that refers to the number of examples the model processes before updating its weights and bias. You might think that the model should calculate the loss for every example in the dataset before updating the weights and bias. However, when a dataset contains hundreds of thousands or even millions of examples, using the full batch isn't practical.
Two common techniques to get the right gradient on average without needing to look at every example in the dataset before updating the weights and bias are stochastic gradient descent and mini-batch stochastic gradient descent:
Stochastic gradient descent (SGD): Stochastic gradient descent uses only a single example (a batch size of one) per iteration. Given enough iterations, SGD works but is very noisy. "Noise" refers to variations during training that cause the loss to increase rather than decrease during an iteration. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
Notice in the following image how loss slightly fluctuates as the model updates its weights and bias using SGD, which can lead to noise in the loss graph:

Figure 24. Model trained with stochastic gradient descent (SGD) showing noise in the loss curve.
Note that using stochastic gradient descent can produce noise throughout the entire loss curve, not just near convergence.
Mini-batch stochastic gradient descent (mini-batch SGD): Mini-batch stochastic gradient descent is a compromise between full-batch and SGD. For $ N $ data points, the batch size can be any number greater than 1 and less than $ N $. The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration.
Determining the number of examples for each batch depends on the dataset and the available compute resources. In general, small batch sizes behave like SGD, and larger batch sizes behave like full-batch gradient descent.

Figure 25. Model trained with mini-batch SGD.
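The following is a minimal NumPy sketch of a training loop that uses mini-batch SGD. The dataset, loss, and hyperparameter values are hypothetical placeholders; the structure simply illustrates sampling a random batch, averaging its gradients, and applying one update to the weights and bias per iteration:

```python
import numpy as np

# Hypothetical linear-regression data: 1,000 examples with one feature.
rng = np.random.default_rng(seed=42)
features = rng.normal(size=1000)
labels = 3.0 * features + 2.0 + rng.normal(scale=0.5, size=1000)

weight, bias = 0.0, 0.0
learning_rate = 0.1
batch_size = 100       # mini-batch: greater than 1 and less than N
num_iterations = 200

for _ in range(num_iterations):
    # Choose a random mini-batch of examples.
    idx = rng.choice(len(features), size=batch_size, replace=False)
    x, y = features[idx], labels[idx]

    # Average the gradients of the squared-error loss over the batch.
    predictions = weight * x + bias
    error = predictions - y
    grad_weight = np.mean(2 * error * x)
    grad_bias = np.mean(2 * error)

    # One update of the weights and bias per iteration.
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias
```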
When training a model, you might think that noise is an undesirable characteristic that should be eliminated. However, a certain amount of noise can be a good thing. In later modules, you'll learn how noise can help a model generalize better and find the optimal weights and bias in a neural network.
Epochs
During training, an epoch means that the model has processed every example in the training set once. For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it will take the model 10 iterations to complete one epoch.
Training typically requires many epochs. That is, the system needs to processevery example in the training set multiple times.
The number of epochs is a hyperparameter you set before the model begins training. In many cases, you'll need to experiment with how many epochs it takes for the model to converge. In general, more epochs produce a better model, but training also takes more time.

Figure 26. Full batch versus mini batch.
The following table describes how batch size and epochs relate to the number of times a model updates its parameters.
| Batch type | When weights and bias updates occur |
|---|---|
| Full batch | After the model looks at all the examples in the dataset. For instance, if a dataset contains 1,000 examples and the model trains for 20 epochs, the model updates the weights and bias 20 times, once per epoch. |
| Stochastic gradient descent | After the model looks at a single example from the dataset. For instance, if a dataset contains 1,000 examples and the model trains for 20 epochs, the model updates the weights and bias 20,000 times. |
| Mini-batch stochastic gradient descent | After the model looks at the examples in each batch. For instance, if a dataset contains 1,000 examples, and the batch size is 100, and the model trains for 20 epochs, the model updates the weights and bias 200 times. |
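As a quick sanity check on those counts, here's a small sketch that computes the number of parameter updates for each batch type, using the same hypothetical 1,000-example dataset, batch size of 100, and 20 epochs from the table:

```python
num_examples = 1000
num_epochs = 20
mini_batch_size = 100

# Full batch: one update per epoch.
full_batch_updates = num_epochs                                       # 20

# SGD: one update per example, every epoch.
sgd_updates = num_examples * num_epochs                               # 20,000

# Mini-batch SGD: one update per batch, every epoch.
mini_batch_updates = (num_examples // mini_batch_size) * num_epochs   # 200

print(full_batch_updates, sgd_updates, mini_batch_updates)
```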