
In this answer to the question *Is an optimization algorithm equivalent to a neural network?*, the author states that, in theory, there is some recurrent neural network that implements a given optimization algorithm.

If so, then can we optimize the optimization algorithm?

asked Jul 23, 2019 at 22:02 by Dimer
  • Here is a related question. – Commented Dec 12, 2021 at 9:11

3 Answers


First, you need to consider what the "parameters" of this "optimization algorithm" that you want to "optimize" are. Let's take the simplest case: SGD without momentum. The update rule for this optimizer is:

$$w_{t+1} \leftarrow w_{t} - a \cdot \nabla_{w_{t}} J(w_t) = w_{t} - a \cdot g_t$$

where $w_t$ are the weights at iteration $t$, $J$ is the cost function, $g_t = \nabla_{w_{t}} J(w_t)$ are the gradients of the cost function w.r.t. $w_t$, and $a$ is the learning rate.

An optimization algorithm accepts as its input the weights and their gradients and returns the update. So we could write the above equation as:

$$w_{t+1} \leftarrow w_{t} - SGD(w_t, g_t)$$
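As a minimal sketch of this view (the function name `sgd_update` and the toy quadratic cost are my own, not from any library):

```python
import numpy as np

def sgd_update(w, g, a=0.1):
    """The 'optimizer' as a function: maps (weights, gradients) to the update a * g."""
    return a * g

# Toy quadratic cost J(w) = 0.5 * ||w||^2, whose gradient at w is w itself.
w = np.array([3.0, -2.0])
for _ in range(100):
    g = w                       # g_t = grad J(w_t)
    w = w - sgd_update(w, g)    # w_{t+1} = w_t - SGD(w_t, g_t)
```

Any other optimizer (Adam, RMSprop, ...) would slot into the same interface: it consumes $(w_t, g_t)$ plus its own internal state and returns the update.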

The same is true for all optimization algorithms (e.g. Adam, RMSprop, etc.). Now, our initial question was what the *parameters* of the optimizer are, i.e. the ones we want to optimize. In the simple case of SGD, the sole parameter of the optimizer is the *learning rate*.

The question that arises at this point is: *can we optimize the learning rate of the optimizer during training?* Or, more practically, can we compute this derivative?

$$\frac{\partial J(w_t)}{\partial a}$$

This idea was explored in this paper, where they coin this technique "hypergradient descent". I suggest you take a look.
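A rough sketch of the idea (my own simplified version, not the paper's code): since $w_t = w_{t-1} - a \, g_{t-1}$, the chain rule gives $\frac{\partial J(w_t)}{\partial a} = g_t \cdot (-g_{t-1})$, so the learning rate can itself be updated by gradient descent with a "hyper-learning-rate" $\beta$; all names and the toy problem below are mine:

```python
import numpy as np

def hypergradient_sgd(grad, w, a=0.01, beta=1e-4, steps=200):
    """SGD whose learning rate a is itself adapted by gradient descent.

    Since w_t = w_{t-1} - a * g_{t-1}, the chain rule gives
    dJ(w_t)/da = grad(w_t) . (-g_{t-1}) = -(g_t . g_{t-1}),
    so a is updated as a <- a + beta * (g_t . g_{t-1}).
    """
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        a += beta * np.dot(g, g_prev)   # hypergradient step on the learning rate
        w = w - a * g                   # ordinary SGD step with the adapted a
        g_prev = g
    return w, a

# Toy quadratic J(w) = 0.5 * ||w||^2, whose gradient is w.
w, a = hypergradient_sgd(lambda w: w, np.array([3.0, -2.0]))
```

Intuitively, if consecutive gradients point the same way, the learning rate grows; if they oscillate, it shrinks.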

answered Jul 23, 2019 at 22:28 by Djib2011
  • In particular, one would expect an 'optimised algorithm' not to need any external parameters to be chosen. – Commented Jul 28, 2019 at 12:13
  • It would be nice if you at least briefly described the idea of computing the derivative with respect to the learning rate proposed in that paper. I suppose that the loss function is also a function of the learning rate, and that's how they do it; otherwise, right now, I'm not seeing how they could do that. – Commented Jan 25, 2021 at 22:56

We usually optimize *with respect to* something. For example, you can train a neural network to locate cats in an image. This operation of locating cats in an image can be thought of as a function: given an image, a neural network can be trained to return the position of the cat in the image. In this sense, we can optimize a neural network with respect to this task.

However, if a neural network represents an optimization algorithm, then, if you change it a little, it will no longer be the same optimization algorithm: it might be another optimization algorithm or some other algorithm entirely.

For example, most optimization algorithms used to train neural networks (like Adam) are variations of gradient descent (GD). If you think that Adam performs better than GD, then you could say that Adam is an optimization of GD. So, Adam performs better than GD with respect to something. Possibly, GD also performs better than Adam with respect to something else. Of course, this is a bit of a stretch.

answered Jul 23, 2019 at 22:29 by nbro

It does not seem very useful to apply a local-minimum search (such as SGD) to another local-minimum search. Existing successful solutions combine global-minimum search techniques with local-minimum search.

For example, it's beneficial to combine simulated annealing with SGD to optimize its learning rate and/or Nesterov momentum. In this case, you don't even need to spawn a population of SGD optimizers. But you can also try population-based algorithms like evolutionary programming.
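As an illustrative sketch of that combination (all names, constants, and the toy quadratic problem are my own assumptions, not from any cited work): an outer simulated-annealing loop proposes learning rates, and each candidate is scored by the loss a short inner SGD run reaches with it.

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_sgd_loss(a, steps=50):
    """Short SGD run with learning rate a on J(w) = 0.5 * ||w||^2; return final loss."""
    w = np.array([3.0, -2.0])
    for _ in range(steps):
        w = w - a * w              # gradient of J(w) is w itself
    return 0.5 * float(w @ w)

# Outer loop: simulated annealing over the learning rate a.
a = best_a = 0.01
loss = best_loss = inner_sgd_loss(a)
temp = 1.0
for _ in range(200):
    cand = abs(a + rng.normal(scale=0.05))     # perturb the current learning rate
    cand_loss = inner_sgd_loss(cand)
    # Always accept improvements; accept worse moves with probability exp(-dL/T).
    if cand_loss < loss or rng.random() < np.exp(-(cand_loss - loss) / temp):
        a, loss = cand, cand_loss
    if loss < best_loss:
        best_a, best_loss = a, loss
    temp *= 0.98                               # cool the temperature
```

The outer loop never needs gradients of the learning rate, which is what makes a global technique like annealing a natural fit for tuning the optimizer itself.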

The idea of optimizing optimizers is very curious, but it's rather more useful to try it with global optimization algorithms.

answered May 13, 2020 at 19:48 by Bogdan Ruzhitskiy
  • Could you please provide a link to a research work/paper where this has been done: "it's beneficial to combine simulated annealing with SGD"? – Commented Jan 25, 2021 at 22:54
