
In this answer to the question *Is an optimization algorithm equivalent to a neural network?*, the author states that, in theory, there is some recurrent neural network that implements a given optimization algorithm.

If so, then can we optimize the optimization algorithm?

asked Jul 23, 2019 at 22:02 by Dimer
  • Here is a related question. – Commented Dec 12, 2021 at 9:11

3 Answers


First, you need to consider what the "parameters" of this "optimization algorithm" that you want to "optimize" are. Let's take the simplest case: SGD without momentum. The update rule for this optimizer is:

$$w_{t+1} \leftarrow w_{t} - a \cdot \nabla_{w_{t}} J(w_t) = w_{t} - a \cdot g_t$$

where $w_t$ are the weights at iteration $t$, $J$ is the cost function, $g_t = \nabla_{w_{t}} J(w_t)$ are the gradients of the cost function w.r.t. $w_t$, and $a$ is the learning rate.

An optimization algorithm accepts as its input the weights and their gradients and returns the update. So we could write the above equation as:

$$w_{t+1} \leftarrow w_{t} - SGD(w_t, g_t)$$
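As a minimal sketch of this view (the function name `sgd_update` and the toy quadratic cost are my own, not from any library):

```python
import numpy as np

def sgd_update(w, g, a=0.1):
    """The 'optimizer' as a function: maps (weights, gradients) to the update a * g."""
    return a * g

# Toy quadratic cost J(w) = 0.5 * ||w||^2, whose gradient at w is w itself.
w = np.array([3.0, -2.0])
for _ in range(100):
    g = w                       # g_t = grad J(w_t)
    w = w - sgd_update(w, g)    # w_{t+1} = w_t - SGD(w_t, g_t)
```

Any other optimizer (Adam, RMSprop, ...) would slot into the same interface: it consumes $(w_t, g_t)$ plus its own internal state and returns the update.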

The same is true for all optimization algorithms (e.g. Adam, RMSprop, etc.). Now, our initial question was what the *parameters* of the optimizer are, i.e. the ones we want to optimize. In the simple case of SGD, the sole parameter of the optimizer is the *learning rate*.

The question that arises at this point is: *can we optimize the learning rate of the optimizer during training?* Or, more practically, can we compute this derivative?

$$\frac{\partial J(w_t)}{\partial a}$$

This idea was explored in this paper, where they coin this technique "hypergradient descent". I suggest you take a look.
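A rough sketch of the idea (my own simplified version, not the paper's code): since $w_t = w_{t-1} - a \, g_{t-1}$, the chain rule gives $\frac{\partial J(w_t)}{\partial a} = g_t \cdot (-g_{t-1})$, so the learning rate can itself be updated by gradient descent with a "hyper-learning-rate" $\beta$; all names and the toy problem below are mine:

```python
import numpy as np

def hypergradient_sgd(grad, w, a=0.01, beta=1e-4, steps=200):
    """SGD whose learning rate a is itself adapted by gradient descent.

    Since w_t = w_{t-1} - a * g_{t-1}, the chain rule gives
    dJ(w_t)/da = grad(w_t) . (-g_{t-1}) = -(g_t . g_{t-1}),
    so a is updated as a <- a + beta * (g_t . g_{t-1}).
    """
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        a += beta * np.dot(g, g_prev)   # hypergradient step on the learning rate
        w = w - a * g                   # ordinary SGD step with the adapted a
        g_prev = g
    return w, a

# Toy quadratic J(w) = 0.5 * ||w||^2, whose gradient is w.
w, a = hypergradient_sgd(lambda w: w, np.array([3.0, -2.0]))
```

Intuitively, if consecutive gradients point the same way, the learning rate grows; if they oscillate, it shrinks.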

answered Jul 23, 2019 at 22:28 by Djib2011
  • In particular, one would expect an 'optimised algorithm' not to need any external parameters to be chosen. – Commented Jul 28, 2019 at 12:13
  • It would be nice if you at least briefly described the idea of computing the derivative with respect to the learning rate proposed in that paper. I suppose that the loss function is also a function of the learning rate, and that's how they do it; otherwise, right now, I'm not seeing how they could do that. – Commented Jan 25, 2021 at 22:56

We usually optimize *with respect to* something. For example, you can train a neural network to locate cats in an image. This operation of locating cats in an image can be thought of as a function: given an image, a neural network can be trained to return the position of the cat in the image. In this sense, we can optimize a neural network with respect to this task.

However, if a neural network represents an optimization algorithm, then, if you change it a little, it will no longer be the same optimization algorithm: it might be another optimization algorithm or some other algorithm entirely.

For example, most optimization algorithms used to train neural networks (like Adam) are variations of gradient descent (GD). If you think that Adam performs better than GD, then you could say that Adam is an optimization of GD. So, Adam performs better than GD with respect to something. Possibly, GD also performs better than Adam with respect to something else. Of course, this is a bit of a stretch.

answered Jul 23, 2019 at 22:29 by nbro

It does not seem very useful to apply a local-minimum search (such as SGD) to another local-minimum search. Existing successful solutions combine global-minimum search techniques with local-minimum search.

For example, it's beneficial to combine simulated annealing with SGD to optimize its learning rate and/or Nesterov momentum. In this case, you don't even need to spawn a population of SGD optimizers. But you can also try population-based algorithms like evolutionary programming.
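As an illustrative sketch of that combination (all names, constants, and the toy quadratic problem are my own assumptions, not from any cited work): an outer simulated-annealing loop proposes learning rates, and each candidate is scored by the loss a short inner SGD run reaches with it.

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_sgd_loss(a, steps=50):
    """Short SGD run with learning rate a on J(w) = 0.5 * ||w||^2; return final loss."""
    w = np.array([3.0, -2.0])
    for _ in range(steps):
        w = w - a * w              # gradient of J(w) is w itself
    return 0.5 * float(w @ w)

# Outer loop: simulated annealing over the learning rate a.
a = best_a = 0.01
loss = best_loss = inner_sgd_loss(a)
temp = 1.0
for _ in range(200):
    cand = abs(a + rng.normal(scale=0.05))     # perturb the current learning rate
    cand_loss = inner_sgd_loss(cand)
    # Always accept improvements; accept worse moves with probability exp(-dL/T).
    if cand_loss < loss or rng.random() < np.exp(-(cand_loss - loss) / temp):
        a, loss = cand, cand_loss
    if loss < best_loss:
        best_a, best_loss = a, loss
    temp *= 0.98                               # cool the temperature
```

The outer loop never needs gradients of the learning rate, which is what makes a global technique like annealing a natural fit for tuning the optimizer itself.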

The idea of optimizing optimizers is very curious, but it's rather more useful to try it with global optimization algorithms.

answered May 13, 2020 at 19:48 by Bogdan Ruzhitskiy
  • Could you please provide a link to a research work/paper where this has been done: "it's beneficial to combine simulated annealing with SGD"? – Commented Jan 25, 2021 at 22:54
