The limitations of gradient descent as a principle of brain function (arxiv.org)
119 points by mpweiher on June 10, 2018 | 15 comments


The "money" quote is right here:

> Steepest descent or gradient descent depends on a choice of ruler (or Riemannian metric) for the parameter space of interest. The Euclidean metric is rarely a natural choice, especially for spaces with dimensions that carry different physical units, such as the example shown here.

But I think that, to many ML researchers or biophysicists, this is true by what I lovingly call the NSH [0]. The argument posed (in a slightly anthropomorphized way) goes as follows:

-

We have some function, f, which we want to minimize/maximize: usually f will be the negative log of some probability distribution, or whatever. Now, hey, look, this function could be anything that's differentiable; all we know is that the minimum of this function is really useful.

But what happens if we replace this function with another function, g, whose minimum is at the same x, but where g(x + ∆) ≈ g(x) even for fairly large ∆? Then we run into a problem! Minimizing g (which is, admittedly, an unnatural representation of f) is bad because it's really hard for us to distinguish points around x from x itself (which is the point we want) using only the gradient information, since it's very close to zero. In other words, while g preserves many of the same properties as f (in fact, the only one we care about, the minimum), the trajectory of gradient descent on g is going to be mostly pretty dumb compared to what it could be on f.
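A toy illustration of this (my own, not from the paper or thread): take f(x) = x² and g(x) = x⁴. Both have their unique minimum at x = 0, but g is nearly flat around it, so plain gradient descent on g crawls once it gets close:

```python
# Toy illustration (not from the paper): f and g share the same minimizer
# (x = 0), but g's gradient vanishes much faster near it, so gradient
# descent on g makes almost no progress once it gets close.

def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_f = gradient_descent(lambda x: 2 * x,      x0=0.5)  # f(x) = x^2
x_g = gradient_descent(lambda x: 4 * x ** 3, x0=0.5)  # g(x) = x^4

print(x_f)  # essentially at the minimum
print(x_g)  # still visibly far from 0
```

After the same 100 steps, the f-iterate is numerically zero while the g-iterate is still around 0.1: the gradient alone can't tell points near the minimum of g apart.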

In the case that f is the negative log-likelihood of some empirical distribution, then the Fisher information turns out to be a natural choice for the curvature of the distribution (see Information Geometry, if you're interested in this idea) and running what is essentially a Newton method on f will yield similar trajectories to running it on g, which is what we want.
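A hedged 1-D sketch of the preconditioning point (my own toy, with the second derivative standing in for the curvature/metric): on a badly scaled objective like g(x) = x⁴, dividing the gradient by the local curvature (a scalar Newton step) restores the fast, geometric convergence that plain gradient descent loses:

```python
# Toy illustration (not from the paper): on g(x) = x^4 the raw gradient
# nearly vanishes close to the minimizer at 0, but dividing by the local
# curvature g''(x) (a 1-D Newton step, x -> (2/3) x) converges geometrically.

def grad(x):   # g'(x)  for g(x) = x^4
    return 4 * x ** 3

def curv(x):   # g''(x)
    return 12 * x ** 2

x_plain, x_newton = 0.5, 0.5
for _ in range(100):
    x_plain -= 0.1 * grad(x_plain)
    if curv(x_newton) > 1e-30:                      # avoid dividing by ~0 at the optimum
        x_newton -= grad(x_newton) / curv(x_newton)

print(x_plain, x_newton)
```

The curvature-preconditioned iterate reaches machine precision while the plain one is still far away, which is the kind of trajectory plain gradient descent would have had on a well-scaled representation.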

-

The paper makes no particularly strong argument that is of note, except that, yes, of course, "the brain" doesn't actually "run" gradient descent as stated on paper, which apparently is an assumption some neuroscience models make use of to make predictions. This is... not really a claim anyone is arguing for, as far as I know (the paper cites [12,13], but taking a quick peek, only [12] mentions this idea directly, and even there it invokes the "natural gradient" argument that the authors here propose, and only mentions, in passing, that this mechanism is "biologically plausible").

---

[0] More specifically, the No-Shit Hypothesis.


I liked this quote for giving a feeling of why the Euclidean metric for the gradient might be wrong.

https://hips.seas.harvard.edu/blog/2013/01/25/the-natural-gr...

"As a concrete example, imagine two people standing on two different mountain tops. If one of the people is Superman (or anybody else who can fly) then the distance they would fly directly to the other person is the Euclidean distance (in R^3). If both people were normal and needed to walk on the surface of the Earth the Riemannian metric tensor tells us what this distance is from the Euclidean distance. Don’t take this illustration too seriously since technically all of this needs to take place in a differential patch (a small rectangle whose side lengths go to 0)."
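A concrete version of the same picture (my numbers, not from the linked post): for two points on an idealized spherical Earth, the straight-line (Euclidean) chord through the interior is always shorter than the distance measured along the surface, and the gap is exactly what the metric on the surface encodes:

```python
import math

# Two points separated by a central angle gamma on a sphere of radius R
# (think: two mountain tops on an idealized Earth).
R = 6371.0                    # Earth radius in km (assumed for illustration)
gamma = math.radians(60)      # central angle between the two points

chord = 2 * R * math.sin(gamma / 2)   # Superman's straight-line distance in R^3
arc   = R * gamma                     # walking distance along the surface

print(chord, arc)
```

For a 60° separation the chord is about 6371 km while the surface path is about 6672 km; the Riemannian metric is what accounts for the difference.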


Note that the example they use from Rui Costa et al. 2017 (a great paper, btw; recommended reading) is not any kind of machine-learning result; it is a fit to data from previous experiments showing pre- and post-synaptic changes during short-term LTP experiments. So I'm not sure what the authors are rambling on about here.


I really can't believe anyone argues that brains can just be gradient descent. It's like people have never heard of a fitness valley or a sparsely connected graph region.


I hear the converse (neural networks can’t be like brains because they do use gradient descent) argued quite a lot.


That is also weird. Brains may use gradient descent, but that can't be the only way they learn. Gradient descent alone is almost completely incapable of escaping a local minimum.


Can't you run gradient descent at a higher level of abstraction and escape local minima? People routinely get stuck thinking in a certain way, even when an observer looking at the problem from afar can tell that they are stuck.


You can if you know what the higher level abstraction is. That kind of thing doesn't work when your learning algorithm doesn't have a programmer / mathematician to babysit it who knows more than it does.

... unless God is sitting there hacking our brains' code in real time to kick us out of local maxima. "Damn it... they're stuck on the mind/body dichotomy again... let me see what happens if I bolt a dimension reduction transform on there and then run another gradient descent on top. Shit now they're worshipping cheese... undo, undo, undo..."


I can't see a PDF here, I'm shown a 46 byte file to download.


The previous version has a PDF[0] but you're right, the current version shows none.

[0] https://arxiv.org/pdf/1805.11851v1.pdf


The paper was withdrawn with v2:

Comments: We were asked by an author of a criticized paper to withdraw the submission in order to allow them to respond to our criticism in private


Is that commonly done? I would guess this means the withdrawn paper contains an egregious error. Does anyone know?


tl;dr ordinary gradient descent is sensitive to the parametrization of the problem but natural gradient is not. This is an important fact (and one that is fairly well-known within the ML community), but it is not totally clear to me why it should be particularly relevant to neuroscience.
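A minimal sketch of that invariance (my own toy setup, not from the paper): take the loss L(μ) = ½(μ − 3)², the NLL of a unit-variance Gaussian, so the Fisher information in μ is 1, and reparametrize μ = aθ. The μ-space change produced by a plain gradient step on θ depends on a, while the Fisher-preconditioned (natural) step does not:

```python
# Toy demo (not from the paper): L(mu) = 0.5 * (mu - 3)**2 with Fisher
# information 1 in mu. Under the reparametrization mu = a * theta, the
# chain rule gives dL/dtheta = a * (mu - 3) and Fisher information a**2.

def updates(mu, a, lr=0.1):
    grad_theta = a * (mu - 3.0)          # dL/dtheta
    fisher_theta = a ** 2                # Fisher metric in theta coordinates
    plain   = a * (-lr * grad_theta)                  # mu-change from plain GD on theta
    natural = a * (-lr * grad_theta / fisher_theta)   # mu-change from natural GD
    return plain, natural

plain_1, natural_1 = updates(mu=0.0, a=1.0)
plain_5, natural_5 = updates(mu=0.0, a=5.0)

print(plain_1, plain_5)      # differ: plain GD is parametrization-dependent
print(natural_1, natural_5)  # agree: natural gradient is invariant
```

The plain steps differ by a factor of a², while the natural steps coincide (up to floating point) for any choice of a.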


Isn't it only the trajectory that is sensitive to the parametrization? AFAIK, the locations of the minima do not depend on it.


Yep, the minima are in the same locations. However, if the problem has multiple minima, then the parametrization can affect which minimum gradient descent actually reaches.
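A toy example of this (mine, not from the thread): f(x) = (x² − 1)² has minima at ±1. Running gradient descent from x₀ = 2 directly on x finds +1, but running it on a rescaled coordinate u = x/√10 (same loss, same start, same learning rate) takes effective steps ten times larger in x, overshoots past zero, and lands in the other basin:

```python
# Toy demo: f(x) = (x**2 - 1)**2 has two minima, at x = -1 and x = +1.
# Rescaling the coordinate (u = x / c, so x = c * u) multiplies the
# effective step in x by c**2, which here flips which basin we land in.

def run(c, x0=2.0, lr=0.01, steps=200):
    u = x0 / c
    for _ in range(steps):
        x = c * u
        u -= lr * (c * 4 * x * (x ** 2 - 1))   # chain rule: dF/du = c * f'(x)
    return c * u

x_direct   = run(c=1.0)        # settles into the x = +1 basin
x_rescaled = run(c=10 ** 0.5)  # effective step 10x larger: overshoots to x = -1

print(x_direct, x_rescaled)
```

Same minima, same starting point, but the reparametrization changes which minimum the descent actually reaches.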



