Recent Advances in Non-Convex Optimization for Deep Learning

This post summarizes recent advances in non-convex optimization for deep learning, discussing the optimality of local minima for several models, the issue of saddle points, and modifications to stochastic gradient descent that are robust to saddle points.

It is well known that the objective function of a neural network is highly non-convex (it may be convex with respect to the weights of a single layer, such as the output layer, with the other layers held fixed, but it is not jointly convex in all the weights). Thus, with gradient-based or second-order methods, we are not guaranteed to converge to the global minimum. So how good is the critical/stationary point (i.e. a point where the derivatives of the loss function with respect to the parameters/weights are zero) that we converge to? Is it a “good enough” local minimum? Is it even a local minimum, or just a saddle point? These questions have been the subject of active research recently, and one can find several papers in NIPS, ICML, ICLR etc. addressing them.
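
To make the non-convexity concrete, here is a minimal sketch (my own illustration, not taken from any of the papers below) that evaluates the squared loss of a tiny one-hidden-layer tanh network along a straight line between two arbitrary points in weight space. A convex function can never rise above the chord joining the endpoints; this loss typically does.

```python
# Minimal sketch (illustration only): the loss of a tiny 5-4-1 tanh network,
# evaluated along a line segment between two arbitrary points in weight space.
# If the loss were convex, it could never exceed the chord joining the endpoints.
import torch

torch.manual_seed(0)
X = torch.randn(64, 5)   # toy inputs
y = torch.randn(64, 1)   # toy targets

def loss(w):
    """Squared loss with all 24 weights packed into a single vector w."""
    W1 = w[:20].reshape(5, 4)
    W2 = w[20:].reshape(4, 1)
    return ((torch.tanh(X @ W1) @ W2 - y) ** 2).mean().item()

w_a, w_b = torch.randn(24), torch.randn(24)   # two arbitrary weight settings
for t in torch.linspace(0, 1, 11).tolist():
    w_t = (1 - t) * w_a + t * w_b
    chord = (1 - t) * loss(w_a) + t * loss(w_b)
    print(f"t={t:.1f}  loss={loss(w_t):.3f}  chord={chord:.3f}")
# Any t at which loss(w_t) > chord certifies non-convexity along this line,
# and hence non-convexity of the loss as a function of all the weights.
```

Of course this only probes a single line through weight space, but one violation of the chord inequality is enough to rule out convexity.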

First, let me briefly discuss the notion of “good enough” local minima. There are numerous papers that prove the global optimality of local minima for specific architectures, such as [1], [2], [3], [4], [5], [6], [7], [8], [9] and [10] (the list goes on!). The implication of these papers is that for several deep learning architectures, if we are sure that we have converged to a local minimum (and are not stuck at a saddle point!), we know that we have reached the global optimum. In other words, these papers show that for several architectures there are no “spurious” local minima, i.e. local minima that are not globally optimal. Interestingly, however, a paper published in ICML 2018 ([11]) shows that spurious local minima do exist even for a very simple two-layer ReLU network! I would also be remiss not to mention [17], a very comprehensive paper on the existence of spurious local minima, which analyzes the issue in much more depth rather than pointing out only pathological examples. These papers tell us that we should not be misled into generalizing the global optimality of local minima to all architectures.

Now comes the issue of saddle points. A saddle point is a critical point whose Hessian has both positive and negative eigenvalues, so it is neither positive semi-definite nor negative semi-definite. At a local minimum, none of the eigenvalues of the Hessian are negative (and at a strict local minimum they are all strictly positive, i.e. the Hessian is positive definite). In [12], it is argued that saddle points vastly outnumber local minima in high-dimensional problems (such as deep learning), which should make life difficult for us. The authors of [12] claim that gradient-based methods are repelled away from saddle points (yay?), but flat regions of the loss surface (where the negative eigenvalues are very small in magnitude) make it difficult for gradient-based methods to escape them. On the other hand, [1] shows that (for binary classification networks) there is a critical loss value below which almost all critical points are local minima. The authors of [13] analytically estimate the index (the fraction of negative eigenvalues of the Hessian) at a critical point of a single-hidden-layer ReLU network with the squared loss, as a function of the loss value at that critical point. In their analysis too, there is a critical loss value below which the index is 0, i.e. the critical point is a local minimum. In the paper I submitted recently, I carry out the same analysis for the regularized cross-entropy loss instead, and I also obtain a critical loss value below which the index is 0. This is great news: if we have converged to a critical point with a low loss value (of course, how low is not known), we can be reasonably sure that it is a local minimum, which in certain cases is also the global minimum!
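
As an illustration of these definitions (not a reproduction of the analysis in [13]), the sketch below runs plain gradient descent on the same toy 5-4-1 network as before (with ReLU activations this time), then inspects the Hessian at the resulting near-critical point and computes the index as the fraction of negative eigenvalues.

```python
# Illustration only (not the analysis of [13]): classify a near-critical point of
# a toy 5-4-1 ReLU network by the eigenvalues of its Hessian.
import torch

torch.manual_seed(0)
X = torch.randn(64, 5)
y = torch.randn(64, 1)

def loss_fn(w):
    """Squared loss of the 5-4-1 ReLU network, weights packed into a vector w."""
    W1 = w[:20].reshape(5, 4)
    W2 = w[20:].reshape(4, 1)
    return ((torch.relu(X @ W1) @ W2 - y) ** 2).mean()

# Plain gradient descent until the gradient is (hopefully) small.
w = torch.randn(24, requires_grad=True)
for _ in range(5000):
    grad, = torch.autograd.grad(loss_fn(w), w)
    with torch.no_grad():
        w -= 0.05 * grad

H = torch.autograd.functional.hessian(loss_fn, w.detach())   # 24 x 24 matrix
eigvals = torch.linalg.eigvalsh(H)
index = (eigvals < -1e-6).float().mean().item()   # fraction of negative eigenvalues
print(f"gradient norm: {grad.norm().item():.2e}")
print(f"index: {index:.2f} ->",
      "looks like a local minimum" if index == 0 else "looks like a saddle point")
```

In practice one should first check that the printed gradient norm really is small; only then does the index say anything about the nature of the point we have reached.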

Finally comes the question: how good are current optimization algorithms? It has been shown in [14] that gradient descent can take an exponential amount of time to escape saddle points. However, a noisy form of gradient descent can escape them efficiently. In [15], Perturbed Gradient Descent (PGD) is proposed, which essentially perturbs the iterates with isotropic uniform noise whenever the gradient becomes small. PGD converges to a local minimum in time poly-logarithmic in the dimension (i.e. the number of parameters/weights), and escapes saddle points in time logarithmic in the dimension and effectively linearithmic (i.e. linear times log) in the inverse of the minimum eigenvalue of the Hessian at that point. In [16], CNC-GD is proposed, whose guarantees are independent of the dimension altogether! The authors show that under the assumption that the stochastic gradients have a significant component along the eigenvector corresponding to the minimum eigenvalue of the Hessian (the Correlated Negative Curvature assumption), the need to perturb the stochastic gradients with isotropic noise is obviated, which removes the dependence on the dimension. The complexity with respect to the minimum eigenvalue is effectively the same as that of PGD. In my paper (which focuses only on escaping saddle points!), the proposed algorithm improves the complexity with respect to the minimum eigenvalue compared to both PGD and CNC-GD, while the complexity with respect to the dimension (logarithmic in the number of positive eigenvalues, which is less than the dimension) is better than that of PGD.
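
To give a flavor of the algorithmic idea, here is a stripped-down sketch of the perturbation step in PGD [15]: run ordinary gradient descent, and whenever the gradient is small (so we might be near a saddle point) and we have not perturbed recently, add isotropic noise drawn uniformly from a small ball around the current iterate. The function name, step size, thresholds and radius below are my own illustrative placeholders, not the values prescribed by the theory in [15], and the sketch omits the termination checks of the full algorithm.

```python
# Stripped-down sketch of the perturbation idea in PGD [15]; hyperparameters are
# illustrative placeholders, not the theoretically prescribed values.
import torch

def perturbed_gradient_descent(loss_fn, w0, lr=0.05, steps=10_000,
                               grad_thresh=1e-3, radius=1e-2, t_thresh=50):
    w = w0.detach().clone().requires_grad_(True)
    last_perturbation = -t_thresh
    for step in range(steps):
        grad, = torch.autograd.grad(loss_fn(w), w)
        with torch.no_grad():
            if grad.norm() <= grad_thresh and step - last_perturbation >= t_thresh:
                # Small gradient: we may be near a saddle point, so perturb the
                # iterate with noise sampled uniformly from a ball of radius `radius`
                # (random direction, scaled by U^(1/d) for uniformity over the ball).
                direction = torch.randn_like(w)
                direction /= direction.norm()
                scale = radius * torch.rand(()).item() ** (1.0 / w.numel())
                w += scale * direction
                last_perturbation = step
            else:
                w -= lr * grad   # ordinary gradient descent step
    return w.detach()
```

For instance, calling `perturbed_gradient_descent(loss_fn, torch.randn(24))` on the toy loss from the previous sketch runs gradient descent with occasional perturbations. CNC-GD [16], by contrast, relies on the inherent noise of the stochastic gradients themselves instead of injected isotropic noise.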

In conclusion, current optimization algorithms are fairly robust to saddle points and are able to converge to local minima, and improvements continue to appear. For many models, these local minima are also globally optimal (or at least nearly so), but there may well be cases, as yet unknown, where this does not hold.

Hope this gives you a brief insight into non-convex optimization in deep learning!

References

[1] Choromanska, A.; Henaff, M.; Mathieu, M.; Arous, G. B.; and LeCun, Y. 2015. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, 192–204.

[2] Kawaguchi, K. 2016. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 586–594.

[3] Nguyen, Q., and Hein, M. 2017. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045.

[4] Nguyen, Q., and Hein, M. 2018. Optimization landscape and expressivity of deep cnns. In International Conference on Machine Learning, 3727–3736.

[5] Freeman, C. D., and Bruna, J. 2016. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540.

[6] Hardt, M., and Ma, T. 2016. Identity matters in deep learning. arXiv preprint arXiv:1611.04231.

[7] Yun, C.; Sra, S.; and Jadbabaie, A. 2017. Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444.

[8] Du, S. S., and Lee, J. D. 2018. On the power of overparametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206.

[9] Du, S. S.; Lee, J. D.; Tian, Y.; Poczos, B.; and Singh, A. 2017b. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779.

[10] Laurent, T., and Brecht, J. 2018. Deep linear networks with arbitrary loss: All local minima are global. In International Conference on Machine Learning, 2908–2913.

[11] Safran, I., and Shamir, O. 2017. Spurious local minima are common in two-layer relu neural networks. arXiv preprint arXiv:1712.08968.

[12] Dauphin, Y. N.; Pascanu, R.; Gulcehre, C.; Cho, K.; Ganguli, S.; and Bengio, Y. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, 2933–2941.

[13] Pennington, J., and Bahri, Y. 2017. Geometry of neural network loss surfaces via random matrix theory. In International Conference on Machine Learning, 2798–2806.

[14] Du, S. S.; Jin, C.; Lee, J. D.; Jordan, M. I.; Singh, A.; and Poczos, B. 2017a. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, 1067–1077.

[15] Jin, C.; Ge, R.; Netrapalli, P.; Kakade, S. M.; and Jordan, M. I. 2017. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887.

[16] Daneshmand, H.; Kohler, J.; Lucchi, A.; and Hofmann, T. 2018. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999.

[17] Yun, C.; Sra, S.; and Jadbabaie, A. 2018. A critical view of global optimality in deep learning. arXiv preprint arXiv:1802.03487.