Rudrajit Das (rudrajit1503@gmail.com), CS PhD student at UT Austin. Feed: https://rudrajit15.github.io/feed.xml (generated by Jekyll, 2019-12-25).

Good Initialization for Alternating Minimization (2018-10-26, https://rudrajit15.github.io/posts/2018/10/blog-post-4)

<p>This short post is about the importance of proper initialization when using alternating minimization over two variables, where the objective function is convex in each variable individually but not jointly convex in both.</p>
<p>Alternating minimization (AM) has been successfully used to solve several non-convex problems such as matrix completion, dictionary learning, image deblurring, the EM algorithm and matrix factorization. However, there is relatively little work providing conditions under which it converges to the optimal solution or close to it. The work
<a href="https://www.microsoft.com/en-us/research/uploads/prod/2016/02/altmin-altmin_pdf.pdf" style="color: #0000FF">here</a> is a summary of some very nice theoretical results on AM. The authors present their results on three important problems for which AM is used, namely matrix completion, phase retrieval and dictionary learning. What stood out for me, however, was that they provide a good initialization separately for each problem (for phase retrieval they even provide an improved initialization).</p>
<p>Relating this to my own work, I have used AM for two optimization problems so far. One was in my paper “Sparse Kernel PCA for Outlier Detection”, where it is used to obtain the approximate sparse eigenvectors. In that case we were lucky, since using the actual eigenvectors as the initial solution worked pretty well (nothing genius about that!). It is in the second problem that I am struggling to find a good initialization. That problem is non-linear blind compressed sensing, which is very similar to dictionary learning except that there is also a sensing matrix term (which is known, by the way) that is different for every training example and, most importantly, a non-linear transformation of the data. There are also positivity constraints on both the dictionary and the sparse codes, due to the domain of the non-linear function in question. And the sensing matrix is not the usual Gaussian or Bernoulli ($\pm 1$) matrix either :(</p>
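<p>To make the structure concrete, here is a minimal sketch of AM on a toy rank-1 matrix factorization (a stand-in illustration, not the blind compressed sensing problem above): the objective is convex in each factor separately but not jointly, each AM step is an exact least-squares solve, and a spectral initialization of the kind the AM literature advocates is compared with a random one.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rank-1 factorization M ~ u v^T: ||M - u v^T||_F^2 is convex in u
# for fixed v and vice versa, but not jointly convex in (u, v).
u_star = rng.normal(size=50)
v_star = rng.normal(size=30)
M = np.outer(u_star, v_star)

def altmin(u0, iters=50):
    """Alternating least squares starting from u0; returns the residual."""
    u = np.array(u0, dtype=float)
    for _ in range(iters):
        v = M.T @ u / (u @ u)   # exact minimizer over v for fixed u
        u = M @ v / (v @ v)     # exact minimizer over u for fixed v
    return np.linalg.norm(M - np.outer(u, v))

# Spectral initialization: the top left singular vector of M, the kind
# of "good initialization" used in the theoretical AM literature.
U, s, Vt = np.linalg.svd(M)
print(altmin(U[:, 0]))           # essentially zero residual

# On this easy rank-1 problem a random initialization also succeeds;
# the point of the post is that for harder problems it often does not.
print(altmin(rng.normal(size=50)))
```

<p>On genuinely hard non-convex problems (higher rank, missing entries, non-linearities), the random initialization above is exactly what can get stuck, which is why the initialization question matters.</p>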
<p>So far, I just try different random initializations and use the one that gives the best results. Obviously, this is extremely inefficient. It would therefore be very nice to have a systematic way of choosing a good initialization even for non-linear problems in which AM is used. This would also be very helpful in obtaining theoretical guarantees on its performance.</p>

Extreme Value Theory (EVT) for Limiting Distributions of Extreme Events (2018-10-04, https://rudrajit15.github.io/posts/2018/10/blog-post-3)

<p>This short post contains a few references for EVT, which is relatively unknown (in my opinion) among people who aren’t statisticians but use it heavily (such as in machine learning).</p>
<p>EVT states that the limiting distribution of the maximum or minimum of i.i.d. random variables can be modelled by just three types
of distributions, namely the Gumbel, Fréchet and Weibull types, as long as the individual distributions obey certain conditions (which
the standard distributions we usually deal with generally satisfy). It is in some ways the analogue of the central limit theorem
(which concerns the distribution of sums of i.i.d. random variables).</p>
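<p>A quick simulation makes the theorem tangible. For i.i.d. Exponential(1) variables, a classical EVT result says the shifted maximum $\max_i X_i - \log n$ converges to the standard Gumbel distribution, whose mean is the Euler–Mascheroni constant ($\approx 0.5772$) and whose variance is $\pi^2/6 \approx 1.6449$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# For n i.i.d. Exponential(1) variables, max_i X_i - log(n) converges
# in distribution to the standard Gumbel law (mean ~0.5772, var ~1.6449).
n, trials = 1_000, 10_000
x = rng.exponential(size=(trials, n))
shifted_max = x.max(axis=1) - np.log(n)

# Sample mean and variance should be close to the Gumbel values.
print(shifted_max.mean(), shifted_max.var())
```

<p>The parameter choices here (n, number of trials) are just ones that make the limit visible at modest cost; other light-tailed distributions converge to the Gumbel type as well, though with distribution-specific centring and scaling.</p>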
<p>I’m writing about EVT here because until about a month ago I had never heard of it. While working on my previous paper, I had to do
some analysis on the maximum of a significantly large number of i.i.d. chi-squared random variables, and the EVT result came in really handy.
And this was after wasting a lot of time on the usual CDF of the maximum of $n$ i.i.d. random variables, say each with CDF $F(\cdot)$,
which is simply $F(\cdot)^{n}$. The power of EVT really captivated me and I thought it worthy of a mention.</p>
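<p>The $F(\cdot)^n$ identity itself is easy to sanity-check by simulation. Chi-squared with 2 degrees of freedom has the closed-form CDF $F(x) = 1 - e^{-x/2}$, so we can compare the empirical CDF of the maximum against $F(x)^n$ directly (the specific $n$ and evaluation point are arbitrary choices for the demo):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# CDF of the max of n i.i.d. variables with CDF F is F(x)^n.
# Chi-squared with 2 dof has the closed form F(x) = 1 - exp(-x/2).
n, trials, x0 = 20, 100_000, 10.0
samples = rng.chisquare(df=2, size=(trials, n))

empirical = (samples.max(axis=1) <= x0).mean()   # Monte Carlo estimate
exact = (1 - np.exp(-x0 / 2)) ** n               # F(x0)^n

print(empirical, exact)
```

<p>The difficulty the post alludes to is not computing $F^n$ numerically but extracting asymptotics from it as $n \to \infty$, which is precisely what EVT packages up for you.</p>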
<p>Coming to the references, the book “Modelling Extremal Events for Insurance and Finance” (yes, finance!) by Paul Embrechts, Claudia
Klüppelberg and Thomas Mikosch contains several useful results related to EVT for several classes of distributions. Unfortunately, obtaining
a free and complete online version of this book might be difficult: it’s available on Google Books, but with several pages missing!
There are also quite a few useful Stack Exchange discussions on EVT, such as the one
<a href="https://math.stackexchange.com/questions/450139/asymptotics-of-maxima-of-i-i-d-chi-square-random-variables" style="color: #0000FF">
here</a>. If you are interested in a more detailed discussion of EVT, some online PDFs are available, one of which you can find
<a href="http://www.maths.manchester.ac.uk/~saralees/chap1.pdf" style="color: #0000FF">here</a>.</p>
<p>Please note that the EVT discussed here works only for i.i.d. random variables, so if you are dealing with non-i.i.d. random variables
(as is often the case, for example with non-stationary stochastic processes), you have to modify the EVT suitably, which is non-trivial.
The discussion of that is beyond the scope of this post.</p>
<p>I hope this post has somewhat enlightened you about the existence of EVT, which can be very useful in certain situations.</p>

Recent Advances in Non-Convex Optimization for Deep Learning (2018-09-15, https://rudrajit15.github.io/posts/2018/09/blog-post-2)

<p>This post contains a summary of recent advances in non-convex optimization in deep learning, discussing the optimality of local minima for several models, the issue of saddle points, and modifications to stochastic gradient descent that are robust to saddle points.</p>
<p>It is well known that the objective function of neural networks is highly non-convex (it is individually convex with respect to the weights of each layer, but not jointly convex). Thus, using gradient-based or second-order methods, we aren’t guaranteed to converge to the global minimum. So how good is the critical/stationary point (i.e. a point where the derivatives of the loss function with respect to the parameters/weights are zero) that we converge to? Is it a “good enough” local minimum, is it even a local minimum at all, or is it just a saddle point? These questions have been the subject of active research recently, and one can find several papers in NIPS, ICML, ICLR etc. addressing some of them.</p>
<p>Firstly, let me briefly discuss the notion of a “good enough” local minimum. There are numerous papers which prove the global optimality of local minima for specific architectures, such as [1], [2], [3], [4], [5], [6], [7], [8], [9] and [10] (the list goes on!). The implication of these papers is that for several deep learning architectures, if we are sure that we have converged to a local minimum (and not stuck at a saddle point!), we know that we have reached the global optimum. In other words, these papers show that for several architectures there are no “spurious” local minima, i.e. local minima which aren’t globally optimal. Interestingly, however, a paper published in ICML 2018 ([11]) shows that spurious local minima do exist even for a very simple two-layer ReLU network! I would also be remiss not to mention [17], a very comprehensive paper on the existence of spurious local minima which analyzes this issue in much more depth rather than pointing out only pathological examples. These papers tell us that we shouldn’t be misled into generalizing the global optimality of local minima to all architectures.</p>
<p>Now comes the issue of saddle points. A saddle point is a critical point where the Hessian is neither positive definite nor negative definite, i.e. some eigenvalues of the Hessian are positive while others are negative. For a local minimum, all the eigenvalues of the Hessian are strictly positive (the Hessian is positive definite). In [12], it is mentioned that saddle points vastly outnumber local minima in high-dimensional problems (such as deep learning), which should make life difficult for us. The authors of [12] claim that gradient-based methods are repelled away from saddle points (yay?), but flat regions of the loss surface (where the negative eigenvalues are very small in magnitude) make it difficult for gradient-based methods to escape from saddle points. On the other hand, in [1] it is shown that (for binary classification networks) there is a critical loss value below which almost all critical points are local minima. The authors of [13] analytically estimate the index (the fraction of negative eigenvalues of the Hessian) at a critical point of a single-hidden-layer ReLU network with the squared loss, as a function of the loss value at that critical point. In their analysis too, there is a critical loss value below which the index is 0, i.e. the critical point is a local minimum. In the paper that I submitted recently, I carried out the same analysis for the regularized cross-entropy loss instead, and I too obtain a critical loss value below which the index is 0. This is great news, since if we have converged to a critical point with a low loss value (of course, how low is not known), we can be reasonably sure that it is a local minimum, which in certain cases is also the global minimum!</p>
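<p>The Hessian-eigenvalue definitions above can be illustrated on the textbook saddle $f(x, y) = x^2 - y^2$ at the origin (a deliberately tiny example, not a neural network loss):</p>

```python
import numpy as np

# Hessians at the origin of two toy functions:
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])   # Hessian of f(x, y) = x^2 - y^2
H_minimum = np.array([[2.0, 0.0],
                      [0.0, 2.0]])   # Hessian of f(x, y) = x^2 + y^2

def classify(H):
    """Classify a critical point by the signs of the Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    return "saddle point"

print(classify(H_saddle))    # saddle point
print(classify(H_minimum))   # local minimum

# The "index" from [13] is the fraction of negative eigenvalues.
index = np.mean(np.linalg.eigvalsh(H_saddle) < 0)
print(index)                 # 0.5
```

<p>In deep learning the Hessian is far too large to eigendecompose directly, so works like [13] resort to random matrix theory instead, but the classification logic is the same.</p>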
<p>Finally comes the question: how good are the current optimization algorithms? It has been shown in [14] that gradient descent can take an exponential amount of time to escape from saddle points. However, a noisy form of gradient descent can potentially escape saddle points. In [15], Perturbed Gradient Descent (which essentially perturbs the iterates with isotropic uniform noise) is proposed; it can converge to a local minimum in time poly-logarithmic in the dimension (i.e. the number of parameters/weights), and it escapes saddle points in time logarithmic in the dimension and effectively linearithmic (i.e. linear times log) in the inverse of the minimum eigenvalue of the Hessian at that point. In [16], CNC-GD is proposed, whose complexity is independent of the dimension altogether! The authors show that under the assumption that the stochastic gradients exhibit a significant component along the eigenvector corresponding to the minimum eigenvalue of the Hessian (termed the Correlated Negative Curvature assumption), the need to perturb the stochastic gradients with isotropic noise is obviated. This removes the dependence on the dimension. The complexity with respect to the minimum eigenvalue is effectively the same as that of PGD. In my paper (which focuses only on escaping saddle points!), the proposed algorithm improves the complexity with respect to the minimum eigenvalue compared to both PGD and CNC-GD, whereas the complexity with respect to the dimension (logarithmic in the number of positive eigenvalues, which is less than the dimension) is better than that of PGD.</p>
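<p>A toy sketch of the “perturb to escape” idea (illustrating the mechanism only, not the exact algorithm of [15]): on $f(x, y) = x^2 - y^2$, plain gradient descent started exactly at the saddle $(0, 0)$ never moves, while injecting a small random perturbation when the gradient vanishes lets the iterate escape along the negative-curvature $y$ direction.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(p):
    return np.array([2 * p[0], -2 * p[1]])   # gradient of x^2 - y^2

def descend(p0, steps=100, lr=0.1, perturb=False):
    """Gradient descent; optionally add noise when the gradient is ~0."""
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        if perturb and np.linalg.norm(grad(p)) < 1e-8:
            p += 1e-3 * rng.normal(size=2)   # small isotropic kick
        p = p - lr * grad(p)
    return p

stuck = descend(np.zeros(2))                  # stays at the saddle
escaped = descend(np.zeros(2), perturb=True)  # |y| grows, escaping it
print(stuck, np.abs(escaped[1]))
```

<p>In the perturbed run, the $y$ update is $y \leftarrow (1 + 2\,\mathrm{lr})\,y$, so even a tiny kick grows geometrically; [15] makes this rigorous, with the perturbation drawn uniformly from a ball and the escape time controlled by the most negative eigenvalue.</p>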
<p>In conclusion, current optimization algorithms are fairly robust to saddle points and are able to converge to local minima, and more and more improvements continue to come up. For many models, local minima are also globally optimal (or at least nearly so), but there might be several unknown cases where this does not hold.</p>
<p>Hope this gives you a brief insight into non-convex optimization in deep learning!</p>
<p><strong>REFERENCES -</strong></p>
<p>[1] Choromanska, A.; Henaff, M.; Mathieu, M.; Arous, G. B.;
and LeCun, Y. 2015. The loss surfaces of multilayer networks.
In Artificial Intelligence and Statistics, 192–204.</p>
<p>[2] Kawaguchi, K. 2016. Deep learning without poor local minima.
In Advances in Neural Information Processing Systems,
586–594.</p>
<p>[3] Nguyen, Q., and Hein, M. 2017. The loss surface
of deep and wide neural networks. arXiv preprint
arXiv:1704.08045.</p>
<p>[4] Nguyen, Q., and Hein, M. 2018. Optimization landscape
and expressivity of deep cnns. In International Conference
on Machine Learning, 3727–3736.</p>
<p>[5] Freeman, C. D., and Bruna, J. 2016. Topology and geometry
of half-rectified network optimization. arXiv preprint
arXiv:1611.01540.</p>
<p>[6] Hardt, M., and Ma, T. 2016. Identity matters in deep learning.
arXiv preprint arXiv:1611.04231.</p>
<p>[7] Yun, C.; Sra, S.; and Jadbabaie, A. 2017. Global optimality
conditions for deep neural networks. arXiv preprint
arXiv:1707.02444.</p>
<p>[8] Du, S. S., and Lee, J. D. 2018. On the power of overparametrization
in neural networks with quadratic activation.
arXiv preprint arXiv:1803.01206.</p>
<p>[9] Du, S. S.; Lee, J. D.; Tian, Y.; Poczos, B.; and Singh,
A. 2017b. Gradient descent learns one-hidden-layer cnn:
Don’t be afraid of spurious local minima. arXiv preprint
arXiv:1712.00779.</p>
<p>[10] Laurent, T., and Brecht, J. 2018. Deep linear networks with
arbitrary loss: All local minima are global. In International
Conference on Machine Learning, 2908–2913.</p>
<p>[11] Safran, I., and Shamir, O. 2017. Spurious local minima are
common in two-layer relu neural networks. arXiv preprint
arXiv:1712.08968.</p>
<p>[12] Dauphin, Y. N.; Pascanu, R.; Gulcehre, C.; Cho, K.; Ganguli,
S.; and Bengio, Y. 2014. Identifying and attacking the
saddle point problem in high-dimensional non-convex optimization.
In Advances in neural information processing
systems, 2933–2941.</p>
<p>[13] Pennington, J., and Bahri, Y. 2017. Geometry of neural
network loss surfaces via random matrix theory. In International
Conference on Machine Learning, 2798–2806.</p>
<p>[14] Du, S. S.; Jin, C.; Lee, J. D.; Jordan, M. I.; Singh, A.; and
Poczos, B. 2017a. Gradient descent can take exponential
time to escape saddle points. In Advances in Neural Information
Processing Systems, 1067–1077.</p>
<p>[15] Jin, C.; Ge, R.; Netrapalli, P.; Kakade, S. M.; and Jordan,
M. I. 2017. How to escape saddle points efficiently. arXiv
preprint arXiv:1703.00887.</p>
<p>[16] Daneshmand, H.; Kohler, J.; Lucchi, A.; and Hofmann, T.
2018. Escaping saddles with stochastic gradients. arXiv
preprint arXiv:1803.05999.</p>
<p>[17] Yun, C.; Sra, S.; and Jadbabaie, A. 2018. A critical view
of global optimality in deep learning. arXiv preprint
arXiv:1802.03487.</p>

Theoretical Research in Deep Learning (2018-09-12, https://rudrajit15.github.io/posts/2018/09/blog-post-1)

<p>This post contains some guidelines (gathered from personal experience and also from some highly experienced people) for doing theoretical research in deep learning (and machine learning in general), strictly for newbies!</p>
<p>I am myself relatively new to this area, with about a year of experience. In my Dual Degree Project (essentially my Master’s thesis), I am working on some fundamental theoretical aspects of deep learning. Given that all my advisor’s PhD students are working on application-based topics, my advisor was very keen on having new students work on some mathematical topics related to deep learning. And I couldn’t be happier! Apart from this, in my fourth year I took an R&D course in which I worked on some theoretical aspects of kernels in machine learning.</p>
<p>Before anyone ventures into the theoretical side of deep learning, a word of caution: it is <strong>NOT</strong> at all easy to make significant contributions in this field! One must have a really solid mathematical background as well as the patience to go through several highly non-trivial (and often extremely arduous) papers. It can get extremely frustrating at times for several reasons, such as not being able to come up with good ideas, not being able to develop a potential idea properly, or wasting hours/days on something which led to nothing. But if you are up for the challenge, it should be really enjoyable.</p>
<p>First of all, you have to refer to several theoretical papers from top AI conferences like ICML, NIPS and ICLR, since every year there are tons of excellent and highly original theoretical papers published in these conferences. The papers in NIPS and ICML are not restricted to just deep learning and encompass several other domains, whereas ICLR is a bit more deep learning centred. Nevertheless, the quality of the theoretical papers is very high in all these conferences. The papers published in these top conferences give you a good idea of which theoretical problems are of current relevance or importance in the AI community. In my case, I started off by skimming through the lists of papers published in ICML 2017 and ICLR 2018. I glanced through the lists of accepted papers and only looked at the abstracts of the ones whose titles seemed interesting enough to me. This saves you a lot of time, as the lists are just huge. I mainly looked at papers on the expressive power of neural networks (ICLR 2018 had loads of them!) and on optimization in general. So you could also choose some specific topics which you prefer.</p>
<p>Once you have referred to several (the quantification of ‘several’ is subjective) papers, you have to find gaps in the existing work, or a significant extension/improvement of someone else’s work that (“to the best of your knowledge”) has not been attempted so far. A potential problem here is struggling to come up with new ideas or to develop a potential idea concretely. It happened to me as well. Talk about this to your advisor or any acquaintance who is actively involved in theoretical work. They can often suggest good ideas (owing to their experience) or some other direction altogether which they are very optimistic about. I think experiments are really good stimuli for ideas, especially in deep learning. I have seen quite a few papers which perform interesting experiments to raise some specific issue, which is then resolved theoretically, either in that paper itself or in a subsequent one. So, for instance, you could pick some specific algorithm and try to figure out cases where it fails, why it could be a potential problem in other cases too, and possibly suggest some remedy to fix it. Additionally, some empirical modelling/observations/facts (please proceed with caution here!) could be used in conjunction with elaborate theory, especially for the really complicated analysis which often comes up in deep learning.</p>
<p>Now that you have some really cool ideas, it is imperative that you do a thorough literature survey specific to your topic, in order to ascertain whether your idea has not already been published by other smart people. I say this from personal experience. In my fourth year, I spent two weeks conceiving a novel algorithm and its proof, based on some ideas I had read up elsewhere, only to find out a week later (and that too after emailing it to my guide) that it had already been published! It was very disheartening and also a massive waste of time. It may also happen that someone has come up with something better than your idea but you are not aware of it, due to insufficient literature survey or, as in my case, because it got published just days before my submission deadline! So always be on the lookout for similar papers (i.e. papers related to your topic), especially if you are working on something very recent.</p>
<p>Once you are sure that your super cool idea is completely novel, start writing it down properly. I, for instance, get way too carried away with ideas in my mind; it is only when I start writing them down properly that problems begin to show up! Also, hand-wavy arguments are an absolute no-no in theoretical papers. I recommend writing it down in the form of a paper right away (here I’m assuming that your ultimate goal is a publication), because doing so forces you to write each and every step properly and clearly, which will help you spot the sketchy areas of your proofs. After writing it down properly, have your advisor and any mathematically inclined person read all of it carefully. You do not want to put something incorrect in your paper! But more than the math itself, your paper should be lucid enough for other people to understand. Please realize that your idea comes across as completely new to them, and if it’s not presented well enough, you might not get favourable reviews/responses even though your idea is brilliant. So presentation plays a key role in theoretical papers. My advisor told me all of this, since the initial draft of my paper made very little sense to him. Needless to say, paper writing is a highly iterative process: you have to constantly make changes as others review it. Usually the final draft of your paper will differ significantly from your initial draft.</p>
<p>I hope these guidelines prove useful to some of you. Best of luck!!</p>