This short post is on the importance of having proper initialization while using alternating minimization for two variables, such that the objective function under consideration is convex in each variable individually but not jointly convex with respect to both the variables.
Alternating minimization (AM) has been succesfully used for solving several non-convex problems such as matrix completion, dictionary learning, image deblurring, EM algorithm, matrix factorization, etc. However, there is relatively less work on providing conditions under which it can converge to the optimal solution or close to it. This work here is a summary of some very nice theoretical work on AM. The authors present their results on 3 important problems namely matrix completion, phase retrieval and dictionary learning for which AM is used. The thing that stood out for me, however, was that they provide a good initialization for each problem separately (for phase retrieval they even provide a better initialization).
Relating with my work, I have used AM for 2 optimization problems so far. One was in my paper “Sparse Kernel PCA for Outlier Detection” where it is used to obtain the approximate sparse eigenvectors. In this case, we were lucky, since using the actual eigenvectors as the initial solution worked pretty well (nothing genius about that!). It is in the second problem, where I am struggling to find a good initialization. The second problem is non-linear blind compressed sensing, which is very similar to the dictionary learning problem except that there is also a sensing matrix term (which is known by the way) that is different for every training example and most importantly a non-linear transformation of the data. Also, there are further positivity constraints on the dictionary as well as the sparse codes due to the domain of the non-linear function in consideration. The sensing matrix is also not the usual Gaussian or Bernoulli ($\pm 1$) matrix ):
So far, I just try with different random initializations and use the one which gives the best results. Obviously, this is extremely inefficient. Thus, it would be very nice to have a systematic way of choosing a good initialization even for non-linear problems wherein AM is used. This would be also very helpful in obtaining theoretical guarantees on its performance.