Notes on Matrix Completion Problem and Alternating Projection Methods

In this post, I explore the classical matrix completion problem through a projection method and present the convergence result in general case.

$$ \newcommand{\bzero}{\mathbf{0}} \newcommand{\ba}{\mathbf{a}}\newcommand{\bA}{\mathbf{A}} \newcommand{\bb}{\mathbf{b}}\newcommand{\bB}{\mathbf{B}} \newcommand{\bc}{\mathbf{c}}\newcommand{\bC}{\mathbf{C}} \newcommand{\bd}{\mathbf{d}}\newcommand{\bD}{\mathbf{D}} \newcommand{\be}{\mathbf{e}}\newcommand{\bE}{\mathbf{E}} \newcommand{\bff}{\mathbf{f}}\newcommand{\bF}{\mathbf{F}} \newcommand{\bg}{\mathbf{g}}\newcommand{\bG}{\mathbf{G}} \newcommand{\bh}{\mathbf{h}}\newcommand{\bH}{\mathbf{H}} \newcommand{\bi}{\mathbf{i}}\newcommand{\bI}{\mathbf{I}} \newcommand{\bj}{\mathbf{j}}\newcommand{\bJ}{\mathbf{J}} \newcommand{\bk}{\mathbf{k}}\newcommand{\bK}{\mathbf{K}} \newcommand{\bl}{\mathbf{l}}\newcommand{\bL}{\mathbf{L}} \newcommand{\bm}{\mathbf{m}}\newcommand{\bM}{\mathbf{M}} \newcommand{\bn}{\mathbf{n}}\newcommand{\bN}{\mathbf{N}} \newcommand{\bo}{\mathbf{o}}\newcommand{\bO}{\mathbf{O}} \newcommand{\bp}{\mathbf{p}}\newcommand{\bP}{\mathbf{P}} \newcommand{\bq}{\mathbf{q}}\newcommand{\bQ}{\mathbf{Q}} \newcommand{\br}{\mathbf{r}}\newcommand{\bR}{\mathbf{R}} \newcommand{\bs}{\mathbf{s}}\newcommand{\bS}{\mathbf{S}} \newcommand{\bt}{\mathbf{t}}\newcommand{\bT}{\mathbf{T}} \newcommand{\bu}{\mathbf{u}}\newcommand{\bU}{\mathbf{U}} \newcommand{\bv}{\mathbf{v}}\newcommand{\bV}{\mathbf{V}} \newcommand{\bw}{\mathbf{w}}\newcommand{\bW}{\mathbf{W}} \newcommand{\bx}{\mathbf{x}}\newcommand{\bX}{\mathbf{X}} \newcommand{\by}{\mathbf{y}}\newcommand{\bY}{\mathbf{Y}} \newcommand{\bz}{\mathbf{z}}\newcommand{\bZ}{\mathbf{Z}} \newcommand{\balpha}{\boldsymbol{\alpha}}\newcommand{\bAlpha}{\boldsymbol{\Alpha}} \newcommand{\bbeta}{\boldsymbol{\beta}}\newcommand{\bBeta}{\boldsymbol{\Beta}} \newcommand{\bgamma}{\boldsymbol{\gamma}}\newcommand{\bGamma}{\boldsymbol{\Gamma}} \newcommand{\bdelta}{\boldsymbol{\delta}}\newcommand{\bDelta}{\boldsymbol{\Delta}} \newcommand{\bepsilon}{\boldsymbol{\epsilon}}\newcommand{\bEpsilon}{\boldsymbol{\Epsilon}} \newcommand{\bzeta}{\boldsymbol{\zeta}}\newcommand{\bZeta}{\boldsymbol{\Zeta}} \newcommand{\beeta}{\boldsymbol{\eta}}\newcommand{\bEta}{\boldsymbol{\Eta}} % \beta already taken \newcommand{\btheta}{\boldsymbol{\theta}}\newcommand{\bTheta}{\boldsymbol{\Theta}} \newcommand{\biota}{\boldsymbol{\iota}}\newcommand{\bIota}{\boldsymbol{\Iota}} \newcommand{\bkappa}{\boldsymbol{\kappa}}\newcommand{\bKappa}{\boldsymbol{\Kappa}} \newcommand{\blambda}{\boldsymbol{\lambda}}\newcommand{\bLambda}{\boldsymbol{\Lambda}} \newcommand{\bmu}{\boldsymbol{\mu}}\newcommand{\bMu}{\boldsymbol{\Mu}} \newcommand{\bnu}{\boldsymbol{\nu}}\newcommand{\bNu}{\boldsymbol{\Nu}} \newcommand{\bxi}{\boldsymbol{\xi}}\newcommand{\bXi}{\boldsymbol{\Xi}} \newcommand{\bomikron}{\boldsymbol{\omikron}}\newcommand{\bOmikron}{\boldsymbol{\Omikron}} \newcommand{\bpi}{\boldsymbol{\pi}}\newcommand{\bPi}{\boldsymbol{\Pi}} \newcommand{\brho}{\boldsymbol{\rho}}\newcommand{\bRho}{\boldsymbol{\Rho}} \newcommand{\bsigma}{\boldsymbol{\sigma}}\newcommand{\bSigma}{\boldsymbol{\Sigma}} \newcommand{\btau}{\boldsymbol{\tau}}\newcommand{\bTau}{\boldsymbol{\Tau}} \newcommand{\bypsilon}{\boldsymbol{\ypsilon}}\newcommand{\bYpsilon}{\boldsymbol{\Ypsilon}} \newcommand{\bphi}{\boldsymbol{\phi}}\newcommand{\bPhi}{\boldsymbol{\Phi}} \newcommand{\bchi}{\boldsymbol{\chi}}\newcommand{\bChi}{\boldsymbol{\Chi}} \newcommand{\bpsi}{\boldsymbol{\psi}}\newcommand{\bPsi}{\boldsymbol{\Psi}} \newcommand{\bomega}{\boldsymbol{\omega}}\newcommand{\bOmega}{\boldsymbol{\Omega}} \newcommand{\nA}{\mathbb{A}} \newcommand{\nB}{\mathbb{B}} \newcommand{\nC}{\mathbb{C}} \newcommand{\nD}{\mathbb{D}} \newcommand{\nE}{\mathbb{E}} \newcommand{\nF}{\mathbb{F}} \newcommand{\nG}{\mathbb{G}} \newcommand{\nH}{\mathbb{H}} \newcommand{\nI}{\mathbb{I}} \newcommand{\nJ}{\mathbb{J}} \newcommand{\nK}{\mathbb{K}} \newcommand{\nL}{\mathbb{L}} \newcommand{\nM}{\mathbb{M}} \newcommand{\nN}{\mathbb{N}} \newcommand{\nO}{\mathbb{O}} \newcommand{\nP}{\mathbb{P}} \newcommand{\nQ}{\mathbb{Q}} \newcommand{\nR}{\mathbb{R}} \newcommand{\nS}{\mathbb{S}} \newcommand{\nT}{\mathbb{T}} \newcommand{\nU}{\mathbb{U}} \newcommand{\nV}{\mathbb{V}} \newcommand{\nW}{\mathbb{W}} \newcommand{\nX}{\mathbb{X}} \newcommand{\nY}{\mathbb{Y}} \newcommand{\nZ}{\mathbb{Z}} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cW}{\mathcal{W}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \DeclareMathOperator*{\argmax}{argmax~} \DeclareMathOperator*{\argmin}{argmin~} \DeclareMathOperator*{\Tr}{Tr} \DeclareMathOperator*{\Bias}{Bias} \DeclareMathOperator*{\Var}{Var} \newcommand{\Perp}{\perp\!\!\! \perp} \let\dsad\d \renewcommand\d{\mathrm{d}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Eb}[1]{\mathbb{E} \left[ #1 \right]} \newcommand{\F}{\mathcal{F}} \newcommand{\X}{\mathcal{X}} \newcommand{\vocab}[1]{\textbf{\color{blue} #1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\inner}[2]{ \left \langle #1, #2 \right \rangle } \newcommand{\ibra}[1]{\llbracket #1 \rrbracket} \newenvironment{nalign}{ \begin{equation} \begin{aligned} }{ \end{aligned} \end{equation} \ignorespacesafterend } \newcommand{\icol}[1]{ \left(\begin{smallmatrix}#1\end{smallmatrix}\right)% } \newcommand{\cc}[1]{\overline{#1}} \newcommand{\tr}[1]{\text{tr} \left( #1 \right)} \newcommand{\aff}{\text{aff} \,} \newcommand{\ri}{\text{ri} \, } \newcommand{\rb}{\text{rb} \, } \newcommand{\cl}{\text{cl} \, } \newcommand{\conv}{\text{conv} \, } \newcommand{\irow}[1]{ \begin{smallmatrix}(#1)\end{smallmatrix}% } \newcommand\abs[1]{\lvert #1 \rvert} $$

\[\begin{equation} \begin{split} \nabla_\bpi & \E_{p_\bpi} [\cL(\theta, \bm)] , \bm \sim \text{cat}(\bpi) \\ = & \int \nabla_\bpi {p_\bpi} (\bm) [\cL(\theta, \bm)] \d\bm \\ = &\int \nabla_\bpi \log{p_\bpi} (\bm) p_\pi(\bm) [\cL(\theta, \bm)] \d\bm \\ = & \E_{\bm \sim p_\pi} \left [\nabla_\bpi p_\pi(\bm) \cL(\btheta, \bm) \right] \approx \nabla_\bpi {p_\bpi} (\bm) [\cL(\theta, \bm)] \end{split} \end{equation}\]

Problem Statement

Given a a set (which I index from like a dictionary) of observations consisting of pairs $(i,j)$ $\bs$ that come from some unknown matrix $M \in \R^{m \times n}$ (say of $m$ users and $n$ options to rate), we are interested in filling in the missing entries given only the observations $\bs$. The main underlying assumption is that the matrix is simply a result of some low-dimensional projection onto the higher space and our goal is to recover that projection.

Formally, the problem can be written as rank-minimization problem in a way that reovers all the entries that are observed:

\[\begin{equation} \begin{split} \min_{X \in \R^{m\times n}} & \text{rank}(X) \\ \text{s.t. } & X_{i,j} = \bs_{i,j} \quad \forall (i,j) \in \bs \end{split} \end{equation}\]

However, this problem is hard to optimize, since rank is neither differentiable nor conex. To go around it, the rank function can be approximated. Because we know that $\text{rank}(M) = \text{rank}(M^\intercal M)$, and also

\[\begin{equation} \begin{split} \text{rank}(M) = \sum_{i=1}^{\text{min(m,n)}} \mathbb{1}_{\sigma_i > 0}, \end{split} \end{equation}\]

where $\sigma_i$ is the $i$-th singular value of $M$. We can approximate the rank function with a (convex) nuclear matrix norm $\norm{\cdot}_*$:

\[\begin{equation} \begin{split} \norm{M}_* = \tr{\sqrt{M^TM}} = \sum_{i=1}^{\text{min}(m,n)} \sigma_i, \end{split} \end{equation}\]

which can be seen by writing $M$ in its SVD form: $M = U\Sigma V^\intercal$. Then the problem simplifies, since this is a continuous convex objective (also there are some results that for some matrices optimizing this objective is equivalent of optimizing the rank problem).

Alternating Projection Method (Onto Convex Sets)

Prerequisites

We cover the basic results from Real Analysis that are useful in the convergence results later on.

Lemma 1 (Real bounded monotone sequence converges to its supremum/infimum) Any monotone bounded real sequence $({a}_n)_{n \in \nN}$ converges to its supremum/infimum $s = \sup \{ a_n \}_{n \in \nN}$.

Proof. Wlog assume that the sequence is non-decreasing. If the supremum is in the sequence, we are done. Otherwise, assume that the sequence does not contain its supremum. By definition of supremum, for any $\epsilon > 0$ there exists a $n$ s.t. $\forall m > n$ it holds that $\abs{a_m - s} < \epsilon$. But by definition of convergence, since for any $\epsilon > 0$ there exist elements that are withint $\epsilon$ distance to the supremum, it converges to $s$.

Theorem 1 (Bolzano–Weierstrass) A real bounded sequence $(a_n)_{n \in \nN}$ has a convergence subsequence.

Proof. We simply show that there exists a montone sequence, which by Lemma 1 proves its convergence. There are two cases to consider here:

Suppose there are infinitely many “peak” points – points that have the property that all the elements that follow it are less than (or equal) to them, i.e. a point $a_n$ is a peak if $\forall m > n$ it holds that $a_n \geq a_m$. If there are infinitely many such points we are done, since we can take these points as a monotone decreasing subsequence.
There are finitely many such points. Take the index $n$ to indicate last such point. Then there is no point with index $m > n$ that is as big as all proceeding points. This means that for any index $m$ there is a point $k > m$ that is at least as big than that point, i.e. $a_k \geq a_m$. Thus, we can take all such non-decreasing points as a subsequence.

Theorem 2 (Subsequences converge to the limit of the sequence) For a real convergent sequence $(a_n)_{n \in \nN}$ with limit $l$, any of its subsequences converge to $l$.

Proof. Consider any subsequence $(b_n)_{n \in \nN} \subset (a_n)_{n \in \nN}$. Then it holds that for orignal sequence for any $\epsilon > 0$ there exists a $N$ s.t. $\forall m > N$ it holds that $\abs{a_m - l} < \epsilon$. In turn, for any $m > N$ the same condition holds for $(b_n)_{n \in \nN}$, which imply convergence to the same limit point.

Convergence Guarantees

We now state the main result. The main idea is to show that at each iteration the alternate projections lead to decrease in distance to a point in the intersection by making use of Projection Theorem, which then shows boudedness of the sets.

Theorem 3 (Convergence of Alternating Projection Method) Let $C, D$ be convex sets with $C \cap D \neq \emptyset$, then the POCS converges to a point in the intersection $\bar x \in C \cap D$.

Proof. Let $c_k := \text{proj}_C(d_k)$ and $d_{k + 1} := \text{proj}_D(c_k)$. Then for an arbitrary point $\bar x \in C \cap D$, we have

\[\begin{equation} \begin{split} \norm{c^k - \tilde x}^2 = \norm{ (c^k - d^{k+1}) + (d^{k+1} - \tilde x^k)}^2 &= \norm{ (c^k - d^{k+1}) + (d^{k+1} + \tilde x^k)}^2 + 2\inner{c^k - d^{k+1}}{d^{k+1} - \tilde x^k} \\ &= \norm{ c^k - d^{k+1}}^2 + \norm{d^{k+1} - \tilde x^k}^2 - \underbrace{2\inner{c^k - d^{k+1}}{\tilde x^k - d^{k+1}}}_{\leq 0} \\ & \geq \norm{ c^k - d^{k+1}}^2 + \norm{d^{k+1} - \tilde x^k}^2. \end{split} \end{equation}\]

Where we used the Projection theorem to bound the inner product between points outside the convex set, projected point, and points outside it (that form an acute angle with all the points in the convex set). From this it follows that $\norm{d^{k+1} - \tilde x}^2 \leq \norm{c^k - \tilde x}^2 - \norm{c^k - d^{k+1}}^2$. Analogously repeating the same process with $d^{k+1}$ and $c^{k+1}$ we get that $\norm{c^{k+1} - \tilde x}^2 \leq \norm{d^{k+1} - \tilde x}^2 - \norm{d^{k+1} - c^{k+1}}^2$. Coupling both these equations together, this shows that

\[\begin{equation} \begin{split} \norm{c^{k+1} - \tilde x}^2 \leq \norm{c^{k} - \tilde x}^2 - \norm{c^k - d^{k+1}}^2 - \norm{d^{k+1} - c^{k+1}}^2. \end{split} \end{equation}\]

But this means that there is non-zero contraction of the length between sequence $(c_k)_{k \in \nN}$ and $\tilde x$, unless $c^{k} = d^{k+1}$ and $d^{k+1} = c^{k+1}$, in which case we are done since reprojecting these points would not change them.

Alternatively, we can show the convergence by considering subsequences constructed from the norms and showing there exists an indexing s.t. the sequence converges.

Convexity of the Sets

We have two sets of constraints: the rank constraint set $\cS_{\text{rank}}^r = \{ X \in \R^{m \times n} \, \mid \, \text{rank}(X) \leq r \}$, and the fidelity constraint set, sloppily written as $\cS_f^M = \{ X \in \R^{m \times n} \, \mid \, X_{i,j} = s_{i,j} \}$. If both sets are convex, we can simply use alternative projection algorithm to find a matrix that lies in the intersection between the two. As discussed before, $\cS_{\text{rank}}^r$ is not convex, so we approximate it with the nuclear norm constraint set $\cS_n^r = \{ X \in \R^{m \times n} \mid \, \norm{X}_*^r \leq r \}$.

$\cS_f^M$ is obviously convex. To show that $\cS_{n}^r$ is convex, we need to show that for $A, B \in \cS_n^r, \lambda \in [0,1]$ it holds that $\lambda A + (1-\lambda)B \in \cS_n^r$:

Claim 1 (Triangle inequality) For any real matrices $A, B \in \R^{m \times n}$ it holds that $\norm{A + B}_* \leq \norm{A}_* + \norm{B}_*$.

Proof.

\[\begin{equation} \begin{split} \norm{A + B}_* &= \tr{\sqrt{ (A + B)^\intercal (A + B)}} = \sum_{i = 1}^n \sqrt{\sum_{j=1}^n (a_{ji}^2 + b_{ji}^2 + 2b_{ji} a_{ji})}\\ &= \sum_{j=1}^n \norm{\ba_{*i} + \bb_{*i}}_2 \leq \sum_{i=1}^n \norm{\ba_{*i}}_2 + \norm{\bb_{*i}}_2 = \norm{A}_* + \norm{B}_* \end{split} \end{equation}\]

Projecting onto $\cS_n^r$ and $\cS_f^{\bs}$

Projecting onto the $\cS^s_f$ is easy. Throughout the post we assume projections based on Frobenius norm. In that way the projection onto the $\cS^s_f$ reads:

\[\begin{equation} \begin{split} \hat F := \text{proj}_{\cS^s_f} (F) = \argmin_{X \in \cS^s_f} \norm{X - F}_F^2 = \argmin_{X_{ij} = \bs_{ij}, \text{ where } \bs_{ij} \text{ is observed}} \norm{X - F}_F^2 \end{split} \end{equation}\]

which is equivalent to setting the $F$ elements to observed values in $\bs$, as this results in smallest possible projection error.

Projecting onto $\cS^\bs_f$ is trickier. To project an arbitrary matrix $N$ we need to solve the following problem:

\[\begin{equation} \begin{split} \hat N := \text{proj}_{\cS_f^\bs}(N) = \argmin_{X \in \cS_f^\bs} \norm{X - N}_F^2. \end{split} \end{equation}\]

Turns out this problem is harder than anticipated, and is actually out of scope of this post. For now, I refer to (missing reference) for conrete details as to how to implement it. The bottom line is that this projection is still achievable through SVD.

Solving the problem: finding intersection between $\cS_n^r$ and $\cS_f^{\bs}$

Once we know how to project on both convex sets, the solution to the problem becomes easy. Using the fact that as long as there exists a point in the intersection the alternating projection method converges, the algorithm is applicable.

But we do not know if there will be a point in the intersection. Luckily (not shown here), it turns out that when such a point does not exist, the algorithm actually converges to a set of points ($\bx^*, \by^*$) that are closest to both sets (in terms of having smallest possible distance between $\bx^* \in C, \by^* \in D$).