Notes on Conditional Gradient Method

$$ \newcommand{\bzero}{\mathbf{0}} \newcommand{\ba}{\mathbf{a}}\newcommand{\bA}{\mathbf{A}} \newcommand{\bb}{\mathbf{b}}\newcommand{\bB}{\mathbf{B}} \newcommand{\bc}{\mathbf{c}}\newcommand{\bC}{\mathbf{C}} \newcommand{\bd}{\mathbf{d}}\newcommand{\bD}{\mathbf{D}} \newcommand{\be}{\mathbf{e}}\newcommand{\bE}{\mathbf{E}} \newcommand{\bff}{\mathbf{f}}\newcommand{\bF}{\mathbf{F}} \newcommand{\bg}{\mathbf{g}}\newcommand{\bG}{\mathbf{G}} \newcommand{\bh}{\mathbf{h}}\newcommand{\bH}{\mathbf{H}} \newcommand{\bi}{\mathbf{i}}\newcommand{\bI}{\mathbf{I}} \newcommand{\bj}{\mathbf{j}}\newcommand{\bJ}{\mathbf{J}} \newcommand{\bk}{\mathbf{k}}\newcommand{\bK}{\mathbf{K}} \newcommand{\bl}{\mathbf{l}}\newcommand{\bL}{\mathbf{L}} \newcommand{\bm}{\mathbf{m}}\newcommand{\bM}{\mathbf{M}} \newcommand{\bn}{\mathbf{n}}\newcommand{\bN}{\mathbf{N}} \newcommand{\bo}{\mathbf{o}}\newcommand{\bO}{\mathbf{O}} \newcommand{\bp}{\mathbf{p}}\newcommand{\bP}{\mathbf{P}} \newcommand{\bq}{\mathbf{q}}\newcommand{\bQ}{\mathbf{Q}} \newcommand{\br}{\mathbf{r}}\newcommand{\bR}{\mathbf{R}} \newcommand{\bs}{\mathbf{s}}\newcommand{\bS}{\mathbf{S}} \newcommand{\bt}{\mathbf{t}}\newcommand{\bT}{\mathbf{T}} \newcommand{\bu}{\mathbf{u}}\newcommand{\bU}{\mathbf{U}} \newcommand{\bv}{\mathbf{v}}\newcommand{\bV}{\mathbf{V}} \newcommand{\bw}{\mathbf{w}}\newcommand{\bW}{\mathbf{W}} \newcommand{\indep}{\perp \!\!\! \perp} \newcommand{\bx}{\mathbf{x}}\newcommand{\bX}{\mathbf{X}} \newcommand{\by}{\mathbf{y}}\newcommand{\bY}{\mathbf{Y}} \newcommand{\bz}{\mathbf{z}}\newcommand{\bZ}{\mathbf{Z}} \newcommand{\balpha}{\boldsymbol{\alpha}}\newcommand{\bAlpha}{\boldsymbol{\Alpha}} \newcommand{\bbeta}{\boldsymbol{\beta}}\newcommand{\bBeta}{\boldsymbol{\Beta}} \newcommand{\bgamma}{\boldsymbol{\gamma}}\newcommand{\bGamma}{\boldsymbol{\Gamma}} \newcommand{\bdelta}{\boldsymbol{\delta}}\newcommand{\bDelta}{\boldsymbol{\Delta}} \newcommand{\bepsilon}{\boldsymbol{\epsilon}}\newcommand{\bEpsilon}{\boldsymbol{\Epsilon}} \newcommand{\bzeta}{\boldsymbol{\zeta}}\newcommand{\bZeta}{\boldsymbol{\Zeta}} \newcommand{\beeta}{\boldsymbol{\eta}}\newcommand{\bEta}{\boldsymbol{\Eta}} % \beta already taken \newcommand{\btheta}{\boldsymbol{\theta}}\newcommand{\bTheta}{\boldsymbol{\Theta}} \newcommand{\biota}{\boldsymbol{\iota}}\newcommand{\bIota}{\boldsymbol{\Iota}} \newcommand{\bkappa}{\boldsymbol{\kappa}}\newcommand{\bKappa}{\boldsymbol{\Kappa}} \newcommand{\blambda}{\boldsymbol{\lambda}}\newcommand{\bLambda}{\boldsymbol{\Lambda}} \newcommand{\bmu}{\boldsymbol{\mu}}\newcommand{\bMu}{\boldsymbol{\Mu}} \newcommand{\bnu}{\boldsymbol{\nu}}\newcommand{\bNu}{\boldsymbol{\Nu}} \newcommand{\bxi}{\boldsymbol{\xi}}\newcommand{\bXi}{\boldsymbol{\Xi}} \newcommand{\bomikron}{\boldsymbol{\omikron}}\newcommand{\bOmikron}{\boldsymbol{\Omikron}} \newcommand{\bpi}{\boldsymbol{\pi}}\newcommand{\bPi}{\boldsymbol{\Pi}} \newcommand{\brho}{\boldsymbol{\rho}}\newcommand{\bRho}{\boldsymbol{\Rho}} \newcommand{\bsigma}{\boldsymbol{\sigma}}\newcommand{\bSigma}{\boldsymbol{\Sigma}} \newcommand{\btau}{\boldsymbol{\tau}}\newcommand{\bTau}{\boldsymbol{\Tau}} \newcommand{\bypsilon}{\boldsymbol{\ypsilon}}\newcommand{\bYpsilon}{\boldsymbol{\Ypsilon}} \newcommand{\bphi}{\boldsymbol{\phi}}\newcommand{\bPhi}{\boldsymbol{\Phi}} \newcommand{\bchi}{\boldsymbol{\chi}}\newcommand{\bChi}{\boldsymbol{\Chi}} \newcommand{\bpsi}{\boldsymbol{\psi}}\newcommand{\bPsi}{\boldsymbol{\Psi}} \newcommand{\bomega}{\boldsymbol{\omega}}\newcommand{\bOmega}{\boldsymbol{\Omega}} \newcommand{\nA}{\mathbb{A}} \newcommand{\nB}{\mathbb{B}} \newcommand{\nC}{\mathbb{C}} \newcommand{\nD}{\mathbb{D}} \newcommand{\nE}{\mathbb{E}} \newcommand{\nF}{\mathbb{F}} \newcommand{\nG}{\mathbb{G}} \newcommand{\nH}{\mathbb{H}} \newcommand{\nI}{\mathbb{I}} \newcommand{\nJ}{\mathbb{J}} \newcommand{\nK}{\mathbb{K}} \newcommand{\nL}{\mathbb{L}} \newcommand{\nM}{\mathbb{M}} \newcommand{\nN}{\mathbb{N}} \newcommand{\nO}{\mathbb{O}} \newcommand{\nP}{\mathbb{P}} \newcommand{\nQ}{\mathbb{Q}} \newcommand{\nR}{\mathbb{R}} \newcommand{\nS}{\mathbb{S}} \newcommand{\nT}{\mathbb{T}} \newcommand{\nU}{\mathbb{U}} \newcommand{\nV}{\mathbb{V}} \newcommand{\nW}{\mathbb{W}} \newcommand{\nX}{\mathbb{X}} \newcommand{\nY}{\mathbb{Y}} \newcommand{\nZ}{\mathbb{Z}} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cW}{\mathcal{W}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \DeclareMathOperator*{\argmax}{argmax~} \DeclareMathOperator*{\argmin}{argmin~} \DeclareMathOperator*{\Tr}{Tr} \DeclareMathOperator*{\Bias}{Bias} \DeclareMathOperator*{\Var}{Var} \newcommand{\Perp}{\perp\!\!\! \perp} \let\dsad\d \renewcommand\d{\, \mathrm{d}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Eb}[1]{\mathbb{E} \left[ #1 \right]} \newcommand{\F}{\mathcal{F}} \newcommand{\X}{\mathcal{X}} \newcommand{\vocab}[1]{\textbf{\color{blue} #1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\inner}[2]{ \left \langle #1, #2 \right \rangle } \newcommand{\ibra}[1]{\llbracket #1 \rrbracket} \newenvironment{nalign}{ \begin{equation} \begin{aligned} }{ \end{aligned} \end{equation} \ignorespacesafterend } \newcommand{\icol}[1]{ \left(\begin{smallmatrix}#1\end{smallmatrix}\right)% } \newcommand{\cc}[1]{\overline{#1}} \newcommand{\tr}[1]{\text{tr} \left( #1 \right)} \newcommand{\aff}{\text{aff} \,} \newcommand{\ri}{\text{ri} \, } \newcommand{\rb}{\text{rb} \, } \newcommand{\cl}{\text{cl} \, } \newcommand{\conv}{\text{conv} \, } \newcommand{\irow}[1]{ \begin{smallmatrix}(#1)\end{smallmatrix}% } \newcommand\abs[1]{\left \lvert #1 \right \rvert} $$

Motivation

In this post, we will be concerned with the following optimization problem

\[\begin{equation} \begin{split} \min_{\bx \in C} h(\bx), \end{split} \end{equation}\]

with convex (and compact!) set $C$ and L-smooth $h: \R^n \to (-\infty, +\infty]$.

We could try to solve the problem using Proximal (Projected in this case) Gradient Descent (PGD). For that, first we rewrite the constraint set as an indicator, yielding an optimization problem of the form

\[\begin{equation} \begin{split} \min_{\bx \in C} h(\bx) = \min_\bx \delta_C(\bx) + h(\bx). \end{split} \end{equation}\]

Since $h$ is L-smooth, as long as we can compute the proximal mapping of the $\delta_C$, we can find the solution through the iterative PGD procedure. We find the proximal mapping of the convex indicator function as follows.

Proximal mapping of the convex indicator function

\[\begin{equation} \begin{split} \text{prox}_{\tau \delta_C}(\bar \bx) = \argmin_{\bx} \delta_C(\bx) + \frac{1}{2\tau}\norm{\bx - \bar \bx}^2 = \argmin_{\bx \in C} \frac{1}{2\tau} \norm{\bx - \bar \bx}^2 = \text{proj}_C(\bar \bx). \end{split} \end{equation}\]

So at each step, the PGD takes a step into the negative gradient direction, and then projects the point onto the convex constraint set $C$. This can be summarized as follows.

Algorithm (Projected Gradient Descent) Given L-smooth $h: \R^n \to (-\infty, +\infty]$, convex set $C$, learning rate $\tau > 0$ and initial point $\bx_0 \in \R^n$, PGD follows iterative scheme given by

\[\begin{equation} \begin{split} \bx_{k+1} &= \text{prox}_{\tau \delta_C} \left ( \bx_k - \tau \nabla h(\bx_k) \right ) \\ &= \text{proj}_C \left ( \bx_k - \tau \nabla h(\bx_k) \right ). \end{split} \end{equation}\]

So the execution of PGD depends on whether we can compute the projection onto the convex set in reasonable time. Luckily, there exists an alternative optimization algorithm when this is not the case – the Conditional Gradient Descent.

Proximal Gradient Descent as iterative quadratic minimization problem

Similarly as Gradient Descent, PGD can be derived from rewriting the iterates from the algorithm as the minimum point of the quadratic approximation of the function (due to Descent Lemma). This can be summarized as follows.

(Equivalence of iterates under PGD) $\bx_{k+1}$ from PGD can be equivalently acquired as

\[\begin{equation} \begin{split} \bx_{k+1} = \argmin_{\bx \in C} h(\bx^k) + \inner{\nabla h(\bx^k)}{\bx - \bx^k} + \frac{1}{2\tau} \norm{\bx - \bx}^2. \end{split} \end{equation}\]

Proof. Multply by $2\tau$, add (constant) $\tau^2 \norm{\nabla h(\bx^k)}^2$, and complete the square.

Conditional Gradient Descent: The Method

The perspective of PGD as minimization of a local quadratic approximation is crucial for understanding Conditional Gradient Descent (CGD); as it turns out, CGD can be seen as exactly the same minimization problem, just without the quadratic term. We summarize the algorithm as follows.

Algorithm 2 (Conditional Gradient Descent) Given L-smooth $h: \R^n \to (-\infty, +\infty]$, compact convex set $C$, and initial point $\bx_0 \in \R^n$, CGD follows iterative scheme given by

\[\begin{equation} \begin{split} \by_k &= \argmin_{\bx \in C} \inner{\bx}{\nabla h(\bx_k)}, \\ \bx_k &= \tau_k \by_k + (1-\tau_k) \bx_k \quad (\text{with } \tau_k := \frac{2}{k+2}). \end{split} \end{equation}\]

One important aspect to note is that the optimization is over a compact convex set $C$. This is contrast to PGD that does not have this restriction. Looking at the optimization problem that $\by_k$ solves, the compactness restriction makes sense, as if the set was unbounded, the solution would not be finite.

Question: Boundedness of the set makes sense, but the closedness is a bit trickier. While it makes sense that for an open set the actual argmin would not be “computable”, I wonder whether that has practical implications.

Apart from the learning rate decay and convex interpolation between the “candidate” point $\by_k$ and the iterate at the previous step, CGD corresponds to PGD’s optimization problem up to the linear term.

In its intuitive form, the main difference from Conditional Gradient Descent (CGD) from PGD is the way it projects the points onto the convex set $C$. In PGD, we take a step in the negative gradient direction of the optimized function, and project the newly acquired point onto the convex constraint set. In CGD, we couple the gradient step and projection step together, by first choosing a point that minimizes a linear function given by the negative gradient, and then taking a step towards that point from the current iterate. Let’s break these steps down.

Step 1: Minimizing a linear function over convex constraints

In order to get the candidate point $\by_k$, we need to find a point minimizing the linear objective over the convex constraint set. This has a very simple solution and can be summarized by the following proposition.

Proposition 3 (Minimization of a linear function over compact convex set) Given a closed convex set $C \subset \R^n$ and $\bm \in \R^n$, the solution of

\[\begin{equation} \begin{split} \argmin_{\bx \in C} \inner{\bm}{\bx} \end{split} \end{equation}\]

is attained for a point(s) on the boundary of $C$.

Proof. (Idea) Note that from any point $\bx$, taking a step in $\bm$ direction increases the objective, e.g. $\inner{\bx + \bm}{\bm} - \inner{\bx}{\bm} = \norm{\bm}^2$. So taking a step in $\bm$ direction and then projecting onto the convex set $C$ does not decrease the objective.

(Details): Again take arbitrary step in $\bm$ direction by defining $\by := \bx + \bm$. Then denote by $\by$’s projection onto $C$ as

\[\begin{equation} \begin{split} \by^* = \text{proj}_C(\by) = \argmin_{\bz \in C} \norm{\bz - \by}^2. \end{split} \end{equation}\]

But also note that by Projection Theorem onto convex sets, $\by^*$ satisfies $\forall \bz \in C$:

\[\begin{equation} \begin{split} \inner{\bz - \by^*}{\by - \by^*} \leq 0, \end{split} \end{equation}\]

i.e. the vector formed from the projection to the point that is being projected forms an obtuse angle with all the vectors from the projected vector to any point in the convex set. Plugging in the original definition of $\by$, and since $\bx \in C$, we can replace $\bz$ by $\bx$, we get that

\[\begin{equation} \begin{split} & \inner{\bx - \by^*}{\bx + \bm - \by^*} \leq 0 \\ \Rightarrow \quad & \norm{\bx - \by^*}^2 + \inner{\bx - \by^*}{\bm} \leq 0 \\ \Rightarrow \quad & \inner{\by^* - \bx}{\bm} \geq \norm{\bx - \by^*}^2. \end{split} \end{equation}\]

But this means that the difference of the objectives at $\by^*$ and $\bx$,

\[\begin{equation} \begin{split} \inner{\by^*}{\bm} - \inner{\bx}{\bm} = \inner{\by^* - \bx}{\bm} \geq \norm{\bx - \by^*}^2, \end{split} \end{equation}\]

is non-negative, and thus moving towards $\bm$ increases the objective. Since the objective is linear and has a single maximum point at infinity, such movement can be repeated in the same direction with positive increase in the objective until a fixed point is reached.

Proposition 3 is the half the story of making CGD practical. The other half is the duality beteen sets and half-spaces. We know that each closed convex set has an equivalent representation as an intersection of closed half-spaces. Coupling this with the Proposition 3 gives us a hint to minimize the linear objective by “parts” – segments that make up the convex set. Though such a representation is now always possible to use practically, we present an example when such minimization “by parts” serves useful.

Corollary 4 (Minimizing linear objective over a convex hull of discrete points) Suppose a compact convex set $C \subset \R^n$ can be expressed as a convex hull of points in $\R^n$, i.e. $C = \text{conv } {\ba_i }_{i=1}^m$. Then for the optimization problem of the form

\[\begin{equation} \begin{split} S = \argmin_{\bx \in C} \inner{\bx}{\bm}, \end{split} \end{equation}\]

it holds that for $i^* = \argmin_{i} \inner{\ba_i}{\bm}$, $\ba_{i^*} \in S.$

Proof. Follows from Proposition 3.

With this in mind, if we are given an optimization problem over a constraint set that can be decomposed into a set or vertices, it is sufficient to evaluate the objective function at these finite number of points.

Step 2: Moving towards $\by_k$ in a decaying fashion

At each step, the candidate point the iterate $\bx_{k+1}$ is obtained by taking a convex combination of te point at previous step, $\bx_k$, and the candidate point obtained from the linear optimization problem, $\by_k$. Intuitively it makes sense that candidates $\by_k$ might not converge to a solution (trivial example being an optimization function that has its minimal point inside the constraint set). But this choice of moving towards the candidate point in smaller steps at each iteration is quite strange, though it makes the algorithm converge, so that is what we look at next.

Model function framework

While previously we proved convergence rate of Gradient Descent in a ad-hoc fashion, we can actually do it through a more general way: the model function framework. The idea is to define a model function (sort of a linearization of the objective), and a proximal model function (model function + a quadratic part), and show that they satisfy a few desired properties. Then, the convergence of the algorithms follow. We first show these desired properties and prove that when they are satisfied, convergence follows. Then, we show convergence of Gradient Descent and CGD.

A rough guideline for defining the model function is as follows.

Notation (Model and Proximal Model function) We denote the model function by $f_{\bar \bx}(\bx)$ (which has to do with linearization around $\bar \bx$ as is application-specific), and the proximal model function by $f_{\bar \bx}^{\tau}$ (which has to do with linearization + quadratization around $\bar \bx$, usually $f_{\bar \bx}^\tau(\bx) := f_{\bar \bx}(\bx) + \text{quadratic-part}$).

Here (I am not really sure how usual that is, just am aware of a few cases when this happens) we desire the following properties to hold of the model functions for them to be useful in Gradient Descent:

Properties 5 (Desired properties for model function framework to be useful in Gradient Descent):

$f \geq f_{\bx_k}$;
It holds that $f(\bx) - f_{ \bx_k}(\bx) \leq \frac{L}{2} \norm{\bx - \bx_k}^2$;
$f_{\bx_k}^{1/\tau}:= f_{\bx_k} + \frac{1}{2\tau}\norm{\cdot - \bx_k}^2$, which is strongly convex (with coefficient $1/\tau$);
$f_{\bx_k}^{1/\tau}$ has minimizer $\bx^{k+1}$.

Optimization objective decay lemma for gradient descent for convex functions

The following is the central lemma based on which the gradient descent’s convergence is proved (see my previous post for an ad-hoc derivation):

Theorem 6 (GD objective decay?) For a convex and L-smooth $f$, running GD with a step size $\tau > 0$, it holds for $\bx_k ,\bx_{k+1}$ (from the GD evaluations) and an arbitrary point $\bx$ that

\[\begin{equation} \begin{split} f(\bx_{k+1}) + \frac{1}{2\tau} \norm{\bx_{k+1} - \bx}^2 \leq f(\bx) + \frac{1}{2\tau} \norm{\bx_k - \bx}^2 + \left( \frac{L}{2} - \frac{1}{2\tau} \right) \norm{\bx_k - \bx_{k+1}}^2. \end{split} \end{equation}\]

Our goal now is to show that as long as the desired properties of the model function framework are satisfied, this Theorem can be shown to be true. Note that we prove the following by assuming a GD-like procedure, since we do not specify how iterates are obtained and whether the full function is differentiable everywhere.

Proposition 7 If the properties in Properties 5 are satisfied, the conclusion of Theorem 6 holds in a GD-like optimization problem for iterates $\bx_k$.

Proof. Note that due to strong convexity of $f_{\bx_k}^\tau$, and $\bx_{k+1}$ being the minimizer of the local quadratic function, it holds for all $\bx$:

\[\begin{equation} \begin{split} f_{\bx_k}^{1/\tau}(\bx_{k+1}) \leq f_{\bx_k}^{1/\tau}(\bx) - \frac{1}{2\tau} \norm{\bx - \bx_{k+1}}^2. \end{split} \end{equation}\]

Using the linear model error in Property 5(2) and the fact that proximal model function is the sum of the model function and a quadratic, we bound $f_{\bx_k}(\bx_{k+1})$ in the LHS as follows:

\[\begin{equation} \begin{split} f(\bx_{k+1}) + \frac{1}{2\tau}\norm{\bx_k - \bx_{k+1}}^2 - \frac{L}{2} \norm{\bx_{k+1} - \bx_k}^2 \leq f_{\bx_k}^{1/\tau}(\bx_{k+1}) & \leq f_{\bx_k}^{1/\tau}(\bx) + \frac{1}{2\tau} \norm{\bx - \bx_{k+1}}^2 \\ \end{split} \end{equation}\]

Finally, we bound the RHS by noting that

\[\begin{equation} \begin{split} f_{\bx_k}^{1/\tau}(\bx) = f_{\bx_k}(\bx) + \frac{1}{2\tau} \norm{\bx- \bx_{k}}^2, \end{split} \end{equation}\]

and $f_{\bx_k}(\bx) \leq f(\bx)$, by Property 5(1). This yields the final expression

\[\begin{equation} \begin{split} f(\bx_{k+1}) + \frac{1}{2\tau}\norm{\bx_k - \bx_{k+1}}^2 - \frac{L}{2} \norm{\bx_{k+1} - \bx_k}^2 \leq f(\bx) + \frac{1}{2\tau} \norm{\bx- \bx_{k}}^2 - \frac{1}{2\tau} \norm{\bx - \bx_{k+1}}^2. \end{split} \end{equation}\]

We have just showed that as long as the properties given in Properties 5 are satisfied, the Theorem 6 posits that Propsition 7 holds, which is the sufficient to show (linear) convergence of GD algorithm. As such, as long as we can frame the optimization problem as a composition of model and proximal model functions satisfying the properties, we get the convergence rate “for free”.

Example #1: Gradient Descent convergence rate based on model function framework

To show application of this approach, consider GD applied to a convex and L-smooth function $f$. If we define the model function as

\[\begin{equation} \begin{split} f_{\bx}(\by) := f(\bx) + \inner{\nabla f(\bx)}{\by - \bx}, \end{split} \end{equation}\]

and the proximal model function as

\[\begin{equation} \begin{split} f_{\bx}^{\alpha}(\by) := f_\bx(\by) + \frac{\alpha}{2} \norm{\bx - \by}^2, \end{split} \end{equation}\]

we can show that the properties in Properties 5 are satisfied:

By convexity of $f$, its linearization always minorizes $f$;
By Descent Lemma, the linearization error is bounded by a quadratic function;
Holds by definition;
Holds by GD objective function, i.e. $\bx_{k+1}$ is the minimizer of the local-quadratic function.

Thus, Proposition 7 holds and can be used to show the linear convergence.

Example #2: Proximal (Projected) Gradient Descent convergence rate based on model function framework

Assume a general optimizaton problem given by

\[\begin{equation} \begin{split} \min_{\bx} g(\bx) + h(\bx) =: f(\bx), \end{split} \end{equation}\]

with $g$ convex and having simple proximal mapping, and $h$ convex and L-smooth. The goal here is to find appropriate model and proximal model functions that satisfy the properties in Properties 5.

One such a pair is the following:

\[\begin{equation} \begin{split} f_\bx(\by) &:= h(\bx) + g(\by) + \inner{\nabla h(\bx)}{\by - \bx} \\ f^\alpha_\bx(\by) &:= f_\bx(\by) + \frac{\alpha}{2} \norm{\bx - \by}^2. \end{split} \end{equation}\]

Note that the properties are satisfied:

$\forall \, \bx, \by: f(\by) - f_\bx(\by) = g(\by)+ h(\by) - h(\bx) - g(\by) - \inner{\nabla h(\bx)}{\by - \bx} = h(\by) - (h(\bx) + \inner{\nabla h(\bx)}{\by - \bx}) \geq 0$. The inequality holds because $h$ is convex and is minorized by its Taylor expansion up to the first term.
From (1.), applying Descent Lemma on L-smooth $h$ shows the linearization error is bounded by a quadratic function.
By definition.
Holds by PGD objective function.

As such, once agan, the Proposition 7 is thus applicable and linear convergence rate follows.

Conditional Gradient Descent: convergence

To show the convergence rate for CGD we will have to do some more work. Since L-smoothness is