Gradient Descent I: Introduction and Convergence Analysis

$$ \newcommand{\bzero}{\mathbf{0}} \newcommand{\ba}{\mathbf{a}}\newcommand{\bA}{\mathbf{A}} \newcommand{\bb}{\mathbf{b}}\newcommand{\bB}{\mathbf{B}} \newcommand{\bc}{\mathbf{c}}\newcommand{\bC}{\mathbf{C}} \newcommand{\bd}{\mathbf{d}}\newcommand{\bD}{\mathbf{D}} \newcommand{\be}{\mathbf{e}}\newcommand{\bE}{\mathbf{E}} \newcommand{\bff}{\mathbf{f}}\newcommand{\bF}{\mathbf{F}} \newcommand{\bg}{\mathbf{g}}\newcommand{\bG}{\mathbf{G}} \newcommand{\bh}{\mathbf{h}}\newcommand{\bH}{\mathbf{H}} \newcommand{\bi}{\mathbf{i}}\newcommand{\bI}{\mathbf{I}} \newcommand{\bj}{\mathbf{j}}\newcommand{\bJ}{\mathbf{J}} \newcommand{\bk}{\mathbf{k}}\newcommand{\bK}{\mathbf{K}} \newcommand{\bl}{\mathbf{l}}\newcommand{\bL}{\mathbf{L}} \newcommand{\bm}{\mathbf{m}}\newcommand{\bM}{\mathbf{M}} \newcommand{\bn}{\mathbf{n}}\newcommand{\bN}{\mathbf{N}} \newcommand{\bo}{\mathbf{o}}\newcommand{\bO}{\mathbf{O}} \newcommand{\bp}{\mathbf{p}}\newcommand{\bP}{\mathbf{P}} \newcommand{\bq}{\mathbf{q}}\newcommand{\bQ}{\mathbf{Q}} \newcommand{\br}{\mathbf{r}}\newcommand{\bR}{\mathbf{R}} \newcommand{\bs}{\mathbf{s}}\newcommand{\bS}{\mathbf{S}} \newcommand{\bt}{\mathbf{t}}\newcommand{\bT}{\mathbf{T}} \newcommand{\bu}{\mathbf{u}}\newcommand{\bU}{\mathbf{U}} \newcommand{\bv}{\mathbf{v}}\newcommand{\bV}{\mathbf{V}} \newcommand{\bw}{\mathbf{w}}\newcommand{\bW}{\mathbf{W}} \newcommand{\bx}{\mathbf{x}}\newcommand{\bX}{\mathbf{X}} \newcommand{\by}{\mathbf{y}}\newcommand{\bY}{\mathbf{Y}} \newcommand{\bz}{\mathbf{z}}\newcommand{\bZ}{\mathbf{Z}} \newcommand{\balpha}{\boldsymbol{\alpha}}\newcommand{\bAlpha}{\boldsymbol{\Alpha}} \newcommand{\bbeta}{\boldsymbol{\beta}}\newcommand{\bBeta}{\boldsymbol{\Beta}} \newcommand{\bgamma}{\boldsymbol{\gamma}}\newcommand{\bGamma}{\boldsymbol{\Gamma}} \newcommand{\bdelta}{\boldsymbol{\delta}}\newcommand{\bDelta}{\boldsymbol{\Delta}} \newcommand{\bepsilon}{\boldsymbol{\epsilon}}\newcommand{\bEpsilon}{\boldsymbol{\Epsilon}} \newcommand{\bzeta}{\boldsymbol{\zeta}}\newcommand{\bZeta}{\boldsymbol{\Zeta}} \newcommand{\beeta}{\boldsymbol{\eta}}\newcommand{\bEta}{\boldsymbol{\Eta}} % \beta already taken \newcommand{\btheta}{\boldsymbol{\theta}}\newcommand{\bTheta}{\boldsymbol{\Theta}} \newcommand{\biota}{\boldsymbol{\iota}}\newcommand{\bIota}{\boldsymbol{\Iota}} \newcommand{\bkappa}{\boldsymbol{\kappa}}\newcommand{\bKappa}{\boldsymbol{\Kappa}} \newcommand{\blambda}{\boldsymbol{\lambda}}\newcommand{\bLambda}{\boldsymbol{\Lambda}} \newcommand{\bmu}{\boldsymbol{\mu}}\newcommand{\bMu}{\boldsymbol{\Mu}} \newcommand{\bnu}{\boldsymbol{\nu}}\newcommand{\bNu}{\boldsymbol{\Nu}} \newcommand{\bxi}{\boldsymbol{\xi}}\newcommand{\bXi}{\boldsymbol{\Xi}} \newcommand{\bomikron}{\boldsymbol{\omikron}}\newcommand{\bOmikron}{\boldsymbol{\Omikron}} \newcommand{\bpi}{\boldsymbol{\pi}}\newcommand{\bPi}{\boldsymbol{\Pi}} \newcommand{\brho}{\boldsymbol{\rho}}\newcommand{\bRho}{\boldsymbol{\Rho}} \newcommand{\bsigma}{\boldsymbol{\sigma}}\newcommand{\bSigma}{\boldsymbol{\Sigma}} \newcommand{\btau}{\boldsymbol{\tau}}\newcommand{\bTau}{\boldsymbol{\Tau}} \newcommand{\bypsilon}{\boldsymbol{\ypsilon}}\newcommand{\bYpsilon}{\boldsymbol{\Ypsilon}} \newcommand{\bphi}{\boldsymbol{\phi}}\newcommand{\bPhi}{\boldsymbol{\Phi}} \newcommand{\bchi}{\boldsymbol{\chi}}\newcommand{\bChi}{\boldsymbol{\Chi}} \newcommand{\bpsi}{\boldsymbol{\psi}}\newcommand{\bPsi}{\boldsymbol{\Psi}} \newcommand{\bomega}{\boldsymbol{\omega}}\newcommand{\bOmega}{\boldsymbol{\Omega}} \newcommand{\nA}{\mathbb{A}} \newcommand{\nB}{\mathbb{B}} \newcommand{\nC}{\mathbb{C}} \newcommand{\nD}{\mathbb{D}} \newcommand{\nE}{\mathbb{E}} \newcommand{\nF}{\mathbb{F}} \newcommand{\nG}{\mathbb{G}} \newcommand{\nH}{\mathbb{H}} \newcommand{\nI}{\mathbb{I}} \newcommand{\nJ}{\mathbb{J}} \newcommand{\nK}{\mathbb{K}} \newcommand{\nL}{\mathbb{L}} \newcommand{\nM}{\mathbb{M}} \newcommand{\nN}{\mathbb{N}} \newcommand{\nO}{\mathbb{O}} \newcommand{\nP}{\mathbb{P}} \newcommand{\nQ}{\mathbb{Q}} \newcommand{\nR}{\mathbb{R}} \newcommand{\nS}{\mathbb{S}} \newcommand{\nT}{\mathbb{T}} \newcommand{\nU}{\mathbb{U}} \newcommand{\nV}{\mathbb{V}} \newcommand{\nW}{\mathbb{W}} \newcommand{\nX}{\mathbb{X}} \newcommand{\nY}{\mathbb{Y}} \newcommand{\nZ}{\mathbb{Z}} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cW}{\mathcal{W}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \DeclareMathOperator*{\argmax}{argmax~} \DeclareMathOperator*{\argmin}{argmin~} \DeclareMathOperator*{\Tr}{Tr} \DeclareMathOperator*{\Bias}{Bias} \DeclareMathOperator*{\Var}{Var} \newcommand{\Perp}{\perp\!\!\! \perp} \let\dsad\d \renewcommand\d{\, \mathrm{d}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Eb}[1]{\mathbb{E} \left[ #1 \right]} \newcommand{\F}{\mathcal{F}} \newcommand{\X}{\mathcal{X}} \newcommand{\vocab}[1]{\textbf{\color{blue} #1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\inner}[2]{ \left \langle #1, #2 \right \rangle } \newcommand{\ibra}[1]{\llbracket #1 \rrbracket} \newenvironment{nalign}{ \begin{equation} \begin{aligned} }{ \end{aligned} \end{equation} \ignorespacesafterend } \newcommand{\icol}[1]{ \left(\begin{smallmatrix}#1\end{smallmatrix}\right)% } \newcommand{\cc}[1]{\overline{#1}} \newcommand{\tr}[1]{\text{tr} \left( #1 \right)} \newcommand{\aff}{\text{aff} \,} \newcommand{\ri}{\text{ri} \, } \newcommand{\rb}{\text{rb} \, } \newcommand{\cl}{\text{cl} \, } \newcommand{\conv}{\text{conv} \, } \newcommand{\irow}[1]{ \begin{smallmatrix}(#1)\end{smallmatrix}% } \newcommand\abs[1]{\left \lvert #1 \right \rvert} $$

The goal of these notes is to look into Gradient Descent and analyze its convergence rate in relatively general setting. Turns out that fairly much can be said about the convergence when only differentiability and L-smoothness are assumed. However, this does not guarantee convergence to a local optimimum. In order to go around this, we add convexity assumption and show that convergence indeed happens for sufficiently small step size.

We start with some prerequisites and then move on the real deal.

Prerequisites

Throughout the post we will assume that function $f$ is L-smooth, intuitivelly meaning that its gradients cannot change “too fast”.

Definition 1 (L-smoothness) A differentiable function $f : \R^n \to \R$ is called L-smooth if its gradients are Lipshitz-continuous everywhere with the (smallest) constant $L > 0$, i.e.

\[\begin{equation} \begin{split} \norm{\nabla f(\bx) - \nabla f(\by)} \leq L \norm{\bx - \by}. \end{split} \end{equation}\]

Alternatively, when $f$ is twice differentiable, it is useful to use the following characterization of L-smoothness.

Proposition 2 If $f : \R^n \to \R$ is L-smooth and twice differentiable, then \(\begin{equation} \begin{split} \nabla^2 f(\bx) \preceq L \nI \quad \forall \bx, \end{split} \end{equation}\)

i.e. all eigenvalues are greater or equal to $L$ when the Hessian is evaluated at any $\bx$ in the domain.

Proof. TODO

In its simplest form, Gradient Descent (GD) repeatedly takes steps into negative gradient direction and hopes that the overall objective function decreases. To show that negative gradient direction is indeed optimal, we need to quantify the “goodness” of a direction. We can do this by using the notions of directional derivative and and optimality condition using it.

Directional derivative

Defitinion 3 (Directional derivative) A directional derivative of a function $f$ in a direction $\bd$ is defined as

\[\begin{equation} \begin{split} f'(\bx; \bd) := \lim_{t \to 0} \frac{f(\bx + t\bd) - f(\bx)}{t}. \end{split} \end{equation}\]

Because we will make us of Taylor expansion, we define litte-o notation to help us truncate the series when dealing with limits.

Definition 4 Given $f, g : \R \to \R$ we say that $f \in o(g)$ if it holds that $\lim_{x \to 0} \frac{f(x)}{g(x)} = 0$.

Then we can work with directional derivatives in a more convenient way when the function $f$ is differentiable.

Proposition 5 For a differentiable $f$ at $\bx$, it holds that $f’(\bx, \bd) = \inner{\nabla f(\bx)}{\bd}$.

Proof. By definition,

\[\begin{equation} \begin{split} f'(\bx; \bd) &= \lim_{t \to 0} \frac{f(\bx + t\bd) - f(\bx)}{t} = \lim_{t \to 0} \frac{f(\bx) + \inner{\nabla f(\bx)}{t \bd} + o(t \norm{\bd}) - f(\bx)}{t} \\ & = \lim_{t \to 0} \frac{ t \inner{\nabla f(\bx)}{\bd} + o(t \norm{\bd}) }{t} = \inner{\nabla f(\bx)}{\bd}. \end{split} \end{equation}\]

where we used the Taylor expansion, and the fact that as $t \to 0$, $o(t \norm{\bd}) / t = 0$.

And now we can finally state an optimality condition for $\bx^*$ to be a local minimum in terms of the directional derivative.

Proposition 6 A point $\bx^*$ is a local minimum of a differentiable function $f$ iff $f’(\bx^*, \bd) \geq 0 \quad \forall \bd$.

Proof. ($\Rightarrow$): From Calculus it holds that $\nabla f(\bx^*) = 0$ if $\bx^*$ is a minimum, thus the forward direction holds using $f’(\bx^*, \bd) = \inner{\nabla f(\bx^*)}{\bd} = 0$. ($\Leftarrow$): Assuming it holds that $\inner{\nabla f(\bx)}{\bd} \geq 0$, then for $\bd = \nabla f(\bx) \Rightarrow \norm{\nabla f(\bx)}^2 \geq 0$, also for $\bd = -\nabla f(\bd)$ it holds that $\norm{\nabla f(\bx)} \leq 0$, thus $\nabla f(\bx) = 0$.

Optimal (local) direction

Finally, we can use the facts above to arrive at the conclusion that taking $d = -\nabla f(\bx)$ direction is the direction that achieves the highest descent. Formally, we are interested in such a local direction $\bd$ (with $\norm{\bd} = 1$) s.t. the function decreases the most. This can simply be reduced to a 1D-derivative case, as we only want to find the locally highest descent. This can be formalized as

\[\begin{equation} \begin{split} \bd^{\text{optimal}} = \argmin_{\norm{\bd} = 1} f'(\bx, \bd) = \argmin_{\norm{\bd} = 1} \inner{\nabla f(\bx)}{\bd} = \argmin_{\norm{\bd} = 1} \norm{\nabla f(\bx)} \norm{\bd} \cos \theta = - \norm{\nabla f(\bx)}^2, \end{split} \end{equation}\]

the the conclusion follows since $\norm{\bd} = 1$ and we can take $-\nabla f(\bx)$ to reach the lower bound of the minimization.

Gradient descent

We will briefly introduce the gradient descent algorithm, and then analyse its convergence in a general L-smooth setting.

The algorithm

The consideration of optimal local direction suggests an algorithm: take small steps in the greatest descent direction and repeat until convergence. But there is a caveat when it comes to the actual step size. In the above, we implicitly assumed that the step is infinitesimally small, which cannot be implemented in practice. However, assuming we can actually select a sufficiently small size (the conditions for which will be explained later), the Vanilla Gradient Descent algorithm can be formalized in (for now) almost assumption-free way, which does not guarantee convergence.

Algorithm (Vanilla Gradient Descent (VGD)) For a differentiable function $f: \R^n \to \R$, a step size $\tau > 0$, and a starting point $x^0 \in \text{dom }f$ at each iteration step $t$, VGD updates the current iterate $\bx^{t-1}$ by the update rule $\bx^t := \bx^t - \tau \nabla f(\bx^t)$.

To analyse its convergence properties, we need to make some assumptions. In particular, from now on we assume that $f(\cdot)$ is L-smooth (its curvature is well-behaved) and bounded from below (there exists a finite minimum). These assumptions go into the Descent lemma, a central player in the analysis of convergence properties for many variations of gradient descent algorithms.

Convergence analysis: non-convex case

Key lemma: the Descent Lemma

In its intuitive form, the Descent Lemma states that L-smooth functions always exhibit two quadratic function with the same quadratic coefficient (but location-dependent linear and constant terms) that define a minorizer and majorizer of the function at every point $x$. It turns out these bounds are globally-tightest, meaning that no other quadratic functions are “as wide” as those supplied by the Descent Lemma. Note that this is also true globally, and locally one can find “wider” quadratic functions. This connection is important, as it will turn out that Gradient Descent is doing nothing more than minimizing a quadratic objective at each step. Formally, the Descent Lemma can be stated as follows.

Lemma 7 (Descent Lemma) For a L-smooth function $f$ it holds at every point $\bar x$ that

\[\begin{equation} \begin{split} f(\bar \bx) + \inner{\nabla f(\bx)}{\bx - \bar \bx} - \frac{L}{2} \norm{\bx - \bar \bx}^2 \leq f(\bx) \leq f(\bar \bx) + \inner{\nabla f(\bx)}{\bx - \bar \bx} + \frac{L}{2} \norm{\bx - \bar \bx}^2 \quad \forall \bx. \end{split} \end{equation}\]

Proof. (source). Consider a 1D function $h(t) = f(\bar \bx + t (\bx - \bar\ \bx))$, that defines a line from $\bar \bx$ to $\bx$ (for $t \in [0,1]$). For convenience define $\bz_t := \bar \bx + t (\bx - \bar \bx)$, then it follows, starting from the fundamental theorem of Calculus:

\[\begin{equation} \begin{split} f(\bx) - f(\bar \bx) = h(1) - h(0) = \int_0^1 h'(t) \d t = \int_0^1 \inner{\nabla_{\bz_t} f(\bz_t)}{\bx - \bar \bx} \d t \end{split} \end{equation}\]

then subtracting the starting derivative $h’(0) = \inner{\nabla_{\bar \bx} f(\bx)}{\bx - \bar \bx}$ from the both sides

\[\begin{equation} \begin{split} & \quad f(\bx) - f(\bar \bx) - \inner{\nabla_{\bar \bx} f(\bar \bx)}{\bx - \bar \bx} &&= \int_0^1 \inner{\nabla_{\bz_t} f(\bz_t) - \nabla_{\bar \bx} f(\bar \bx)}{\bx - \bar \bx} \d t \\ \Rightarrow & \quad \abs{f(\bx) - f(\bar \bx) - \inner{\nabla_{\bar \bx} f(\bar \bx)}{\bx - \bar \bx}} &&= \abs{\int_0^1 \inner{\nabla_{\bz_t} f(\bz_t) - \nabla_{\bar \bx} f(\bar \bx)}{\bx - \bar \bx} \d t} \\ & && \leq \int_0^1 \abs{\inner{\nabla_{\bz_t} f(\bz_t) - \nabla_{\bar \bx} f(\bar \bx)}{\bx - \bar \bx}} \d t \\ & && \leq \int_0^1 \underbrace{\norm{\nabla_{\bz_t} f(\bz_t) - \nabla_{\bar \bx} f(\bar \bx)}}_{\leq L \norm{\bz_t - \bar \bx} = L t \norm{\bx - \bar \bx}} \norm{\bx - \bar \bx} \d t \qquad \text{(Cauchy-Schwarz)} \\ & && = \frac{L}{2} \norm{\bx - \bar \bx}^2, \end{split} \end{equation}\]

as desired.

Alternative Proof. Based on (missing reference) (for infinitely differentiable $f$?). Their proof does not make sense to me, as to me it seems like it can only work if the function is either exactly equal to its Taylor expansion up to second term, or the difference between the considered point is negligible.

Stopping criterion for Gradient Descent and guaranteed descent

One popular method to stop the GD procedure is based on the norm of the gradients. Concretely, we stop the procedure when for some $k$ it holds that $\norm{\nabla f(\bx^k)} \leq \epsilon$. Turns out this is a good idea only for the strongly convex case (see (missing reference)) (for non-convex functions one can construct many “almost” local-minimum points). However, if we ignore this caveat and proceed with this termination option, we can actually bound the rate of convergence without assumptions of convexity. For that, we first show guaranteed decrease of function value, and then bound the actual gradient step.

Proposition 8 (GD Step Bound) GD with learning rate $\tau$ on a L-smooth $f$ satisfies

\[\begin{equation} \begin{split} f(\bx^{k+1}) \leq f(\bx^k) - \left(\frac{1}{\tau} - \frac{L}{2} \right) \norm{\nabla f(\bx^k)}^2. \end{split} \end{equation}\]

Proof. Plug in $\bar \bx = \bx^k - \tau \nabla f(\bx^k)$ and $\bar \bx = \bx^k$ into the Descent Lemma.

Corollary 9 (Guaranteed Descent) GD with $\tau = 1/L$ on a L-smooth $f$ satisfies

\[\begin{equation} \begin{split} f(\bx^{k+1}) \leq f(\bx^{k}) - \frac{1}{2L}\norm{\nabla f(\bx^k)}^2. \end{split} \end{equation}\]

Proof. Follows from GD Step Bound Proposition.

Remark: Note that guaranteed descend is also obtained for $\tau < \frac{2}{L}$, but we use the more restricted version of step size $\tau = \frac{1}{L}$, because it makes the convergence analysis easier (..right?).

Claim 10 GD on a L-smooth function $f$ with a learning rate of $\tau = \frac{1}{L}$ reaches $\norm{\nabla f(\bx^k)}^2 \leq \epsilon$ for some $k$ in $\cO(\frac{1}{\epsilon})$ iterations .

Proof. Sum up both sides in the Guaranteed Descent Lemma, up to some $t$ and get

\[\begin{equation} \begin{split} f(\bx^{t+1}) - f(\bx^0) \geq \frac{1}{2L} \sum_{l=1}^t \norm{\nabla f(\bx^l)}^2 \geq \frac{t}{2L} \min_{i} \left \{ \norm{\nabla f(\bx^i)}^2 \right \}, \end{split} \end{equation}\]

where we used the telescoping nature of the left side, and bounded the right side with minimum gradient-norm of the sum. We can then bound the left side to make it independent of $\bx_0$ (to have a proper constant to bound) by replacing $f(\bx^0)$ with $f(\bx^*) := \inf_x f(\bx)$. This then yields

\[\begin{equation} \begin{split} f(\bx^{t+1}) - f(\bx^*) \geq f(\bx^{t+1}) - f(\bx^0) \geq \frac{t}{2L} \min_{i} \left \{ \norm{\nabla f(\bx^i)}^2 \right \}, \end{split} \end{equation}\]

and from this it follows that \(\begin{equation} \begin{split} \frac{2L(f(\bx^{t+1}) - f(\bx^*))}{t} = \cO \left(\frac{1}{t} \right) \geq \min_{i} \left \{ \norm{\nabla f(\bx^i)}^2 \right \}. \end{split} \end{equation}\) Thus the minimum gradient-norm drops sublinearly. This implies that the error drops sublinearly in terms of iterations $t$. This means that by setting the left side to $\epsilon$, it will yield that

\[\begin{equation} \begin{split} & \quad \epsilon = \frac{2L(f(\bx^{t+1}) - f(\bx^*))}{t} \geq \min_{i} \left \{ \norm{\nabla f(\bx^i)}^2 \right \} \\ \Rightarrow & \quad t = \cO \left(\frac{1}{\epsilon} \right) \geq \min_{i} \left \{ \norm{\nabla f(\bx^i)}^2 \right \}, \end{split} \end{equation}\]

and so alternatively we can state that the minimum gradient-norm drops off sublinearly in error $\epsilon$.

Interpretation of Gradient descent: iterative quadratic-minimization procedure

One perspective that it both useful to gain intuition into GD but also proves to be useful in future results, is that GD can be seen as iteratively taking steps that minimize a quadratic function from the current iterate $x^{k}$. This quadratic corresponds to exactly the first term of the Taylor expansion (i.e. the linear term corresponds to the slope of the function evaluated at $\bx^k$), while the quadratic term is exactly $\frac{1}{\tau}$. I only present a verification of this fact:

\[\begin{equation} \begin{split} &\argmin_{\bx} f_{\bx^k}^{\frac{1}{\tau}}(\bx) = \argmin_\bx f(\bx^k) + \inner{\nabla f(\bx^k)}{\bx - \bx^k} + \frac{1}{2\tau} \norm{\bx^k - \bx}^2 \\ &\Rightarrow \quad \nabla_\bx \left [ \inner{\nabla f(\bx^k)}{\bx - \bx^k} + \frac{1}{2\tau} \norm{\bx^k - \bx}^2 \right] \overset{!}{=} 0 \\ & \rightsquigarrow \quad \bx = \bx^k - \tau \nabla f(\bx^k) =: \bx^{k+1}. \end{split} \end{equation}\]

Convergence analysis: convex case

Of course the stopping criterion based on the gradient norm is flawed. When the function is not convex, we can imagine a simple function that becomes arbitrary flat but still is far away from the actual (local) minimum. If we add convexity, we can then bound the actual convergence to the actual global minimum. For this, however, one result relating points from gradient descent iteration $(\bx^k, \bx^{k+1})$ and an arbitrary point $\bx$ is required. This result makes use of all the assumptions of the function $f$ and GD in this case: the function is L-smooth, convex, and we run GD with sufficiently small learning rate.

Lemma relating $\bx^k, \bx^{k+1}$ and any arbitrary point $\bx$ in convex setting

As mentioned, the main assumptions now are that $f$ is convex and L-smooth. For convenience we define the following terms.

Defintion 11 We denote the linearization around a point $\bx$ as $f_{\bx}(\by) := f(\by) + \inner{\nabla f(\bx)}{\by - \bx}$ and the quadratic with constant $c$ to be $f_\bx^{c}(\by) := f_\bx(y) + \frac{c}{2} \norm{\by - \bx}^2$.

Our main goal is to relate the three points in a way that only involes the differences between the points. We can bound the error $f(\bx^{k+1}) - f(\bx)$ by noticing that the first term is always less than the quadratic around the point $\bx_k$ (due to Descent Lemma the quadratic with constant L always lies above the graph), and the second term is at least as high as linearization from any point (due to conexity), we take this point to be $\bx$ and can bound the difference as

\[\begin{equation} \begin{split} f(\bx^{k+1}) - f(\bx) \leq f^{L}_{\bx^k}(\bx^{k+1}) - f_{\bx^k}(\bx). \end{split} \end{equation}\]

Now we want to squeeze the learning-rate dependent term. Note that if we subtract another quadratic at the same point, all the parts except for the quadratic will cancel. As such, we can write $f_{\bx^k}^L(\bx^{k+1}) - f_{\bx^k}^{\frac{1}{\tau}}(\bx^{k+1}) = (\frac{L}{2} - \frac{1}{2\tau}) \norm{\bx^k - \bx^{k+1}}^2$, from here re-express $f_{\bx^k}^L$, and write the full expression as

\[\begin{equation} \begin{split} f^{L}_{\bx^k}(\bx^{k+1}) - f_{\bx^k}(\bx) = f_{\bx^k}^{\frac{1}{\tau}}(\bx^{k+1}) + \left(\frac{L}{2} - \frac{1}{2\tau} \right) \norm{\bx^k - \bx^{k+1}}^2 - f_{\bx^k}(\bx). \end{split} \end{equation}\]

Now we make use of the fact that $\bx_{k+1}$ is the minimizer of the $\frac{1}{\tau}$ quadratic and also this quadratic is strongly-convex (TODO: define what this means :)). To exploit this, we make use of the fact that if some $g$ is strongly convex with parameter $\mu$, then for any $x$, and the minimizer $x^*$ it holds that $g(x) \geq g(x^*) + g’(x)(x-x^*)$ (and analogously for multi-dimensional functions). Using this we bound the quadratic $f^{\frac{1}{\tau}}_{\bx^k}$ using the minimizer $\bx^{k+1}$ and yield

\[\begin{equation} \begin{split} f^{L}_{\bx^k}(\bx^{k+1}) - f_{\bx^k}(\bx) &\leq f_{\bx^k}^{\frac{1}{\tau}}(\bx) - \frac{1}{2\tau} \norm{\bx - \bx^{k+1}}^2 + \left(\frac{L}{2} - \frac{1}{2\tau} \right) \norm{\bx^k - \bx^{k+1}}^2 - f_{\bx^k}(\bx) \\ &= f_{\bx^k}(\bx) + \frac{1}{2\tau} \norm{\bx - \bx^{k}}^2 + \frac{1}{2\tau} \norm{\bx - \bx^{k+1}}^2 + \left(\frac{L}{2} - \frac{1}{2\tau} \right) \norm{\bx^k - \bx^{k+1}}^2 - f_{\bx^k}(\bx) \\ &= \frac{1}{2\tau} \norm{\bx - \bx^{k}}^2 + \frac{1}{2\tau} \norm{\bx - \bx^{k+1}}^2 + \left(\frac{L}{2} - \frac{1}{2\tau} \right) \norm{\bx^k - \bx^{k+1}}^2 \end{split} \end{equation}\]

where also the linear part of the last got cancelled due to the expansion of the quadratic. Finally, we can join in together the original left size and arrive at the final expression:

\[\begin{equation} \begin{split} f(\bx^{k+1}) - f(\bx) \leq \frac{1}{2\tau} \norm{\bx - \bx^{k}}^2 + \frac{1}{2\tau} \norm{\bx - \bx^{k+1}}^2 + \left(\frac{L}{2} - \frac{1}{2\tau} \right) \norm{\bx^k - \bx^{k+1}}^2 . \end{split} \end{equation}\]

This result is so important what we state it as a Theorem.

Theorem 12 For a convex and L-smooth $f$, running GD with a step size $\tau > 0$, it holds for $\bx^k ,\bx^{k+1}$ (from the GD evaluations) and an arbitrary point $\bx$ that

\[\begin{equation} \begin{split} f(\bx^{k+1}) + \frac{1}{2\tau} \norm{\bx^{k+1} - \bx}^2 \leq f(\bx) + \frac{1}{2\tau} \norm{\bx^k - \bx}^2 + \left( \frac{L}{2} - \frac{1}{2\tau} \right) \norm{\bx^k - \bx^{k+1}}^2. \end{split} \end{equation}\]

Proof. See right above.

Convergence rate of GD for convex functions

The approach to yield the convergence rate to the set of global minimum will be similar to that of bounding the gradient norm. The main idea is to use the Convex GD Theorem and bound the iterate errors by a constant using telescoping sums.

This result can be summarized as the following theorem.

Theorem 13 (GD Convergence Rate for Convex Functions) Given a L-smooth convex function $f$, GD with a step size $\tau < \frac{1}{L}$ converges to the optimal value $f^*$ with a rate of $\cO \left ( \frac{1}{n} \right )$ in iterations $n$.

Proof. First of all, note that by Corollary 9, the iterate values decrease, i.e. $f(\bx^{k+1}) \leq f(\bx^k)$. Then simply use the Theorem 12 for convex functions by plugging in $\bx \leftarrow \bx^*$ and get

\[\begin{equation} \begin{split} f(\bx^{k+1}) - f(\bx^*) \leq \frac{1}{2\tau} \norm{\bx^k - \bx^*} - \frac{1}{2\tau} \norm{\bx^{k+1} - \bx^*}^2, \end{split} \end{equation}\]

where note that by assumption the term that we dropped was negative (since $\tau < 1/L$), and thus it can be upper-bounded. Now sum up both sides with varying $k$ and note that RHS collapses to just two terms due to telescoping sum:

\[\begin{equation} \begin{split} \sum_{t=1}^n (f(\bx^{t+1}) - f(\bx^*)) \leq \frac{1}{2\tau} \norm{\bx^0 - \bx^*} - \frac{1}{2\tau} \norm{\bx^{k+1} - \bx^*}^2 \leq \frac{1}{2\tau} \norm{\bx^0 - \bx^*}^2, \end{split} \end{equation}\]

but now note that the LHS can be lower-bounded due to increasingness of function values, thus

\[\begin{equation} \begin{split} & n(f(\bx^{n+1}) - f(\bx^*)) \leq \sum_{t=1}^n (f(\bx^{t+1}) - f(\bx^*)) \leq \frac{1}{2\tau} \norm{\bx^k - \bx^*} - \frac{1}{2\tau} \norm{\bx^{k+1} - \bx^*}^2 \leq \frac{1}{2\tau} \norm{\bx^0 - \bx^*}^2 \\ \Rightarrow \quad & f(\bx^{n} - f(\bx^*)) \leq \frac{\frac{1}{2\tau} \norm{\bx^0 - \bx^*}^2}{n} = \cO(1/n). \end{split} \end{equation}\]



Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Notes on "Domain Adaptation by Using Causal Inference to Predict Invariant Conditional Distributions" (2018)
  • Notes on "Attention Is All You Need" (2017)
  • Notes on Conditional Gradient Method
  • Notes on Fenchel Conjugate and Duality
  • Comparing Non-Smooth Optimization through Subgradients on Regularized Logistic Regression Problem