Gradient Descent II: Case for Strongly Convex Functions

In this post I continue looking at Gradient Descent (GD), but now consider its application on strongly convex functions.

Convergence rates

It is useful to quantify how fast a sequence converges to its limit point. We denote the error $e_t$ at timestep $t$ to be defined as the difference between the function value at iteration $t$ due to GD, $f(x^t)$, and the function value to which GD converges, $f^*$. Formally, we write the error as $e_t := f(x^t) - \lim_{k\rightarrow\infty} f(x^k)$. Then we distinguish the rate of convergence as the limit of ratio of errors.

Definition (Rate of Convergence) For a convergent sequence $(a_n)_{n \in \nN}$, we say its rate of convergence is a constant defined as

\[\begin{equation} \begin{split} \rho := \lim_{t \to \infty } \frac{e^{t+1}}{e^t}. \end{split} \end{equation}\]

We can then distinguish between three rates of convergence and make some comments about their behavior in the limit:

  • Sublinear, when $\rho = 1$. This means that the progress made each iteration is roughly constant.
  • Linear, when $\rho \in (0, 1)$. This means that, roughly speaking, $e^{k+1} \approx \rho e^k$ and the error drops off linearly.
  • Superlinear, when $p = 0$. The error drops faster than an error bounded by a linear function.

GD convergence rate

Last time we showed that the error of GD for convex functions is bounded by $e^k = \cO(\frac{1}{k}), e^{k+1} = \cO(\frac{1}{k+1})$ with the same constant. This means that $\rho = 1$, and we have sublinear convergence, which is very slow in practise. To go around this problem, we will see that a linear rate can be achieved by considering strongly convex functions.

Application: GD with functions satisfying Polyak-Lojasiewicz inequality

Turns out that if a L-smooth function satisfies Polyak-Lojasiewicz (PL) inequality, we can show that GD converges linearly. We first define relevant terms and prove this statement.

Definition (Polyak-Lojasiewicz inequality) A differentiable function $f : \R^n \to \R$ is said to satify PL inequality if it holds for $\bx^* \in \argmin_{\bx} f(\bx)$ and any $\bx$ that

\[\begin{equation} \begin{split} \frac{1}{2} \norm{\nabla f(\bx)}^2 \geq \mu(f(\bx) - f(\bx^*)), \end{split} \end{equation}\]

with positive $\mu > 0$. In other words, the inequality holds if the gradient grows quadratically.

Intuitively it should make sense that such functions should be easier to optimize, because such functions are getting “steeper and steeper” as we move away from a minimum. As such, taking a step in negative gradient direction would move us towards the minimum more if we far away from it, rather than closer.

With this in mind, we show that any L-smooth differentiable (not necessarily convex) function satisfying PL inequality converges towards a minimum linearly.

Proposition (Linear Convergence of PL Functions) GD on a L-smooth function satisfying PL inequality with constant $\mu > 0$ converges linearly with a step size $\tau \leq \frac{1}{L}$.

Proof. In the last post we derived the Guaranteed Descend Lemma stating that for $\tau \leq \frac{1}{L}$ it holds that

\[\begin{equation} \begin{split} f(\bx^{k+1}) \leq f(\bx^k) - \frac{1}{2L} \norm{\nabla f(\bx^k)}^2, \end{split} \end{equation}\]

here we can bound the right side with the PL inequality as

\[\begin{equation} \begin{split} f(\bx^{k+1}) \leq f(\bx^k) - \frac{\mu}{L} (f(\bx^k) - f(\bx^*)), \end{split} \end{equation}\]

now subtract $f(\bx^*)$ (where $\bx^* \in \argmin_{\bx} f(\bx)$) from the both sides and get

\[\begin{equation} \begin{split} f(\bx^{k+1}) - f(\bx^*) \leq \left ( 1 - \frac{\mu}{L} \right ) (f(\bx^k) - f(\bx^*)), \end{split} \end{equation}\]

now note the application of the same process to $\bx^k$ and $\bx^{k-1}$:

\[\begin{equation} \begin{split} f(\bx^{k}) - f(\bx^*) \leq \left ( 1 - \frac{\mu}{L} \right ) (f(\bx^{k-1}) - f(\bx^*)). \end{split} \end{equation}\]

We can now see the recursive relationship which we can trace back to $\bx^0$ by repeated substitution as

\[\begin{equation} \begin{split} f(\bx^{k+1}) - f(\bx^*) \leq \left ( 1 - \frac{\mu}{L} \right )^{k+1} (f(\bx^0) - f(\bx^*)), \end{split} \end{equation}\]

which shows that if $ \frac{\mu}{L} < 1$, we get linear convergence rate. Actually, the $\mu$ and $L$ can be seen as the smallest and largest eigenvalues of the Hessian globally, respectively. As such, the condition $\frac{\mu}{L} < 1$ does not hold only in the case when all the eigenvalues are the same. When $\mu = L$, we can still show linear convergence rate, but need to use strongly-convex-specific tools, which we do not look into.

Application: GD with strongly convex functions

We can show that PL inequality holds for strongly-convex functions. Additionally, it holds that $\mu$ in PL inequality corresponds to the strong convexity coefficient. It also holds that $\mu \leq L$, a fact that we show first.

Proposition 2 A differentiable $\mu$-strongly convex function $f$ satisfies

\[\begin{equation} \begin{split} f(\by) \geq f(\bx) + \inner{\nabla f(\bx)}{\by - \bx} + \frac{\mu}{2}\norm{\by - \bx}^2. \end{split} \end{equation}\]

Proof. (Idea) By definition of strong convexity, $f - \frac{\mu}{2}$ is a convex function. Coupling it with the fact that differentiable convex functions are minorized at each point in their domain by the Taylor expansion up to the first term, we get the desired inequality.

Proposition 3 For a $L$-smooth $\mu$-strongly convex function it holds that $\mu \leq L$.

Proof. Note that by Descent Lemma it holds that

\[\begin{equation} \begin{split} f(\by) \leq f(\bx) + \inner{\nabla f(\bx)}{\by - \bx} + \frac{L}{2}\norm{\by - \bx}^2 \quad \forall \bx, \by. \end{split} \end{equation}\]

Coupling this with Proposition 2 gives the result.

This result can be seen intuitively. Note that the term from Descent Lemma minorizes the function, while the term from strong-convexity majorizes the function. Also, the linear term is exactly the same in both cases. This means that strong convexity implies a quadratic function that always lies below the function, and L-smoothness implies a quadratic function that always lies above the function. An example of this relationship is given in Figure 1.

Figure 1. Relationship between L-smoothness, which gives a majorant, and $\mu$-strongly conexity, which gives a miorant. .

Finally, we show that differentiable strongly-convex function indeed satisfy PL inequality.

Propsition 4 Differentiable $L$-smooth $\mu$-strongly convex functions satisfy PL inequality, with the PL constant equal to $\mu$.

Proof. Take the inequality implied by Proposition 2:

\[\begin{equation} \begin{split} f(\by) \geq f(\bx) + \inner{\nabla f(\bx)}{\by - \bx} + \frac{\mu}{2}\norm{\by - \bx}^2, \end{split} \end{equation}\]

and minimize both sides wrt $\by$. This is allowed since both sides are convex, so the minimizer is well-defined; additionally, since the inequality holds for all points in the domain, the inequality will also hold for the minimized point. Notice that this is easy since $f$ is differentiable and the right side is a linear function in $\by$, so we can simply solve for $\by$ where the gradient is equal to $\bzero$. Plugging in the minimizer for the RHS ($\by^* = \bx - \frac{\nabla f(\bx)}{\mu}$), yields:

\[\begin{equation} \begin{split} & f(\bx^*) \geq f(\bx) - \frac{1}{2\mu} \norm{\nabla f(\bx)}^2 \\ \Rightarrow \quad & 2\mu(f(\bx) - f(\bx^*)) \leq \norm{\nabla f(\bx)}^2, \end{split} \end{equation}\]

as desired.

Thus, strongly-convex function satisfy PL inequality, and get the linear convergence rate “for free”.




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Notes on "Domain Adaptation by Using Causal Inference to Predict Invariant Conditional Distributions" (2018)
  • Notes on "Attention Is All You Need" (2017)
  • Notes on Conditional Gradient Method
  • Notes on Fenchel Conjugate and Duality
  • Comparing Non-Smooth Optimization through Subgradients on Regularized Logistic Regression Problem