Comparing Non-Smooth Optimization through Subgradients on Regularized Logistic Regression Problem

$$ \newcommand{\bzero}{\mathbf{0}} \newcommand{\ba}{\mathbf{a}}\newcommand{\bA}{\mathbf{A}} \newcommand{\bb}{\mathbf{b}}\newcommand{\bB}{\mathbf{B}} \newcommand{\bc}{\mathbf{c}}\newcommand{\bC}{\mathbf{C}} \newcommand{\bd}{\mathbf{d}}\newcommand{\bD}{\mathbf{D}} \newcommand{\be}{\mathbf{e}}\newcommand{\bE}{\mathbf{E}} \newcommand{\bff}{\mathbf{f}}\newcommand{\bF}{\mathbf{F}} \newcommand{\bg}{\mathbf{g}}\newcommand{\bG}{\mathbf{G}} \newcommand{\bh}{\mathbf{h}}\newcommand{\bH}{\mathbf{H}} \newcommand{\bi}{\mathbf{i}}\newcommand{\bI}{\mathbf{I}} \newcommand{\bj}{\mathbf{j}}\newcommand{\bJ}{\mathbf{J}} \newcommand{\bk}{\mathbf{k}}\newcommand{\bK}{\mathbf{K}} \newcommand{\bl}{\mathbf{l}}\newcommand{\bL}{\mathbf{L}} \newcommand{\bm}{\mathbf{m}}\newcommand{\bM}{\mathbf{M}} \newcommand{\bn}{\mathbf{n}}\newcommand{\bN}{\mathbf{N}} \newcommand{\bo}{\mathbf{o}}\newcommand{\bO}{\mathbf{O}} \newcommand{\bp}{\mathbf{p}}\newcommand{\bP}{\mathbf{P}} \newcommand{\bq}{\mathbf{q}}\newcommand{\bQ}{\mathbf{Q}} \newcommand{\br}{\mathbf{r}}\newcommand{\bR}{\mathbf{R}} \newcommand{\bs}{\mathbf{s}}\newcommand{\bS}{\mathbf{S}} \newcommand{\bt}{\mathbf{t}}\newcommand{\bT}{\mathbf{T}} \newcommand{\bu}{\mathbf{u}}\newcommand{\bU}{\mathbf{U}} \newcommand{\bv}{\mathbf{v}}\newcommand{\bV}{\mathbf{V}} \newcommand{\bw}{\mathbf{w}}\newcommand{\bW}{\mathbf{W}} \newcommand{\indep}{\perp \!\!\! \perp} \newcommand{\bx}{\mathbf{x}}\newcommand{\bX}{\mathbf{X}} \newcommand{\by}{\mathbf{y}}\newcommand{\bY}{\mathbf{Y}} \newcommand{\bz}{\mathbf{z}}\newcommand{\bZ}{\mathbf{Z}} \newcommand{\balpha}{\boldsymbol{\alpha}}\newcommand{\bAlpha}{\boldsymbol{\Alpha}} \newcommand{\bbeta}{\boldsymbol{\beta}}\newcommand{\bBeta}{\boldsymbol{\Beta}} \newcommand{\bgamma}{\boldsymbol{\gamma}}\newcommand{\bGamma}{\boldsymbol{\Gamma}} \newcommand{\bdelta}{\boldsymbol{\delta}}\newcommand{\bDelta}{\boldsymbol{\Delta}} \newcommand{\bepsilon}{\boldsymbol{\epsilon}}\newcommand{\bEpsilon}{\boldsymbol{\Epsilon}} \newcommand{\bzeta}{\boldsymbol{\zeta}}\newcommand{\bZeta}{\boldsymbol{\Zeta}} \newcommand{\beeta}{\boldsymbol{\eta}}\newcommand{\bEta}{\boldsymbol{\Eta}} % \beta already taken \newcommand{\btheta}{\boldsymbol{\theta}}\newcommand{\bTheta}{\boldsymbol{\Theta}} \newcommand{\biota}{\boldsymbol{\iota}}\newcommand{\bIota}{\boldsymbol{\Iota}} \newcommand{\bkappa}{\boldsymbol{\kappa}}\newcommand{\bKappa}{\boldsymbol{\Kappa}} \newcommand{\blambda}{\boldsymbol{\lambda}}\newcommand{\bLambda}{\boldsymbol{\Lambda}} \newcommand{\bmu}{\boldsymbol{\mu}}\newcommand{\bMu}{\boldsymbol{\Mu}} \newcommand{\bnu}{\boldsymbol{\nu}}\newcommand{\bNu}{\boldsymbol{\Nu}} \newcommand{\bxi}{\boldsymbol{\xi}}\newcommand{\bXi}{\boldsymbol{\Xi}} \newcommand{\bomikron}{\boldsymbol{\omikron}}\newcommand{\bOmikron}{\boldsymbol{\Omikron}} \newcommand{\bpi}{\boldsymbol{\pi}}\newcommand{\bPi}{\boldsymbol{\Pi}} \newcommand{\brho}{\boldsymbol{\rho}}\newcommand{\bRho}{\boldsymbol{\Rho}} \newcommand{\bsigma}{\boldsymbol{\sigma}}\newcommand{\bSigma}{\boldsymbol{\Sigma}} \newcommand{\btau}{\boldsymbol{\tau}}\newcommand{\bTau}{\boldsymbol{\Tau}} \newcommand{\bypsilon}{\boldsymbol{\ypsilon}}\newcommand{\bYpsilon}{\boldsymbol{\Ypsilon}} \newcommand{\bphi}{\boldsymbol{\phi}}\newcommand{\bPhi}{\boldsymbol{\Phi}} \newcommand{\bchi}{\boldsymbol{\chi}}\newcommand{\bChi}{\boldsymbol{\Chi}} \newcommand{\bpsi}{\boldsymbol{\psi}}\newcommand{\bPsi}{\boldsymbol{\Psi}} \newcommand{\bomega}{\boldsymbol{\omega}}\newcommand{\bOmega}{\boldsymbol{\Omega}} \newcommand{\nA}{\mathbb{A}} \newcommand{\nB}{\mathbb{B}} \newcommand{\nC}{\mathbb{C}} \newcommand{\nD}{\mathbb{D}} \newcommand{\nE}{\mathbb{E}} \newcommand{\nF}{\mathbb{F}} \newcommand{\nG}{\mathbb{G}} \newcommand{\nH}{\mathbb{H}} \newcommand{\nI}{\mathbb{I}} \newcommand{\nJ}{\mathbb{J}} \newcommand{\nK}{\mathbb{K}} \newcommand{\nL}{\mathbb{L}} \newcommand{\nM}{\mathbb{M}} \newcommand{\nN}{\mathbb{N}} \newcommand{\nO}{\mathbb{O}} \newcommand{\nP}{\mathbb{P}} \newcommand{\nQ}{\mathbb{Q}} \newcommand{\nR}{\mathbb{R}} \newcommand{\nS}{\mathbb{S}} \newcommand{\nT}{\mathbb{T}} \newcommand{\nU}{\mathbb{U}} \newcommand{\nV}{\mathbb{V}} \newcommand{\nW}{\mathbb{W}} \newcommand{\nX}{\mathbb{X}} \newcommand{\nY}{\mathbb{Y}} \newcommand{\nZ}{\mathbb{Z}} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cW}{\mathcal{W}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \DeclareMathOperator*{\argmax}{argmax~} \DeclareMathOperator*{\argmin}{argmin~} \DeclareMathOperator*{\Tr}{Tr} \DeclareMathOperator*{\Bias}{Bias} \DeclareMathOperator*{\Var}{Var} \newcommand{\Perp}{\perp\!\!\! \perp} \let\dsad\d \renewcommand\d{\, \mathrm{d}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Eb}[1]{\mathbb{E} \left[ #1 \right]} \newcommand{\F}{\mathcal{F}} \newcommand{\X}{\mathcal{X}} \newcommand{\vocab}[1]{\textbf{\color{blue} #1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\inner}[2]{ \left \langle #1, #2 \right \rangle } \newcommand{\ibra}[1]{\llbracket #1 \rrbracket} \newenvironment{nalign}{ \begin{equation} \begin{aligned} }{ \end{aligned} \end{equation} \ignorespacesafterend } \newcommand{\icol}[1]{ \left(\begin{smallmatrix}#1\end{smallmatrix}\right)% } \newcommand{\cc}[1]{\overline{#1}} \newcommand{\tr}[1]{\text{tr} \left( #1 \right)} \newcommand{\aff}{\text{aff} \,} \newcommand{\ri}{\text{ri} \, } \newcommand{\rb}{\text{rb} \, } \newcommand{\cl}{\text{cl} \, } \newcommand{\conv}{\text{conv} \, } \newcommand{\irow}[1]{ \begin{smallmatrix}(#1)\end{smallmatrix}% } \newcommand\abs[1]{\left \lvert #1 \right \rvert} $$

Goal

I want to take a closer look at the convex optimization algorithms in the case when the optimized function is convex, but non-smooth. Previously I looked at some convergence properties of subgradient-based optimization algorithms; these included Subgradient Method, Proximal-Gradienth Method and its special case – Projected Gradient method. Now it is time to get a feel how these perform in practise.

We will be concerned with a simple multivariate logistic regression problem with L1 regularization. We assume three classes are present in the data, and our goal is to discriminate between them. To keep things visualizable, we assume that the input space is 3D, but the input space contains a redundant feature that is the sum of the first two (and our “real” data lives in a 2D manifold). The goal here is to recover a good solution with ignoring this redundant feature. Since Gradient Descent technically can only work with differentiable functions, it will be interesting to see how it fares against specialized subgradient methods when the non-differentiable L1 regularization is in the objective.

Defining the model

We use a Logistic Regression method and are concerned with minimizing the negative log-likelihood (NLL) of the Bernoulli distribution specified by our model. In particular, we assume $n$ dataponts, $m$ features for each datapoint (these include the bias in feature representation, so all future references to inputs assume augmented input vectors) and $k$ classes. We denote the random variable $X$ to represent the input sample, and a random variable $Y \mid X$ to denote the label of the sample given the data. We are concerned with learning the conditional probability distribution $P(Y = y \mid \bx)$, of the classes given an input, defined by the Bernoulli distribution parameterized by the sigmoid with linear model’s outputs

\[\begin{equation} \begin{split} P(y \mid \bx) := \sigma(f(\bx))^{y} (1 - \sigma(f(\bx))^{1 - y}, \end{split} \end{equation}\]

where $f$ is the model with the weights of interest $\bw$ defined as $f(\bx) := \sigma(\bw^\intercal \bx)$.

Non-regularized objective

We first derive the loss objective through minimizing the negative log-likelhood of the datalabels given the model parameters $\bw$ and data as follows:

\[\begin{equation} \begin{split} \cL(\bw) &= -\log \left ( \prod_i^n P(y_i \mid \bx_i) \right ) \\ &= -\log \left ( \prod_i f(\bx)^{y} (1 - f(\bx)^{1 - y} \right ) \\ &= \sum_i -y_i \log \left(f(\bx_i) \right) - (1-y_i) \log(1 - f(\bx_i)) . \end{split} \end{equation}\]

We can now establish that the loss is a convex function in the weights $\bw$.

Claim The loss $\cL(\bw)$ is a convex function in the weights $\bw$.

Proof. We show that this holds for a single datapoint (the convexity for the sum follows immediately as a consequence). Note that $\nabla_{\bw} \cL(\bw) = - \bx y ( 1 - f(\bx)) + \bx (1 - y) f(\bx)$ and $\nabla^2_\bw \cL(\bw) = \bx f(\bx) (1- f(\bx)) \bx^\intercal = f(\bx) (1- f(\bx))\bx \bx^\intercal$ , which is PSD since $\bv^\intercal f(\bx)(1 - f(\bx)) \bx \bx^\intercal \bv= f(\bx) (1 - f(\bx)) (\bv^\intercal \bx)^2 \geq 0$.

Data

We simply take two 3D Gaussians that live in a 2D space, but have their third “redundant” feature be a linear combination of the first two.

Figure 1. Full space of the data.

Figure 2. "True" manifold on which the informative part of the data lives.

Loss landscape

It will be insightful to inspect the optimization procedures through the iterates that their produce. Since we have more variables than reasonably visualizable, I only show the second and third features, we handle the other two variables as follows: I fix the bias $w_0 =0$, and take such a feature $w_3$ that has the smallest error. This procedure does not result in a convex loss (though it should be considered as a infimal projection + projection on the bias slice, so this could be wrong), but it is good enough, as we here are only interested in the two informative features (well bias too, but I had to make a compromise).

Figure 3. "Collapsed" logistic regression loss landscape: technically we have four variables (bias and three weight variables), but to make the visualization feasible, we only look at one slice of bias, and take, quite unconventionally, the minimum possible $w_3$ feature.

Choosing a good constant learning rate for GD

As we have seen before, GD converges only if the learning rate is sufficiently small, dictated by the L-smoothness constant of the objective. In order to find the constant for our objective, we could notice that the objective is twice-differentiable, and so we could use the fact that for L-smooth twice-differentiable functions it holds that $\nabla^2 f(\bx) \preceq L \nI$, i.e. the maximal eigenvalue is bounded by $L$. Alternatively, we could ignore the twice-differentiability, and instead find the constant in an ad-hoc way using the definition. Here we find the Lipshitz constant by bounding the Hessian’s eigenvalues.

It is useful to take a moment to cover some basic facts about L-continuity and smoothness.

Claim 1 (Some facts about Lipshitzidness)

A bounded and continuous function $f$ need not be L-continuous;
For a differentiable $f: \R^n \to \R$: $\norm{\nabla f(\bx)} \leq L \quad \forall \bx \Leftrightarrow f$ is Lipshitz-continuous;

Proof idea:

: see a counterexample here.
: “$\Rightarrow$”: For $\bx, \by$ take $g(t) := f(\bx + t(\by - \bx))$, then note that by Fundamental Theorem of Calculus $f(\by) - f(\bx) = \int_0^1 g’(t) \d t$ holds. Then apply absolute function, bound it using Cauchy-Schwarz and get the result. “$\Leftarrow$”: Write out the L-continuity implication and note that for all sequences converging to $\bx$ it holds $\lim_{\by \to \bx} \abs{f(\by) - f(\bx)} / \norm{\by - \bx} \leq L $, where the left side upper-bounds the norm of the gradient $\norm{\nabla f(\bx)}$. (TODO: this argument is a bit sketchy, should make it more formal).

We will also make use of the following fact.

Claim 2 For a PSD real matrix $A$ it holds that

\[\begin{equation} \begin{split} \bv^\intercal A \bv \leq c \norm{\bv}^2 \Leftrightarrow \lambda_{\text{max}}(A) \leq c, \end{split} \end{equation}\]

and so

\[\begin{equation} \begin{split} \max \left \{ \frac{\norm{A \bv}}{\norm{\bv}} \mid \bv \neq \bzero \right \} = \lambda_{\text{max}}(A). \end{split} \end{equation}\]

Proof. Write $A = Q\Sigma Q^\intercal$, where $\Sigma$ is diagonal with non-negative entries, and $Q$ orthonormal matrix, express $\bv$ in the $Q$ basis, use the fact that $\bq_i^\intercal \bq_j = \delta_{ij}$, and bound the coefficients of the expression in this basis by the maximal eigenvalue.

We have showed before that the Hessian of the objective is PSD and reads $\nabla_\bw f(\bx)^2 = f(\bx) (1- f(\bx))\bx \bx^\intercal$, then using the Claim 2 we can find the maximal eigenvalue as follows:

\[\begin{equation} \begin{split} \bv^\intercal \nabla_\bw f(\bx)^2 \bv &= f(\bx) (1- f(\bx)) \bv^\intercal \bx \bx^\intercal \bv \\ &= f(\bx) (1- f(\bx)) (\bv^\intercal \bx)^2 \\ &\leq f(\bx) (1- f(\bx)) \norm{\bv}^2 \norm{\bx}^2 \\ & \leq 0.5 \norm{\bx}^2 \norm{\bv}^2, \end{split} \end{equation}\]

and thus the maximal eigenvalue $\lambda_{\text{max}}(\nabla^2 f(\bx))$ is bounded by $0.5 \norm{\bx}^2$. Of course this only holds only for a single datapoint. If we average out the datapoints, we will need to throw in a scaling factor to account for that. However, I will keep the loss of datapoints as a sum of the losses per datapoints, so the maximal eigenvalue will be scaled down by the number of datapoints.

Sanity check: optimizing non-regularized objective through Gradient Descent

To make sure our initial optimization procedure is sane, we run the GD on the non-regularized objective for 1,000 iterations with the learning rate $1 / L$, where $L$ is the Lipshitz constant. We could have used a bigger learning rate, say $2/L$, but since we did most of our analysis with $1/L$, we will stick with this.

We illustrate the steps the GD takes on the funky non-convex fixed-bias and minimal $w_3$ feature loss landscape, in the following figure.

Figure 4. GD iterates on the loss landscape with fixed (optimal) bias and minimal "redundant" feature $w_3$.

One thing to note here is that we do not take steps orthogonal to the level set, exactly due to the fact that we collapse the results by taking the minimum.

One neat thing about GD on convex problems with an appropriate learning rate, is that the method converges. Note in Figure 5, even for values close to the optimum, the method continues to improve function values. We will see this not to be the case for subgradient method.

Figure 5. Zoomed in trajectory of GD iterates showing that GD converges.

Interestingly, the optimal solution turned out to be ~-0.00150, which is small, but non-zero. Let’s see if adding regularization will “fix” this.

Adding L1 regularization

We augment the objective with a L1 regularization term, which is visually displayed in Figure 6:

\[\begin{equation} \begin{split} \cL(\bw) &= \sum_i -y_i \log \left(f(\bx_i) \right) - (1-y_i) \log(1 - f(\bx_i)) + \lambda \norm{\bw}_1 . \end{split} \end{equation}\]

Figure 6. Loss landscape (in an analogous setting as in Figure 4) of regularized logistic regression.

As we have seen before, L1 regularization induces sparsity, and the hope here is that GD on this loss would result in a zero vector in the redundant dimension $w_3$. There is a caveat, however – the loss is not differentiablle everywhere. GD cannot be applied in this case.

To address this issue, we could fall back to subgradient method. This method is similar to GD, except for it replaces any non-differentiable points by any vector in the subgradient at these points. Of course this is only defined for convex functions, which always have a subgradient. However, if we are not careful in the way that we choose the learning rate, the method will now converge. The point here, which will also be illustrated, is that the learning rate needs to be decayed in order for the method to converge. Disappointingly, this leads to a slow method, that converges to optimal function value with a rate of $\cO(1/\sqrt{n})$ in iterations $n$.

Figure 7. Subgradient method iterates under regularized logistic regression.

If we ignore this caveat and keep the learning rate fixed, the method seemingly approaches an optimal value (Figure 7), but if we take a closer look (Figure 8), the iterates start jumping around a region without stop, suggesting that a fixed learning rate is not appropriate here.

Figure 8. Closer look into subgradient method iterates in regularized setting, where the iterates visibly do not converge.

Addressing the shortcomings of the subgradient method

Very simple solution that leads to a faster convergence rate and allows to remain in the fixed learning rate regime is the proximal gradient method. This is the method that allows for weight vector values to be exactly zero. I will implement this in the future work.