Notes on "Learning Sparse Neural Networks through L⁰ Regularization" (2018)

$$ \newcommand{\bzero}{\mathbf{0}} \newcommand{\ba}{\mathbf{a}}\newcommand{\bA}{\mathbf{A}} \newcommand{\bb}{\mathbf{b}}\newcommand{\bB}{\mathbf{B}} \newcommand{\bc}{\mathbf{c}}\newcommand{\bC}{\mathbf{C}} \newcommand{\bd}{\mathbf{d}}\newcommand{\bD}{\mathbf{D}} \newcommand{\be}{\mathbf{e}}\newcommand{\bE}{\mathbf{E}} \newcommand{\bff}{\mathbf{f}}\newcommand{\bF}{\mathbf{F}} \newcommand{\bg}{\mathbf{g}}\newcommand{\bG}{\mathbf{G}} \newcommand{\bh}{\mathbf{h}}\newcommand{\bH}{\mathbf{H}} \newcommand{\bi}{\mathbf{i}}\newcommand{\bI}{\mathbf{I}} \newcommand{\bj}{\mathbf{j}}\newcommand{\bJ}{\mathbf{J}} \newcommand{\bk}{\mathbf{k}}\newcommand{\bK}{\mathbf{K}} \newcommand{\bl}{\mathbf{l}}\newcommand{\bL}{\mathbf{L}} \newcommand{\bm}{\mathbf{m}}\newcommand{\bM}{\mathbf{M}} \newcommand{\bn}{\mathbf{n}}\newcommand{\bN}{\mathbf{N}} \newcommand{\bo}{\mathbf{o}}\newcommand{\bO}{\mathbf{O}} \newcommand{\bp}{\mathbf{p}}\newcommand{\bP}{\mathbf{P}} \newcommand{\bq}{\mathbf{q}}\newcommand{\bQ}{\mathbf{Q}} \newcommand{\br}{\mathbf{r}}\newcommand{\bR}{\mathbf{R}} \newcommand{\bs}{\mathbf{s}}\newcommand{\bS}{\mathbf{S}} \newcommand{\bt}{\mathbf{t}}\newcommand{\bT}{\mathbf{T}} \newcommand{\bu}{\mathbf{u}}\newcommand{\bU}{\mathbf{U}} \newcommand{\bv}{\mathbf{v}}\newcommand{\bV}{\mathbf{V}} \newcommand{\bw}{\mathbf{w}}\newcommand{\bW}{\mathbf{W}} \newcommand{\bx}{\mathbf{x}}\newcommand{\bX}{\mathbf{X}} \newcommand{\by}{\mathbf{y}}\newcommand{\bY}{\mathbf{Y}} \newcommand{\bz}{\mathbf{z}}\newcommand{\bZ}{\mathbf{Z}} \newcommand{\balpha}{\boldsymbol{\alpha}}\newcommand{\bAlpha}{\boldsymbol{\Alpha}} \newcommand{\bbeta}{\boldsymbol{\beta}}\newcommand{\bBeta}{\boldsymbol{\Beta}} \newcommand{\bgamma}{\boldsymbol{\gamma}}\newcommand{\bGamma}{\boldsymbol{\Gamma}} \newcommand{\bdelta}{\boldsymbol{\delta}}\newcommand{\bDelta}{\boldsymbol{\Delta}} \newcommand{\bepsilon}{\boldsymbol{\epsilon}}\newcommand{\bEpsilon}{\boldsymbol{\Epsilon}} \newcommand{\bzeta}{\boldsymbol{\zeta}}\newcommand{\bZeta}{\boldsymbol{\Zeta}} \newcommand{\beeta}{\boldsymbol{\eta}}\newcommand{\bEta}{\boldsymbol{\Eta}} % \beta already taken \newcommand{\btheta}{\boldsymbol{\theta}}\newcommand{\bTheta}{\boldsymbol{\Theta}} \newcommand{\biota}{\boldsymbol{\iota}}\newcommand{\bIota}{\boldsymbol{\Iota}} \newcommand{\bkappa}{\boldsymbol{\kappa}}\newcommand{\bKappa}{\boldsymbol{\Kappa}} \newcommand{\blambda}{\boldsymbol{\lambda}}\newcommand{\bLambda}{\boldsymbol{\Lambda}} \newcommand{\bmu}{\boldsymbol{\mu}}\newcommand{\bMu}{\boldsymbol{\Mu}} \newcommand{\bnu}{\boldsymbol{\nu}}\newcommand{\bNu}{\boldsymbol{\Nu}} \newcommand{\bxi}{\boldsymbol{\xi}}\newcommand{\bXi}{\boldsymbol{\Xi}} \newcommand{\bomikron}{\boldsymbol{\omikron}}\newcommand{\bOmikron}{\boldsymbol{\Omikron}} \newcommand{\bpi}{\boldsymbol{\pi}}\newcommand{\bPi}{\boldsymbol{\Pi}} \newcommand{\brho}{\boldsymbol{\rho}}\newcommand{\bRho}{\boldsymbol{\Rho}} \newcommand{\bsigma}{\boldsymbol{\sigma}}\newcommand{\bSigma}{\boldsymbol{\Sigma}} \newcommand{\btau}{\boldsymbol{\tau}}\newcommand{\bTau}{\boldsymbol{\Tau}} \newcommand{\bypsilon}{\boldsymbol{\ypsilon}}\newcommand{\bYpsilon}{\boldsymbol{\Ypsilon}} \newcommand{\bphi}{\boldsymbol{\phi}}\newcommand{\bPhi}{\boldsymbol{\Phi}} \newcommand{\bchi}{\boldsymbol{\chi}}\newcommand{\bChi}{\boldsymbol{\Chi}} \newcommand{\bpsi}{\boldsymbol{\psi}}\newcommand{\bPsi}{\boldsymbol{\Psi}} \newcommand{\bomega}{\boldsymbol{\omega}}\newcommand{\bOmega}{\boldsymbol{\Omega}} \newcommand{\nA}{\mathbb{A}} \newcommand{\nB}{\mathbb{B}} \newcommand{\nC}{\mathbb{C}} \newcommand{\nD}{\mathbb{D}} \newcommand{\nE}{\mathbb{E}} \newcommand{\nF}{\mathbb{F}} \newcommand{\nG}{\mathbb{G}} \newcommand{\nH}{\mathbb{H}} \newcommand{\nI}{\mathbb{I}} \newcommand{\nJ}{\mathbb{J}} \newcommand{\nK}{\mathbb{K}} \newcommand{\nL}{\mathbb{L}} \newcommand{\nM}{\mathbb{M}} \newcommand{\nN}{\mathbb{N}} \newcommand{\nO}{\mathbb{O}} \newcommand{\nP}{\mathbb{P}} \newcommand{\nQ}{\mathbb{Q}} \newcommand{\nR}{\mathbb{R}} \newcommand{\nS}{\mathbb{S}} \newcommand{\nT}{\mathbb{T}} \newcommand{\nU}{\mathbb{U}} \newcommand{\nV}{\mathbb{V}} \newcommand{\nW}{\mathbb{W}} \newcommand{\nX}{\mathbb{X}} \newcommand{\nY}{\mathbb{Y}} \newcommand{\nZ}{\mathbb{Z}} \newcommand{\cA}{\mathcal{A}} \newcommand{\cB}{\mathcal{B}} \newcommand{\cC}{\mathcal{C}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cH}{\mathcal{H}} \newcommand{\cI}{\mathcal{I}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cK}{\mathcal{K}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cO}{\mathcal{O}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cQ}{\mathcal{Q}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cS}{\mathcal{S}} \newcommand{\cT}{\mathcal{T}} \newcommand{\cU}{\mathcal{U}} \newcommand{\cV}{\mathcal{V}} \newcommand{\cW}{\mathcal{W}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}} \newcommand{\cZ}{\mathcal{Z}} \DeclareMathOperator*{\argmax}{argmax~} \DeclareMathOperator*{\argmin}{argmin~} \DeclareMathOperator*{\Tr}{Tr} \DeclareMathOperator*{\Bias}{Bias} \DeclareMathOperator*{\Var}{Var} \newcommand{\Perp}{\perp\!\!\! \perp} \let\dsad\d \renewcommand\d{\, \mathrm{d}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Eb}[1]{\mathbb{E} \left[ #1 \right]} \newcommand{\F}{\mathcal{F}} \newcommand{\X}{\mathcal{X}} \newcommand{\vocab}[1]{\textbf{\color{blue} #1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\inner}[2]{ \left \langle #1, #2 \right \rangle } \newcommand{\ibra}[1]{\llbracket #1 \rrbracket} \newenvironment{nalign}{ \begin{equation} \begin{aligned} }{ \end{aligned} \end{equation} \ignorespacesafterend } \newcommand{\icol}[1]{ \left(\begin{smallmatrix}#1\end{smallmatrix}\right)% } \newcommand{\cc}[1]{\overline{#1}} \newcommand{\tr}[1]{\text{tr} \left( #1 \right)} \newcommand{\aff}{\text{aff} \,} \newcommand{\ri}{\text{ri} \, } \newcommand{\rb}{\text{rb} \, } \newcommand{\cl}{\text{cl} \, } \newcommand{\conv}{\text{conv} \, } \newcommand{\irow}[1]{ \begin{smallmatrix}(#1)\end{smallmatrix}% } \newcommand\abs[1]{\left \lvert #1 \right \rvert} $$

Motivation

The main motivation is to merge conventional regularization with neural nets pruning. The promise of sparse networks, in theory at least, is speed and generalization due to simplified architecture. We usually use $L_1$ and $L_2$ regularization on the weights, which lead to shrinkage of the weights, since these are differentiable and thus easy to optimize. In the extreme lies the $L_0$ “norm”, a non-differentiable function that penalizes non-zero entries regardless of their magnitude. If we could regularize the network using $L_0$ norm, we would be imposing the network to prefer exactly zeros in its network weights. This paper presents a way to achieve this.

Ideal objective

Going with the usual regularized empirical risk minimization approach, we might express the overall “ideal” risk as

\[\begin{equation} \begin{split} \cR^{\text{ideal}}(\btheta) &= \frac{1}{n} \sum_i^n \cL( f(\bx, \btheta), \by ) + \lambda \norm{\btheta}_0 \\ &= \frac{1}{n} \sum_i^n \cL( f(\bx, \btheta), \by ) + \lambda \sum_i^m \nI_{\theta_i \neq 0}, \end{split} \end{equation}\]

and we are interested in the set of paramters $\btheta$ that minimize the overall risk. However, doing so with gradient descent is not feasible due to non-differentiability of the indicator function $\nI$. To go around this problem, we rely on relaxations.

Reparameterizing the objective

The main idea now is to make express the above-defined risk in a probabilistic way. To achieve that, we re-parameterize the $\btheta$ in a more expressive model that still contains the $\cR^{\text{ideal}}$ objective as its subset:

\[\begin{equation} \begin{split} \tilde \btheta := \bz \odot \btheta, \quad \bz = \{ z_i \sim \text{Bern}(\pi_i) \mid z_i \in \{0, 1\} \}_{i=1}^m. \end{split} \end{equation}\]

So essentially we introduce another set of paramters $\bpi$ and draw the boolean masks that indicate whether to block off the weight. In that case we can express the risk as the expected value over draws of these masks as

\[\begin{equation} \begin{split} \cR^{\text{Bern}}(\btheta, \bpi) &= \E_{\bz \sim \text{Bern}(\bpi)}\left[ \frac{1}{n} \sum_i^n \cL( f(\bx, \bz \odot \btheta), \by ) + \lambda \norm{\bz}_0 \right] \\ &= \E_{\bz \sim \text{Bern}(\bpi)}\left[ \frac{1}{n} \sum_i^n \cL( f(\bx, \bz \odot \btheta), \by ) \right] + \lambda \sum_i^m \pi_i, \end{split} \end{equation}\]

where now the regularization objective over $\pi$ is fully differentiable. But this has its own set of problems: 1) the re-parameterized weights $\hat \btheta$ are infeasible to optimize through gradient descent, as those now depend on discrete samples $\bz$; 2) the expectation over the Bernoulli distribution itself contains parameters of interest, but sampling operation is non-differentiable.

Relaxing the objective

To address both of the problems, we relax the discrete Bernoulli distribution with a continuous one, call it $q(\bz \mid \bphi)$, with parameters $\bphi$. There are many continuous distributions that can do the job, so lets ignore the exact relaxation for now, and derive the general objective.

Besides the continuity of the distribution, we additionally impose a constraint that there should be a non-zero probability mass on $q(z_i = 0 \mid \bphi)$, as we want to represent exact zeros. To achieve this, we can simply introduce a gating defined, say, by making use of rectification in the form of $g(x) = \min(1, \max(0, x))$. Given that the relation is continuous, we could pass the relaxed samples through the gating and achieve non-zero mass on the endpoints at zero and one.

Formally, we can represent the new empirical risk with a relaxed distribution in the following setting:

\[\begin{equation} \begin{split} \bs \sim q(\bphi), \quad \bz := g(\bs) = \min(1, \max(0, \bs)), \quad \tilde \btheta := \bz \odot \btheta, \end{split} \end{equation}\] \[\begin{equation} \begin{split} \cR^{\text{relaxed}}(\btheta, \bpi) &= \E_{\bs \sim q(\bphi)}\left[ \frac{1}{n} \sum_i^n \cL( f(\bx, \bz \odot \btheta), \by ) + \lambda \norm{\bz}_0 \right] \\ &= \E_{\bs \sim q(\bphi)}\left[ \frac{1}{n} \sum_i^n \cL( f(\bx, \bz \odot \btheta), \by ) + \lambda \sum_{i}^m \nI_{z_i \neq 0} \right] \\ & \overset{\bs \text{ sampled iid}}{=} \E_{\bs \sim q(\bphi)}\left[ \frac{1}{n} \sum_i^n \cL( f(\bx, \bz \odot \btheta), \by ) \right] + \lambda \sum_{i}^m P(z_i > 0) \\ &= \E_{\bs \sim q(\bphi)}\left[ \frac{1}{n} \sum_i^n \cL( f(\bx, \bz \odot \btheta), \by ) \right] + \lambda \sum_{i}^m 1 - Q(z_i \geq 0), \end{split} \end{equation}\]

where we used the fact (which we actually used before too), that all the elements in $\bs$ are sampled iid from their corresponding relaxed distributions, and have denoted $Q(z_i)$ to denote the CDF of $q(s \mid \bphi)$. This relaxed objective is “nicely” optimizable as long as the relaxed distribution is reparameterizable, and the CDF computable in closed form. Turns out that Binary Concrete (referred to as $\text{BinConcrete}$) is exactly a distribution suitable for the job.

Before jumping into the application of $\text{BinConcrete}$, I quickly recap its derivation, a way to sample from it, and derive its PDF and CDF. All of these will be useful when evaluating the relaxed risk objective.

Short interlude on $\text{BinConcrete}$ distribution (based on Appendix in (missing reference))

We know that a r.v. $X$ follows a $\text{concrete}(\log \balpha, \lambda)$ distribution if $X_i$ is computed using uniform r.v.s $U_i$ and transformed as

\[\begin{equation} \begin{split} X_i = \frac{ \exp \left((\log \alpha_i + U_i) /\lambda \right) }{\sum_j \exp \left((\log \alpha_j + U_j) /\lambda \right)}. \end{split} \end{equation}\]

Before trying to work out the mathematical details, lets think what we want to achieve. We want to backpropogate through an expectation that samples from a distribution with parameters we want to optimize. In this case the distribution is a Bernoulli with two states.

However, this distribution that two degrees of freedom. In the binary case we can achieve a more convenient parameterization with only a single degree of freedom and a single random sample (of course this holds in case we have more variables too, just that in binary case this directly ties to other familiar distributions? I’m not too sure about the motivation behind this yet).

Relaxing the Bernoulli

We can arrive at the relaxed $\text{Bernoulli}$ by following the usual approach of sampling from a $\text{Categorical}$ distribution using Gumbel-Max trick (see previous post) by taking $X$ from a $\text{Bernoulli}(p)$ (s.t. $X_1 + X_2 = 1$), $G_1, G_2 \sim \text{Gumbel}(0,1)$ and expressing $X = \argmax \{ \log p + G_1, \log (1-p) + G_2 \}$. This is the standard form with 2

To simplify this, note that $X_1 = \text{step}(\log p + G_1 > \log (1-p) + G_2)$ (with $\text{step}(x) = 1$ iff $x > 0$), and $X_2$ is fully determined as $X_2 = 1 - X_1$. Thus, we can model $X_1$ in the following way:

\[\begin{equation} \begin{split} X_1 &= \nI_{(\log p + G_1 > \log (1-p) + G_2)} \\ &= \text{step} \left(G_1 - G_2 - (\log (1 - p) - \log p) \right) = \\ &= \text{step}\left (G_1 - G_2 + \log \frac{p}{1-p} \right). \end{split} \end{equation}\]

Denoting $\alpha := \frac{p}{1-p}$, we are only one step away from arriving at the final expression. For that we make use of the following claim.

Claim A difference between Gumbel r.v.s follows a standard Logistic distribution, i.e. $G_1, G_2 \sim \text{Gumbel}(0,1) \Rightarrow G_1 - G_2 \sim \text{Logistic}(0,1)$.

Proof (main steps). Let $D := G_1 - G_2$, then $p(D = d) = \int_{-\infty}^{+\infty} p(G_1 = t) p(G_2 = t - d) \d t $ can be evaluated by the change of variables using $y = e^{-x}$, then arrive at expression of the form of $\int y e^{-cx}$, which can is a standard integral, finally arrive to $ \frac{e^x}{(1 + e^x)^2}$, then notice that this is symmetric, which is exactly the PDF of the standard Logistic.

Thus, we can replace the difference of Gumbels with the standard Logistic r.v., and rewrite

\[\begin{equation} \begin{split} X_1 = \text{step}(L - \log \alpha > 0), \end{split} \end{equation}\]

as the final form of the (exact) way to sample from the Bernoulli. But since our goal was to sample in a differntiable way, using the same logic as in the derivation of $\text{Concrete}$ distribution, we can approximate the $\text{step}$ function with its equivalent – the standard logistic function, which is nothing more than the CFD of the Logistic distribution.

Definition The logistic function $\sigma: \R \to (0, 1), x \mapsto \frac{1}{1 + e^{-x}}$ is defined as the CDF of standard Logistic.

As such, we can relax the discrete $\text{step}$ function with its continuous logistic counterpart. As before, to control the degree of relaxation a temperature $\lambda$ is introduced, and as $\lambda \to 0$ the relaxed variable goes towards its discrete version. Putting these parts together, we arrive at the $\text{BinConcrete}$ distribution, which we can define in terms of its sampling.

Definition Given $\alpha \in (0,+\infty)$, $\lambda > 0$, we say that if a r.v.$L \sim \text{Logistic}(0,1)$, then $X := \sigma( (L + \log \alpha) / \lambda )$ follows a $\text{BinConcrete}(\log \alpha, \lambda)$ distribution.

As a sanity check, suppose $p \to 1, 1-p \to 0$, then $\log \frac{p}{1-p} \to +\infty$, then the sigmoid is 1; if $p \to 0, 1-p \to 1$, then $\log \alpha \to -\infty$, and the sigmoid is a hard 0.

Deriving the PDF

Since we know the change of variables that yield the samples from the Binary Concrete, we can derive its PDF through the change of variable formula by noting that $X$ is a transformed Logistic variable. For reference, I reference the change of variables function in the scalar case:

Theorem A PDF of a transformed random variable $Y = f(X)$, where $f(\cdot)$ is a strictly monotonic function, is given by

\[\begin{equation} \begin{split} p_Y(y) = p_X( f^{-1}(y))\abs{ \frac{\d f^{-1}(y)}{\d y} }, \end{split} \end{equation}\]

i.e. the transformed variable’s density is up to normalization equal to the original variable’s $X$ density at the pre-image of the point.

Denote by $q(l \mid \log \alpha, \lambda)$ the PDF of Binray Concrete at point $l$ with parameters $(\log \alpha, \lambda)$. Since $X$ maps the logistic variable $L$ in a monotonous way, we can apply the change of variables formula directly.

$\begin{equation} \begin{split} q(l \mid \log \alpha, \lambda) = \end{split} \end{equation}$ It is a bit tedious to keep track of all the quantities, but it is straightforward to evaluate as there are no tricks involved.

Deriving the CDF

Now we can integrate the PDF and evaluate the CDF.

Applying gating on $\text{BinConcrete}$ – the $\text{HardConcrete}$ distribution

The HardConcrete is derivable very easily: we simply clamp values less than 0 to 0, bigger than 1 to 1, and end up with a distribution with 3 modes.