Convergence of (Stochastic) Gradient Descent

Thu Feb 6, 2020 · 1857 words · 10 min

Stochastic gradient descent is the fundamental work horse of deep learning. Therefore it's worth taking a deeper look at at various provable properties of gradient descent algorithms.

Preliminary Definitions

Let a function $f: \mathbf{R}^n \rightarrow \mathbf{R}$ be $L$-Lipschitz with respect to an arbitrary metric (we'll be using norm) if $$||f(x) - f(y)||_2 \le L||x-y||_2 $$ Let the function be convex if $$f(\lambda x + (1 − \lambda)y) ≤ \lambda f(x) + (1 − \lambda)f(y)$$ for any $x,y \in \mathbf{R}$ and $0\le \lambda \le 1$.

Analysis of Full Gradient Descent

Gradient Descent

Gradient descent is recursively defined by $x_{i+1} = x_i - \alpha \nabla f(x_i)$. $f(x_i)$ is the loss function over all the data for the model parameters $x_i$. In other words $f(x_i)=\frac{1}{n} \sum_{j=0}^n \nabla_j f(x_i)$. Furthermore let us define the global minimizer $x^* = \argmin_{x} f(x)$.

Lemma 1: Convex and Lipschitz Gradient Inequality

Given a function $f: \mathbf{R}^d \rightarrow \mathbf{R}$ which is convex, and whose gradient is $L$-Lipschitz ($|\nabla f(x) - \nabla f(y)|_2 \leq L |x-y|_2$). $$f(y) \leq f(x) + \langle\nabla f(x), y-x\rangle + \frac{L}{2} |x-y|_2^2$$

Proof:

Let us define a new function $g(\epsilon) = f(x + \epsilon(y-x))$. Using the fundamental theorem of calculus we know $$g(1) = g(0) + \int_0^1 \nabla g(\epsilon) d\epsilon $$ Rewriting in terms of $f$ and taking derivative we get:

$\small \begin{aligned} f(y) &= f(x) + \int_0^1 \nabla \langle f(x + \epsilon(y-x)), y-x \rangle d\epsilon \\ &=f(x) + \langle \nabla f(x), y-x \rangle + \int_0^1 \nabla \langle f(x + \epsilon(y-x)) - \nabla f(x), y-x \rangle d\epsilon && \text{Add and subtract } \nabla f(x) \text{ term} \\ &\le f(x) + \langle \nabla f(x), y-x \rangle + \int_0^1 \nabla |f(x + \epsilon(y-x)) - \nabla f(x)|_2 |y-x|_2 d\epsilon && \text{Cauchy–Schwarz inequality} \\ &\le f(x) + \langle \nabla f(x), y-x \rangle + \int_0^1 L|\epsilon(y-x)|_2 |y-x|_2 d\epsilon && \text{Lipschitz} \\ &= f(x) + \langle\nabla f(x), y-x\rangle + \frac{L}{2} |x-y|_2^2 && \text{Evaluate Integral} \end{aligned}$

Theorem 1: Convergence of Gradient Descent (1)

Given a convex function $f: \mathbf{R}^d \rightarrow \mathbf{R}$ with an $L$-Lipschitz gradient, then iterating $k$ times over the gradient descent algorithm with $\alpha = \frac{1}{L}$ satisfies inequality. $$ f(x_k) \le f(x^*) + \frac{|x_0 - x_k|_2^2}{2k\alpha} $$ In other words to find a $\epsilon$ optimal solution ($||x_0 - x_k|_2^2| \le \epsilon$), we need at most $\frac{L|x_0 - x_k|_2^2}{\epsilon}$ steps.

Proof:

The proof trivially follows given the previous lemma

$\small \begin{aligned} f(x_{i+1}) &\le f(x_i) + \langle\nabla f(x_i), x_{i+1}-x_i\rangle + \frac{L}{2} |x_{i+1}-x_i|_2^2 \\ &= f(x_i) + \langle\nabla f(x_i), - \alpha \nabla f(x_i)\rangle + \frac{L}{2} |- \alpha \nabla f(x_i)|_2^2 && \text{Plug in GD Definition} \\ &= f(x_i) - \alpha |\nabla f(x_i)|_2^2 + \frac{L\alpha^2}{2} |\nabla f(x_i)|_2^2 \\ &= f(x_i) - \alpha (1-\frac{L\alpha}{2}) |\nabla f(x_i)|_2^2 \\ &\le f(x_i) - \frac{\alpha}{2}|\nabla f(x_i)|_2^2 && \text{Given that }L\alpha \le 1\\ \end{aligned}$

We already have an interesting result; gradient descent optimizes the objective function $f$ in a monotonic fashion. In other-words every step taken (given correct $\alpha$) will produce a smaller and smaller loss. But we haven't proven our theorem yet. Note that since $f$ is convex we have the following inequality:

$f(x_i) \le f(x^*) + \langle \nabla f(x_i), x_i - x^* \rangle$

Utilizing the previous two equations:

$\small \begin{aligned} f(x_{i+1}) &\le f(x_i) - \frac{\alpha}{2}|\nabla f(x_i)|_2^2 \\ &\le f(x^*) + \langle \nabla f(x_i), x_i - x^* \rangle - \frac{\alpha}{2}|\nabla f(x_i)|_2^2 && \text{Plug in Convexity Bound}\\ &= f(x^*) + \frac{1}{\alpha}\langle x_i - x_{i+1}, x_i - x^* \rangle - \frac{1}{2\alpha}|x_i - x_{i+1}|_2^2 && \nabla f(x_i) = \frac{1}{\alpha} (x_i - x_{i+1})\\ &= f(x^*) + \frac{1}{2\alpha} (|x_i-x^*|_2^2 - |x_{i+1}-x_*|_2^2) && \text{Expand out/Complete the Square} \end{aligned}$

Notice that within our bound we have subtraction between two atoms which utilize adjacent elements in our GD sequence. We can use some type of telescoping sum trick to get our bound where we need it to be.

$\sum_{i=0}^{k-1} \left[f(x_i+1) - f(x)\right] \le \frac{1}{2\alpha}\left[|x_0-x^{*}|_{2}^{2} - |x_{k}-x_*|_2^2\right] \le \frac{|x_0-x^*|_2^2}{2\alpha}$

Also notice $\frac{|x_0-x^*|_2^2}{2\alpha}$ is the upper bound on the delta error that we have, and since the objective function $f$ sequence is weakly monotonically decreasing, we can write a very union bound-esque bound.

$\left[f(x_k) - f(x^*)\right] \le \frac{|x_0-x^*|_2^2}{2k\alpha}$

Giving us our result final results:

$f(x_k) \le f(x^*) + \frac{|x_0-x^*|_2^2}{2k\alpha}$

The result itself is not super impressive, the bound is quite weak, but nevertheless it shows linear convergence of GD. Furthermore the result hints at a methodology for selecting the optimal learning rate $\alpha$.

Analysis of Stochastic Gradient Descent

The analysis above that we did for gradient descent is not directly applicable to stochastic gradient descent. In SGD, we do not compute the gradient $\nabla f(x_i)$ over all the data but instead over a small subset which provides us with an unbiased estimator $\mathbb{E}_{v \sim D} \left[ \nabla f_v(x_i) \right]$. We call the $v \in \mathbf{R}^d$ the sampling vector. The classical way of computing $\nabla f_v(x_i)$ is selecting a small subset of examples from the total dataset and computing the gradient over the batch.

Our SGD algorithm is recursively defined as:

$x_{i+1} = x_{i} - \alpha_i \nabla f_{v_i}(x_i)$

where $\alpha_i$ is the learning rate at the ith step of the process.

We need to state a couple of assumptions about our stochastic loss function $f_v(\cdot)$ in order to get any serious result.

Assumption 1: Expected Smoothness of $f_v(\cdot)$

Similar to the Lipschitz inequality we had in the full gradient descent setting, we state that $f_v(\cdot)$ is $L$-smooth if it satisfies the following inequality.

$$$ \mathbb{E}_{D}\left[|\nabla f_v(x) - \nabla f_v(x^*)\right|^2_2] \le 2L\left[f(x)-f(x^*)\right] $$ This is slightly different than the Lipschitz inequality due to the lack of metric on the right side of the bound. But intuitively it makes sense.$

Assumption 2: Bounded Gradient Noise

Another assumption we need to make is that our gradient noise is finite. This is a realistic assumption as there is no expectation that we can minimize a stochastic loss function with non-informative gradients, using SGD.

$$$ \sigma^2 = \mathbb{E}_{D}\left[|\nabla f_v(x)|^2_2\right] $$ is finite.$

Of course we can also bound $\mathbb{E}_{D}\left[|\nabla f_v(x)|^2_2\right] \le 4L\left[f(x)-f(x^*)\right] + 2\sigma^2$ using our $L$-smoothness condition.

$\begin{aligned} \mathbb{E}_{D}\left[|\nabla f_v(x)|^2_2\right] &= \mathbb{E}_{D}\left[|\nabla f_v(x)-\nabla f_v(x^*)+\nabla f_v(x^*)|^2_2\right] \\ &\le 2\left[\mathbb{E}_{D}\left[|\nabla f_v(x)-\nabla f_v(x^*)|_2^2\right]+\mathbb{E}_{D}\left[\nabla f_v(x^*)|^2_2\right]\right] && \text{Jensen's Inequality} \\ &\le 2\left[2L\left[f(x)-f(x^*)\right]+\sigma^2]\right] && \text{Bounded Gradient/Smoothness}\\ &= 4L\left[f(x)-f(x^*)\right] + 2\sigma^2 \end{aligned}$

Theorem 2: Convergence of SGD

Let $f: \mathbf{R}^d \rightarrow \mathbf{R}$ has a $\mu$-Lipschitz gradient, is convex, and is $L$-smooth with respect to $\left[f,D\right]$. Furthermore select $0 < \alpha_i < \frac{1}{2L}$ then

$\mathbb{E}\left[|x_k - x^*|_2^2\right] \le (1-\alpha\mu)^k |x_0 - x^*|_2^2 + \frac{2\alpha\sigma^2}{\mu}$

Proof

For verbosity reasons let $r_k = |x_k - x^*|_2^2$. Then

$\begin{aligned} r_{k+1} &= |x_k-x^*-\alpha \nabla f_{v_k}(x_k)|_2^2 && \text{Definition of SGD} \\ &= |r_k|_2^2 - 2\alpha\langle r_k,\nabla f_{v_k}(x_k)\rangle + \alpha^2 |\nabla f_{v_k}(x_k)|_2^2 && \text{Factor Out} \end{aligned}$

Now let's take the expectation of $|r_{k+1}|_2^2$

$\begin{aligned} \mathbb{E}_{D}\left[r_{k+1}\right] &= |r_k|_2^2 - 2\alpha\langle r_k,\nabla f(x_k)\rangle + \alpha^2 \mathbb{E}_{D}\left[|\nabla f_{v_k}(x_k)|_2^2 \right] && \nabla f_{v_k}(x_k) \text{ is unbiased estimator} \\ &\le (1-\alpha\mu) \mathbb{E}\left[r_k\right]^2_2 + 2\alpha(f(x_k)-f(x^*)) + \alpha^2 \mathbb{E}_D \left[\nabla f_{v_k}(x_k)\right]&& \text{Convex and Lipschitz Gradient Inequality}\\ \end{aligned}$

The very last step looks complicated but it's actually pretty straightforward. We're applying the Convex and Lipschitz Gradient Inequality bound we showed in the full gradient descent section. We can now take another expectation:

$\begin{aligned} \mathbb{E}\left[r_{k+1}\right]^2 &\le (1-\alpha\mu) \mathbb{E}\left[r_k\right]^2_2 + 2\alpha(\mathbb{E}\left[f(x_k)-f(x^*)\right]) + \alpha^2 \mathbb{E}_D \left[\nabla f_{v_k}(x_k)\right]\\ &\le (1-\alpha\mu) \mathbb{E}\left[r_k\right]^2_2 + 2\alpha(2\alpha L - 1)(\mathbb{E}\left[f(x_k)-f(x^*)\right]) + \alpha^2 \sigma^2 && \text{Bounded Gradient Noise}\\ &\le (1-\alpha\mu) \mathbb{E}\left[r_k\right]^2_2 + 2\alpha^2\sigma^2 && 2\alpha L \le 1 \end{aligned}$

We can apply a similar summing trait we used for full gradient descent.

$\begin{aligned} \mathbb{E}\left[r_{k+1}\right]^2 &\le (1-\alpha\mu)^k |r_0|^2_2 + 2\sum_{j=0}^{k-1} (1-\alpha\mu)^j\alpha^2\sigma^2 \\ &\le (1-\alpha\mu)^k |r_0|^2_2 + \frac{2\alpha\sigma^2}{\mu} && \text{Evaluate Sum} \\ \\ \\ \mathbb{E}\left[|x_k - x^*|_2^2\right] &\le (1-\alpha\mu)^k |x_0 - x^*|_2^2 + \frac{2\alpha\sigma^2}{\mu} && \text{QED} \end{aligned}$

All done. We found a way to bound the the error of SGD, by the number of steps, learning rate, gradient noise, and smoothness of the function. It's quite beautiful that our bound is a function of all the variables we intuitively would think effects convergence rate.