You are currently browsing the category archive for the ‘Probability and Statistics’ category.

I use stochastic gradient and stochastic approximation algorithms regularly, and I often find it surprisingly difficult to locate basic results on the convergence of these algorithms, despite multiple books written on the subject. For example, consider a convex function {f: \mathbb R^n \rightarrow \mathbb R}. Denote the set of minima of {f} by {X^* := \arg \min_{x \in \mathbb R^n} \{f(x)\}}. If {X^*} is not empty, then we know that it is convex set, but we do not know a priori if {X^*} consists of a single point. Denote the subdifferential of {f} at {x}, i.e., the set of subgradients of {f} at {x}, by {\partial f(x)}. A stochastic subgradient algorithm is an algorithm of the form

\displaystyle x_0 \in \mathbb R^n, \;\; x_{k+1} = x_k - \gamma_k d_{k+1}, k \geq 0, \ \ \ \ \ (1)

with {g_{k+1} := E[d_{k+1} | \mathcal F_k] \in \partial f(x_k)}, for {\mathcal F_k = \sigma(x_0; d_l, 1 \leq l \leq k)}, and {\{\gamma_k\}} a sequence of nonnegative stepsizes. We can rewrite {d_k = g_k + \epsilon_k}, for {K \geq 1}, where {\{\epsilon\}_{k \geq 1}} is a martingale difference sequence with respect to {\{\mathcal F_k\}_{k \geq 0}}, i.e., {E[\epsilon_k | \mathcal F_{k-1}] = 0}, a.s., for all {k \geq 1}. We have the following theorem.

Theorem 1 Suppose that the set of minima of {X^*} of {f} is non empty, and that the stochastic subgradients satisfy

\displaystyle \sup_{k \geq 0} E[\|d_{k+1}\|^2|\mathcal F_k] < K, \;\; \text{for some } K < \infty.

Moreover, assume that {\sum_{k=0}^\infty \gamma_k = + \infty}, {\sum_{k=0}^\infty \gamma^2_k < + \infty}. Then the sequence of iterates {\{x_k\}_{k \geq 0}} following (1) converges almost surely to some minimum {x^* \in X^*}.

In various papers and books and books on stochastic approximations, e.g. [1-4], one can find very general theorems on the convergence of algorithms such as (1), including various additional error terms. Still, trying to apply these general theorems to the simple situation considered here, you might find that you need to make additional assumptions, for example that {f} is continuously differentiable, that {X^*} is a single point, and/or that you know somehow a priori that the sequence {\{x_k\}_{k \geq 0}} is bounded almost surely. The assumptions on {f} and {X^*} might not hold here (consider for example {f(x) = \max\{-x,0,x-1\}}), and the extra work required to prove that {x_k} is bounded is essentially of the same order as proving the theorem from scratch in this case. This is not to say that Theorem 1 cannot be found in the literature. For example, the proof below adapts one found for a somewhat more specific context in [5]. But in my opinion a good book on stochastic approximation algorithms should reference such a basic and widely applicable theorem more clearly. The proof presented here uses the following useful version of the supermartingale convergence theorem, see [6].

Theorem 2 (Supermartingale Convergence Theorem) Let {Y_k, X_k, Z_k, k = 0,1,\ldots} be three sequences of random variables and let {\{\mathcal F_k\}_{k \geq 0}} be a filtration (i.e., sigma algebras such that {\mathcal F_k \subset \mathcal F_{k+1}}). Suppose that

  1. The random variables {Y_k, X_k, Z_k} are nonnegative and {\mathcal F_k}-measurable.
  2. For each {k}, we have {E[Y_{k+1}|\mathcal F_k] \leq Y_k - X_k + Z_k}.
  3. There holds {\sum_{k = 0}^\infty Z_k < \infty}.

Then, we have {\sum_{k=0}^\infty X_k < \infty}, and the sequence {Y_k} converges to a nonnegative random variable {Y}, with probability {1}.

Proof: (of the stochastic subgradient convergence theorem) For {y \in \mathbb R^n} and {g_{k+1} \in \partial f(x_k)}, we have by definition of a subgradient

\displaystyle f(y) \geq f(x_k) + g_{k+1}^T (y-x_k).


\displaystyle \begin{array}{rcl} E[\|x_{k+1}-y\|^2 | \mathcal F_k] &=& E[\|x_k - \gamma_k^T d_{k+1} - y\|^2] \\ &=& E[\|x_k-y\|^2 | \mathcal F_k] - 2 \gamma_k (x_k-y)^T E[d_{k+1} | \mathcal F_k] \\ &+& \gamma_k^2 E[\|d_{k+1}\|^2 | \mathcal F_k] \\ & \leq & E[\|x_k-y\|^2 | \mathcal F_k] - 2 \gamma_k (f(x_k)-f(y)) + \gamma_k^2 K. \end{array}

This inequality is fundamental to study the progress made by many subgradient algorithms. We use it first by taking {y = \bar x^*}, for some {\bar x^* \in X^*}, to get

\displaystyle E[\|x_{k+1}-\bar x^*\|^2 | \mathcal F_k] \leq E[\|x_{k+1}-\bar x^*\|^2 | \mathcal F_k] - 2 \gamma_k (f(x_k)-f(\bar x^*)) + \gamma_k^2 K.

Now by the supermartingale convergence theorem, we deduce that almost surely, {\sum_{k=0}^\infty \gamma_k (f(x_k)-f(\bar x^*)) < \infty} and the sequence {\{\|x_k - \bar x^*\|\}_k} converges almost surely. In particular, {\{x_k\}_k} is bounded almost surely.

The first conclusion implies immediately that

\displaystyle \lim_{k \rightarrow \infty} f(x_k) = f^* = \min_{x \in \mathbb R^n} f(x),

and it only remains to show that {\{x_k\}_k} truly converges to one of the minima, rather than just to the set {X^*}. For this, take a countable set of points {\{x_i^*\}_{i \in I}} in {X^*}, dense in {X^*}. For each such point {x_i^*}, we have as above that {\{\|x_k - x_i^*\|\}_k} converges. Hence with probability {1}, all sequences {\{\|x_k-x^*_i\|\}_k} for {i \in I} converge. Since the sequence {\{x_k\}_k} is bounded, it has at least one limit point. This limit point must be unique in view of the convergence of all the sequences {\{\|x_k-x^*_i\|\}_k}. Hence {\{x_k\}_k} converges. \Box

[1] A. Benveniste, M. Métivier, P. Priouret, Adaptive Algorithms and Stochastic Approximations. Springer, 1990.

[2] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd Edition, 2003.

[3] J. C. Spall, Introduction to Stochastic Search and Optimization. Wiley-Interscience, 2003.

[4] V. S. Borkar Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.

[5] D. P. Bertsekas and A. Nedic and A. E. Ozdaglar, Convex Analysis and Optimization. Athena Scientific, 2003.

[6] J. Neveu, Discete Parameter Martingales, North-Holland, 1975.