I use stochastic gradient and stochastic approximation algorithms regularly, and I often find it surprisingly difficult to locate basic results on the convergence of these algorithms, despite the multiple books written on the subject. For example, consider a convex function $f : \mathbb{R}^n \to \mathbb{R}$. Denote the set of minima of $f$ by $X^*$, and the minimal value by $f^*$. If $X^*$ is not empty, then we know that it is a convex set, but we do not know a priori if $X^*$ consists of a single point. Denote the subdifferential of $f$ at $x$, i.e., the set of subgradients of $f$ at $x$, by $\partial f(x)$. A stochastic subgradient algorithm is an algorithm of the form

$$x_{k+1} = x_k - \gamma_k g_k, \qquad (1)$$

where, conditionally on the past $\mathcal{F}_k = \sigma(x_0, g_0, \ldots, g_{k-1})$, the direction $g_k$ satisfies $\mathbb{E}[g_k \mid \mathcal{F}_k] \in \partial f(x_k)$ and $\mathbb{E}[\|g_k\|^2 \mid \mathcal{F}_k] \le C$ for some constant $C$.
Theorem 1 (Stochastic Subgradient Convergence) Assume moreover that $\sum_{k=0}^{\infty} \gamma_k = \infty$ and $\sum_{k=0}^{\infty} \gamma_k^2 < \infty$. Then the sequence of iterates $(x_k)$ following (1) converges almost surely to some minimum $x^* \in X^*$.
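As a concrete illustration (a minimal sketch of my own, not taken from any of the referenced books), the following Python snippet runs an iteration of the form (1) on the nonsmooth convex function $f(x) = |x_1| + |x_2|$, with subgradients corrupted by zero-mean Gaussian noise and step sizes $\gamma_k = 1/(k+1)$, which satisfy the step-size assumptions of the theorem:

```python
import random

def f(x):
    # f(x) = |x1| + |x2|: convex, nonsmooth, minimized at the origin
    return abs(x[0]) + abs(x[1])

def subgrad(x):
    # A subgradient of f at x: the coordinatewise sign
    # (0 is a valid choice of subgradient at a kink)
    return [float((v > 0) - (v < 0)) for v in x]

def stochastic_subgradient(x0, n_steps, noise_sd=0.5, seed=0):
    """Run x_{k+1} = x_k - gamma_k * g_k, where g_k is a subgradient
    corrupted by zero-mean Gaussian noise (so its conditional mean is a
    subgradient and its conditional second moment is bounded), with
    gamma_k = 1/(k+1): sum gamma_k = infinity, sum gamma_k^2 < infinity."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(n_steps):
        gamma = 1.0 / (k + 1)
        g = [gi + rng.gauss(0.0, noise_sd) for gi in subgrad(x)]
        x = [xi - gamma * gi for xi, gi in zip(x, g)]
    return x

x = stochastic_subgradient([3.0, -2.0], 20000)
print(x, f(x))  # the iterates approach the minimizer (0, 0)
```

Despite the noise and the nondifferentiability of $f$, the iterates settle near the origin, as Theorem 1 predicts.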
In various papers and books on stochastic approximations, e.g. [1-4], one can find very general theorems on the convergence of algorithms such as (1), including various additional error terms. Still, trying to apply these general theorems to the simple situation considered here, you might find that you need to make additional assumptions, for example that $f$ is continuously differentiable, that $X^*$ is a single point, and/or that you know somehow a priori that the sequence $(x_k)$ is bounded almost surely. The assumptions on $f$ and $X^*$ might not hold here (consider for example $f(x) = \max(0, \|x\| - 1)$, which is not differentiable on the unit sphere and whose set of minima is the whole unit ball), and the extra work required to prove that $(x_k)$ is bounded is essentially of the same order as proving the theorem from scratch in this case. This is not to say that Theorem 1 cannot be found in the literature. For example, the proof below adapts one found for a somewhat more specific context in [5]. But in my opinion a good book on stochastic approximation algorithms should reference such a basic and widely applicable theorem more clearly. The proof presented here uses the following useful version of the supermartingale convergence theorem, see [6].
Theorem 2 (Supermartingale Convergence Theorem) Let $(X_n)$, $(Y_n)$, $(Z_n)$ be three sequences of random variables and let $(\mathcal{F}_n)$ be a filtration (i.e., sigma algebras such that $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$). Suppose that
- The random variables $X_n$, $Y_n$ and $Z_n$ are nonnegative and $\mathcal{F}_n$-measurable.
- For each $n$, we have $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] \le X_n - Y_n + Z_n$.
- There holds, almost surely, $\sum_{n=0}^{\infty} Z_n < \infty$.
Then, we have $\sum_{n=0}^{\infty} Y_n < \infty$, and the sequence $(X_n)$ converges to a nonnegative random variable $X$, with probability $1$.
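To make the theorem concrete, here is a small numerical experiment (my own illustration, under assumed notation) with a sequence that satisfies the hypotheses exactly: a noisy scalar contraction $x_{n+1} = (1 - \gamma_n) x_n + \gamma_n \xi_n$ with $\gamma_n = 1/n$ and $\xi_n \sim N(0, \sigma^2)$ i.i.d. Taking $X_n = x_n^2$, one checks that $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n - Y_n + Z_n$ with $Y_n = \gamma_n (2 - \gamma_n) X_n \ge 0$ and $Z_n = \gamma_n^2 \sigma^2$, so that $\sum_n Z_n = \sigma^2 \pi^2 / 6 < \infty$:

```python
import random

# Noisy scalar contraction: x_{n+1} = (1 - gamma_n) x_n + gamma_n xi_n,
# xi_n ~ N(0, sigma^2) i.i.d.  With X_n = x_n^2 one checks exactly
#   E[X_{n+1} | F_n] = X_n - Y_n + Z_n,
# where Y_n = gamma_n (2 - gamma_n) X_n >= 0 and Z_n = gamma_n^2 sigma^2.
rng = random.Random(1)
sigma = 1.0
x = 5.0
Y_sum = Z_sum = 0.0
for n in range(1, 100001):
    gamma = 1.0 / n
    X = x * x
    Y_sum += gamma * (2.0 - gamma) * X       # accumulate Y_n
    Z_sum += gamma * gamma * sigma * sigma   # accumulate Z_n (sum < pi^2/6)
    x = (1.0 - gamma) * x + gamma * rng.gauss(0.0, sigma)
X_final = x * x

print(X_final, Y_sum, Z_sum)
```

In agreement with the theorem, $\sum_n Y_n$ is finite along the run and $X_n$ converges (here to $0$).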
Proof: (of the stochastic subgradient convergence theorem) For $x, y \in \mathbb{R}^n$ and $g \in \partial f(x)$, we have by definition of a subgradient

$$f(y) \ge f(x) + g^\top (y - x).$$
This inequality is fundamental for studying the progress made by many subgradient algorithms. We use it first by taking $y = x^*$, for some $x^* \in X^*$, to get

$$\mathbb{E}\left[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k\right] \le \|x_k - x^*\|^2 - 2\gamma_k \left( f(x_k) - f^* \right) + \gamma_k^2 C.$$
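The intermediate computation behind this bound, written out with the standard notation for (1) ($x_{k+1} = x_k - \gamma_k g_k$, with $\mathbb{E}[g_k \mid \mathcal{F}_k] \in \partial f(x_k)$ and $\mathbb{E}[\|g_k\|^2 \mid \mathcal{F}_k] \le C$), is just an expansion of the square:

```latex
\begin{aligned}
\|x_{k+1} - x^*\|^2
  &= \|x_k - \gamma_k g_k - x^*\|^2 \\
  &= \|x_k - x^*\|^2 - 2\gamma_k\, g_k^\top (x_k - x^*) + \gamma_k^2 \|g_k\|^2 .
\end{aligned}
```

Taking conditional expectations given $\mathcal{F}_k$, using $\mathbb{E}[g_k \mid \mathcal{F}_k]^\top (x_k - x^*) \ge f(x_k) - f^*$ (the subgradient inequality applied to the conditional mean of $g_k$) and $\mathbb{E}[\|g_k\|^2 \mid \mathcal{F}_k] \le C$, gives the stated bound.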
Now by the supermartingale convergence theorem, applied with $X_k = \|x_k - x^*\|^2$, $Y_k = 2\gamma_k \left( f(x_k) - f^* \right)$ and $Z_k = \gamma_k^2 C$, we deduce that $\sum_k \gamma_k \left( f(x_k) - f^* \right) < \infty$ almost surely, and the sequence $(\|x_k - x^*\|^2)$ converges almost surely. In particular, $(x_k)$ is bounded almost surely.
The first conclusion, combined with $\sum_k \gamma_k = \infty$, immediately implies that

$$\liminf_{k \to \infty} f(x_k) = f^*,$$
and it only remains to show that $(x_k)$ truly converges to one of the minima, rather than just to the set $X^*$. For this, take a countable set of points $y_1, y_2, \ldots$ in $X^*$, dense in $X^*$. For each such point $y_i$, we have as above that $(\|x_k - y_i\|^2)$ converges. Hence with probability $1$, all the sequences $(\|x_k - y_i\|^2)$, $i = 1, 2, \ldots$, converge. Since the sequence $(x_k)$ is bounded, it has at least one limit point, and since $\liminf_k f(x_k) = f^*$ and $f$ is continuous, at least one limit point lies in $X^*$. This limit point must be unique in view of the convergence of all the sequences $(\|x_k - y_i\|^2)$ and the density of the $y_i$ in $X^*$. Hence $(x_k)$ converges to a point of $X^*$. $\square$
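This last part of the argument — convergence to a single point even when $X^*$ is not a singleton — can be observed numerically. The following Python experiment (my own illustration, not from the proof) runs the noisy subgradient iteration on $f(x) = \max(0, |x| - 1)$, whose set of minima is the whole interval $[-1, 1]$; different noise realizations settle on different points of $X^*$, but each run converges rather than wandering across the set:

```python
import random

def f(x):
    # Convex function whose set of minima X* is the interval [-1, 1]
    return max(0.0, abs(x) - 1.0)

def subgrad(x):
    # 0 is a valid subgradient everywhere on [-1, 1]; sign(x) outside
    if abs(x) <= 1.0:
        return 0.0
    return 1.0 if x > 0 else -1.0

def run(seed, n_steps=20000, x0=3.0):
    """Noisy subgradient iteration with gamma_k = 1/(k+1); also records
    the iterate halfway through, to compare against the final one."""
    rng = random.Random(seed)
    x, halfway = x0, None
    for k in range(n_steps):
        gamma = 1.0 / (k + 1)
        x -= gamma * (subgrad(x) + rng.gauss(0.0, 0.5))
        if k == n_steps // 2:
            halfway = x
    return x, halfway

x_a, mid_a = run(seed=0)
x_b, mid_b = run(seed=7)
print(x_a, x_b)  # each run settles on its own point of [-1, 1]
```

The halfway and final iterates of each run are close to each other (the tail step sizes are square-summable, so the remaining movement is small), which is exactly the distinction between converging to a point of $X^*$ and merely approaching the set.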
[1] A. Benveniste, M. Métivier and P. Priouret, Adaptive Algorithms and Stochastic Approximations. Springer, 1990.
[2] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd Edition, 2003.
[3] J. C. Spall, Introduction to Stochastic Search and Optimization. Wiley-Interscience, 2003.
[4] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[5] D. P. Bertsekas, A. Nedic and A. E. Ozdaglar, Convex Analysis and Optimization. Athena Scientific, 2003.
[6] J. Neveu, Discrete Parameter Martingales. North-Holland, 1975.