
I use stochastic gradient and stochastic approximation algorithms regularly, and I often find it surprisingly difficult to locate basic results on the convergence of these algorithms, despite the multiple books written on the subject. For example, consider a convex function {f: \mathbb R^n \rightarrow \mathbb R}. Denote the set of minima of {f} by {X^* := \arg \min_{x \in \mathbb R^n} \{f(x)\}}. If {X^*} is not empty, then we know that it is a convex set, but we do not know a priori if {X^*} consists of a single point. Denote the subdifferential of {f} at {x}, i.e., the set of subgradients of {f} at {x}, by {\partial f(x)}. A stochastic subgradient algorithm is an algorithm of the form

\displaystyle x_0 \in \mathbb R^n, \;\; x_{k+1} = x_k - \gamma_k d_{k+1}, k \geq 0, \ \ \ \ \ (1)

with {g_{k+1} := E[d_{k+1} | \mathcal F_k] \in \partial f(x_k)}, for {\mathcal F_k = \sigma(x_0; d_l, 1 \leq l \leq k)}, and {\{\gamma_k\}} a sequence of nonnegative stepsizes. We can rewrite {d_k = g_k + \epsilon_k}, for {k \geq 1}, where {\{\epsilon_k\}_{k \geq 1}} is a martingale difference sequence with respect to {\{\mathcal F_k\}_{k \geq 0}}, i.e., {E[\epsilon_k | \mathcal F_{k-1}] = 0}, a.s., for all {k \geq 1}. We have the following theorem.

Theorem 1 Suppose that the set of minima {X^*} of {f} is nonempty, and that the stochastic subgradients satisfy

\displaystyle \sup_{k \geq 0} E[\|d_{k+1}\|^2|\mathcal F_k] < K, \;\; \text{for some } K < \infty.

Moreover, assume that {\sum_{k=0}^\infty \gamma_k = + \infty}, {\sum_{k=0}^\infty \gamma^2_k < + \infty}. Then the sequence of iterates {\{x_k\}_{k \geq 0}} following (1) converges almost surely to some minimum {x^* \in X^*}.
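For concreteness, here is a minimal Python sketch (my own, assuming NumPy; not taken from any of the references) of iteration (1) for the nonsmooth convex function {f(x) = \max\{-x,0,x-1\}}, whose set of minima is the interval {[0,1]}:

```python
import numpy as np

def subgrad(x):
    """A subgradient of f(x) = max(-x, 0, x - 1); the set of minima is [0, 1]."""
    if x < 0.0:
        return -1.0
    if x > 1.0:
        return 1.0
    return 0.0   # 0 is a valid subgradient everywhere on [0, 1]

rng = np.random.default_rng(0)
x = 10.0                              # starting point x_0
for k in range(200000):
    gamma = 1.0 / (k + 1)             # sum of gamma_k diverges, sum of gamma_k^2 converges
    d = subgrad(x) + rng.normal()     # noisy subgradient: E[d_{k+1} | F_k] is in the subdifferential
    x -= gamma * d
print(x)                              # ends up close to a point of X* = [0, 1]
```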

In various papers and books on stochastic approximation, e.g. [1-4], one can find very general theorems on the convergence of algorithms such as (1), including various additional error terms. Still, trying to apply these general theorems to the simple situation considered here, you might find that you need to make additional assumptions, for example that {f} is continuously differentiable, that {X^*} is a single point, or that you somehow know a priori that the sequence {\{x_k\}_{k \geq 0}} is bounded almost surely. The assumptions on {f} and {X^*} might not hold here (consider for example {f(x) = \max\{-x,0,x-1\}}), and the extra work required to prove that {x_k} is bounded is essentially of the same order as proving the theorem from scratch in this case. This is not to say that Theorem 1 cannot be found in the literature. For example, the proof below adapts one found for a somewhat more specific context in [5]. But in my opinion a good book on stochastic approximation algorithms should reference such a basic and widely applicable theorem more clearly. The proof presented here uses the following useful version of the supermartingale convergence theorem, see [6].

Theorem 2 (Supermartingale Convergence Theorem) Let {Y_k, X_k, Z_k, k = 0,1,\ldots} be three sequences of random variables and let {\{\mathcal F_k\}_{k \geq 0}} be a filtration (i.e., a sequence of sigma-algebras such that {\mathcal F_k \subset \mathcal F_{k+1}}). Suppose that

  1. The random variables {Y_k, X_k, Z_k} are nonnegative and {\mathcal F_k}-measurable.
  2. For each {k}, we have {E[Y_{k+1}|\mathcal F_k] \leq Y_k - X_k + Z_k}.
  3. There holds {\sum_{k = 0}^\infty Z_k < \infty}.

Then, we have {\sum_{k=0}^\infty X_k < \infty}, and the sequence {Y_k} converges to a nonnegative random variable {Y}, with probability {1}.

Proof: (of the stochastic subgradient convergence theorem) For {y \in \mathbb R^n} and {g_{k+1} \in \partial f(x_k)}, we have by definition of a subgradient

\displaystyle f(y) \geq f(x_k) + g_{k+1}^T (y-x_k).

Hence

\displaystyle \begin{array}{rcl} E[\|x_{k+1}-y\|^2 | \mathcal F_k] &=& E[\|x_k - \gamma_k d_{k+1} - y\|^2 | \mathcal F_k] \\ &=& \|x_k-y\|^2 - 2 \gamma_k (x_k-y)^T E[d_{k+1} | \mathcal F_k] + \gamma_k^2 E[\|d_{k+1}\|^2 | \mathcal F_k] \\ & \leq & \|x_k-y\|^2 - 2 \gamma_k (f(x_k)-f(y)) + \gamma_k^2 K. \end{array}

This inequality is fundamental to study the progress made by many subgradient algorithms. We use it first by taking {y = \bar x^*}, for some {\bar x^* \in X^*}, to get

\displaystyle E[\|x_{k+1}-\bar x^*\|^2 | \mathcal F_k] \leq \|x_k-\bar x^*\|^2 - 2 \gamma_k (f(x_k)-f(\bar x^*)) + \gamma_k^2 K.

Now by the supermartingale convergence theorem, applied with {Y_k = \|x_k - \bar x^*\|^2}, {X_k = 2 \gamma_k (f(x_k)-f(\bar x^*)) \geq 0} and {Z_k = \gamma_k^2 K} (which is summable by assumption), we deduce that almost surely {\sum_{k=0}^\infty \gamma_k (f(x_k)-f(\bar x^*)) < \infty} and the sequence {\{\|x_k - \bar x^*\|\}_k} converges. In particular, {\{x_k\}_k} is bounded almost surely.

The first conclusion, combined with the assumption {\sum_{k=0}^\infty \gamma_k = + \infty}, immediately implies that

\displaystyle \liminf_{k \rightarrow \infty} f(x_k) = f^* = \min_{x \in \mathbb R^n} f(x),

and it only remains to show that {\{x_k\}_k} truly converges to one of the minima, rather than merely approaching the set {X^*}. For this, take a countable set of points {\{x_i^*\}_{i \in I}} dense in {X^*}. For each such point {x_i^*}, we have as above that {\{\|x_k - x_i^*\|\}_k} converges. Hence with probability {1}, all the sequences {\{\|x_k-x^*_i\|\}_k}, {i \in I}, converge. Since the sequence {\{x_k\}_k} is bounded, it has at least one limit point; moreover, since {\liminf_k f(x_k) = f^*} and {f} is continuous (a finite convex function on {\mathbb R^n} is continuous), we can choose a limit point {\tilde x} belonging to {X^*}. Taking points {x_i^*} arbitrarily close to {\tilde x} and using the convergence of the sequences {\{\|x_k-x^*_i\|\}_k}, we see that {\tilde x} is the unique limit point of {\{x_k\}_k}. Hence {\{x_k\}_k} converges to {\tilde x \in X^*}. \Box

[1] A. Benveniste, M. Métivier, P. Priouret, Adaptive Algorithms and Stochastic Approximations. Springer, 1990.

[2] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd Edition, 2003.

[3] J. C. Spall, Introduction to Stochastic Search and Optimization. Wiley-Interscience, 2003.

[4] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.

[5] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar, Convex Analysis and Optimization. Athena Scientific, 2003.

[6] J. Neveu, Discrete Parameter Martingales. North-Holland, 1975.


The dominant software for control system design is currently, quite clearly, MATLAB. It has a nice Control System Toolbox, a Model Predictive Control Toolbox, a Robust Control Toolbox, and various other related toolboxes, such as Optimization and Signal Processing. Simulink is very useful for system design, and can be coupled with Stateflow to analyze hybrid systems.

Nevertheless, the main drawback of MATLAB is that it is not free, and there has been a desire to develop more widely accessible alternatives. The first that come to mind are Scilab and Octave. However, despite their qualities, these packages have suffered from a user base too small for their adoption to become widespread. For example, the Scilab Consortium was created in 2003 around INRIA (the French national institute for research in computer science and control), but its overwhelming majority still consists of French research institutes and companies. I believe that one important reason why the situation is unlikely to change is that MATLAB is universally used in the classroom to teach control system design, in part because students entering the industry are expected to know this language. Still, this is not a fully satisfying explanation: Scilab and Octave purposely use a syntax very similar to MATLAB's.

Anyway, another issue with MATLAB is that it is yet another language to learn, and new features tend to be introduced into the language slowly. For example, support for object-oriented programming was added only recently. Meanwhile, there is currently a strong trend toward using Python for scientific computing, and several scientific communities are enthusiastically developing in Python. Some examples of impressive packages include NumPy and SciPy, Matplotlib for visualization, and Sage, which is a wrapper for a large number of mathematical libraries. One can find an increasing number of libraries for signal processing, machine learning, computer vision, optimization, and so on. With regard to control systems libraries, however, there seems to be much less activity. Here are some related links that I know of:

Please let me know of any other Python library related to control systems.

I recently sparked an animated little discussion during a research meeting by questioning the usefulness of the averaging protocol for the consensus problem. It turns out that this is a fairly sensitive issue in the control community these days, and pretty much everybody has an opinion on this protocol. First, let me be clear that I was not arguing against the importance of consensus (or agreement) problems in distributed systems. They are discussed at length in books on real-time systems and distributed algorithms. They occupy several chapters of Nancy Lynch's book [1] for example, where they are defined (p. 81) as problems where processes in a network begin with arbitrary initial values of a particular type, and are supposed to eventually output the same value (they must agree). This output value must be of the same type as the input values and might be subject to additional conditions, such as being equal to the initial value of at least one process. More often, control theorists have been interested in agreeing on the average of the initial values. The field of distributed algorithms in computer science considers consensus problems for various notions of time in networks (synchronous or asynchronous networks), and is particularly concerned with consensus when some of the processes are subject to failure (in particular, Byzantine fault tolerance).

The consensus or agreement problem has generated a huge literature in control theory over the past decade, in particular since the now famous paper of Ali Jadbabaie et al. [2]. However, in contrast to the computer science literature, the problem is often not well defined when studied by control theorists. Fundamentally, the issue probably comes from the fact that in control papers, the processes are typically allowed to communicate real numbers without noise, at least in the basic version of the problem, a framework far too powerful from a theoretical computer science point of view (you could code an infinite amount of information in just one such transmission). So what is this control literature about? Well, in the control community, consensus has almost lost its actual meaning and has become synonymous with a particular dynamical scheme where processes organized in a network start with arbitrary initial values and continuously average their current value with the values of their neighbors. Following this update law, the value of each process indeed converges to the average of all initial values in the network, at least in the simplest situations, assuming the local averages are performed adequately.

It is important to remember that this local averaging scheme is just one possible algorithm to reach consensus in a network using local communications, and therefore I will keep using the term averaging algorithm instead of consensus when talking about this particular scheme. In this post I give more background and explanations on the consensus problem and the averaging protocol. In a follow-up post that I will publish soon, I discuss how the dynamics of this averaging algorithm very quickly lead to intriguing and still poorly understood behaviors. These behaviors have in fact attracted the interest of physicists, biologists and economists, for example, and from a scientific point of view it is certainly useful to study them. But I'm also planning to discuss the drawbacks and issues that I see impairing the use of the averaging algorithm in real-world engineering applications (there seem to be very few of those actually; the vast majority of papers only report simulations, where everything is of course simpler ;-)).

A Tutorial on the Averaging Algorithm.

I follow here a recent paper by Alex Olshevsky and John Tsitsiklis [3] (what I call the averaging algorithm here is called the “agreement algorithm” in that paper). Alex Olshevsky works on the consensus problem as part of his PhD work, and John Tsitsiklis was one of the first researchers to study the averaging algorithm [4], in particular in asynchronous and time-varying networks.

We consider a set of nodes {\mathcal N=\{1,2, \ldots,n \}}. Each node {i} starts with a scalar value {x_i(0)}, and the vector of values of all nodes at (discrete) time {t} is denoted {x(t)=(x_1(t),\ldots,x_n(t))}. The averaging algorithm updates {x(t)} at each period according to the dynamics {x(t+1) = A(t) x(t)}, where {A(t)} is a matrix with nonnegative entries {a_{ij}(t)}. Moreover, we assume that the row sums of {A(t)} are equal to {1}, so that {A(t)} is a stochastic matrix and {x_i(t+1)} is a weighted average of the previous values {x_j(t)}. Intuitively, {a_{ij}(t)>0} means that node {j} communicates its current value to node {i} at time {t}. We let {\mathcal E(t)} be the set of edges {(j,i)} at time {t} for which {a_{ij}(t)>0}, i.e., {\mathcal E(t)} represents the communication graph at time {t}.

Consider the following assumptions:

  1. There exists a positive constant {\alpha} such that {a_{ii}(t) \geq \alpha, \forall i,t}, and {a_{ij}(t) \in \{0\} \cup [\alpha,1], \forall i,j,t}.
  2. {\sum_{j=1}^n a_{ij}(t) = 1, \forall i,t}.
  3. (bounded interconnectivity times) There is some {B} such that for all {k}, the graph {(\mathcal N, \mathcal E(kB) \cup \mathcal E(kB+1) \cup \cdots \cup \mathcal E((k+1)B-1))} is strongly connected.

Then under these assumptions, the averaging algorithm guarantees asymptotic consensus, that is, there is a constant {c} (depending on {x(0)} and on the sequence of communication graphs) such that {\lim_{t \rightarrow \infty} x_i(t) = c} for all {i}. See [3] for references on this result. In the simplest case, the communication graph is fixed ({\mathcal E(t)=\mathcal E}) and the matrix {A(t)} is assumed constant ({A(t)=A}). If {A} is a doubly stochastic matrix (i.e., each column of {A} also sums to {1}), then we know that the value {c} is equal to the average of the initial values, {c=(\sum_{i=1}^n x_i(0)) / n}.
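As a quick illustration, here is a minimal NumPy sketch (my own toy example, not taken from [3]) of the fixed-graph case with a doubly stochastic matrix {A} on a 4-node ring:

```python
import numpy as np

# Doubly stochastic weights on a 4-node ring: each node averages its own
# value (weight 1/2) with its two neighbors (weight 1/4 each).
A = np.array([
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
])  # rows and columns sum to 1; self-weights a_ii >= alpha > 0

x = np.array([1.0, 5.0, -2.0, 8.0])  # arbitrary initial values x(0)
for t in range(200):
    x = A @ x                        # x(t+1) = A x(t)

print(x)  # all entries approach the initial average (1+5-2+8)/4 = 3.0
```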

Note that the averaging algorithm requires a priori that a central designer choose the matrix {A} and communicate the weights {a_{ij}} to the proper nodes, in order that {A} be stochastic or doubly stochastic. We can improve on this point using ideas from random walks on graphs. Node {i} can set {a_{ij}=1/d_i} for {j} a neighbor of {i}, where {d_i} is the number of neighbors of {i} in the (constant) communication graph, including {i} itself, and {a_{ij}=0} if {j} is not a neighbor of {i}. Then {A} is a stochastic matrix which is irreducible and aperiodic under the previous assumptions, with steady-state probabilities {\pi_i = d_i / (\sum_{j=1}^n d_j)}. This is called the equal-neighbor model. The paper [3] describes a variation of the averaging algorithm which converges to the average of the initial values and does not require any central coordination. Each node {i} sets {y_i(0)=1/d_i}, {z_i(0)=x_i(0)/d_i}, and the network runs in parallel the iterations {y(t+1) = A y(t)} and {z(t+1) = A z(t)}. Then each node sets {x_i(t)=z_i(t)/y_i(t)}, and we have {\lim_{t \rightarrow \infty} x_i(t) = (\sum_{j=1}^n x_j(0)) / n, \forall i}. The main issue with this algorithm is that it takes {\Theta(n^3 \log (n/\epsilon))} steps for {x(t)} to be {\epsilon}-close to its limit value. The {n^3} term in particular is quite bad, since in practice the algorithm is supposed to run on a large number of nodes.
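Here is a small NumPy sketch of this ratio trick (again on a toy graph of my own choosing, for illustration only); every entry of {z(t)/y(t)} converges to the exact average, with no centrally designed weights:

```python
import numpy as np

# Undirected communication graph given by neighbor lists; by convention each
# node counts itself among its neighbors, as in the text.
neighbors = {0: [0, 1], 1: [0, 1, 2, 3], 2: [1, 2], 3: [1, 3]}
n = len(neighbors)
d = np.array([len(neighbors[i]) for i in range(n)], dtype=float)  # degrees d_i

# Equal-neighbor weights: a_ij = 1/d_i if j is a neighbor of i, else 0.
A = np.zeros((n, n))
for i in range(n):
    for j in neighbors[i]:
        A[i, j] = 1.0 / d[i]   # A is stochastic but not doubly stochastic

x0 = np.array([1.0, 5.0, -2.0, 8.0])
y = 1.0 / d                    # y_i(0) = 1/d_i
z = x0 / d                     # z_i(0) = x_i(0)/d_i
for t in range(500):
    y = A @ y                  # y(t+1) = A y(t)
    z = A @ z                  # z(t+1) = A z(t)

print(z / y)                   # every entry approaches mean(x0) = 3.0
```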

If we allow a central designer to choose the matrix {A}, then we can improve the convergence speed of the algorithm significantly. In general, any matrix {A} (not even necessarily nonnegative) such that the iterations {x(t+1)=A x(t)} converge to a vector {c \mathbf 1} can be used to obtain consensus, with additional conditions needed in order for {c} to be the actual average of the initial values. In particular, we can simply consider matrices {A} such that {1} is an eigenvalue of {A} with multiplicity {1} and right eigenvector {\mathbf 1}, the corresponding left eigenvector {\pi} has nonzero entries, and all other eigenvalues have magnitude less than {1}. The paper [3] explains that we can easily construct such an {A} so that a consensus value is approached at an arbitrarily fast rate. However, this scheme seems numerically impractical, and moreover it forces convergence to the initial value of one of the nodes, not the average value. Taking these points into account, Olshevsky and Tsitsiklis describe an averaging algorithm which converges to the average value in time {O(n^2 \log (n/\epsilon))}, essentially by executing the algorithm on a spanning tree (deleting any additional arcs that the initial network might have). By considering line graphs, they show that the bound for their algorithm is tight, and also that more sophisticated methods for choosing {A}, such as [5], take a similar {\Omega(n^2 \log(1/\epsilon))} number of steps to converge on these graphs (for both algorithms, the convergence might be better on different types of graphs).

Finally, for time-varying graphs with bounded interconnectivity times as above, the authors show that the convergence time of the averaging algorithm for the equal-neighbor model is not bounded by a polynomial in {n} and {B}. However, they describe a different algorithm, a variation of an old load-balancing algorithm, which converges to an {\epsilon}-neighborhood of the average of the initial values in time {O(B n^3 \log (1/\epsilon))}. This algorithm is not an averaging protocol like the ones described previously; in particular, it is a nonlinear scheme.

That’s it for today. I’ll post more on this topic soon.

References

[1] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann, April 1997.

[2] A. Jadbabaie, J. Lin, and A.S. Morse. “Coordination of Groups of Mobile Autonomous Agents Using Nearest Neighbor Rules”. IEEE Transactions on Automatic Control, Vol. 48, No. 6, June 2003.

[3] Alex Olshevsky and John Tsitsiklis, “Convergence Speed in Distributed Consensus and Averaging”, SIAM Journal on Control and Optimization, 48(1), pp. 33-55, 2009.

[4] J.N. Tsitsiklis, “Problems in Decentralized Decision Making and Computation”, Ph.D. Thesis, Department of EECS, MIT, 1984.

[5] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging”, Systems and Control Letters, vol. 53, pp. 65-78, 2004.

Werner Vogels (Amazon.com’s CTO) has an interesting article on his blog about the necessary trade-offs involved in building large reliable distributed databases. It seems that some of these ideas could be useful when thinking about building large sensing networks, for example groups of mobile robots (e.g., UAVs) collecting data on their environments (this is the main focus of the SWARMS and HUNT projects here at Penn). As mobile robotic networks grow in size, some robots will probably have to make decisions without having time to collect all the currently available useful information from all the other robots. Moreover, just as with web services, a robotic network is expected to be highly available. Currently, if communication with a UAV is lost for only a few minutes, the UAV is programmed to return to its base automatically. Hence temporary communication loss can mean the end of a critical mission. With a network of robots, we can think of implementing some degree of fault tolerance, so that the service can still perform as expected even if a node or a particular communication link fails. It might be useful in these scenarios to think about the fundamental limits to sharing consistent information between decision/sensing nodes.

One of the trade-offs in distributed databases (DBs) is between high availability and data consistency. Database replication techniques aim at achieving consistency across different nodes, but as a system grows in size, availability becomes an issue. In 2000, Eric Brewer conjectured that in a shared-data system, only two of the following three properties can be achieved at the same time:

  • data consistency,
  • system availability,
  • and tolerance to network partitions [1].

Seth Gilbert and Nancy Lynch formalized this conjecture in a 2002 paper [2]. The system responding to client requests is a distributed shared memory. (Atomic) consistency requires that the requests act as if they were executing on a single node, one at a time. Availability means that every request received by a non-failing node must result in a response (there is no bound on the response time here). To model partition tolerance, the network is allowed to lose arbitrarily many messages sent from one node to another. The first network model used in that paper is the asynchronous network model [3]. That is, there is no clock, and nodes must make decisions based only on the messages received and local computations. Under such difficult operating conditions, the impossibility result is clear. For example, consider a network which is available and tolerant to network partitions. Assume now that it is partitioned into two components G_1 and G_2, i.e., that all messages between the two components are lost. Consider a client that writes data to component G_1 and, once the data is written, reads it from G_2. Both operations will succeed by our availability assumption. Yet the read cannot return the updated value, since no message managed to cross between G_1 and G_2. This violates the atomic consistency property. Moreover, in the asynchronous network model, replacing lost messages by arbitrarily delayed messages does not change the result.
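The scenario can be made concrete with a toy sketch in Python (my own illustration, not from [2]): two replicas that cannot exchange messages but still answer every request cannot remain atomically consistent.

```python
# Two partitioned replicas of a key-value store: no replication messages
# get through, yet each replica answers every request (availability).

class Replica:
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        self.store[key] = value      # responds without hearing from its peer
        return "ok"

    def read(self, key):
        return self.store.get(key)   # responds with whatever it has locally

g1, g2 = Replica(), Replica()        # components G_1 and G_2 of the partition

g1.write("x", 42)                    # the write succeeds in G_1
print(g2.read("x"))                  # the read in G_2 returns None, not 42:
                                     # availability under partition forces a
                                     # violation of atomic consistency
```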

Consider now a partially synchronous model, in which each node in the network has a clock. These clocks all increase at the same rate, but are not synchronized (they may give different values at the same real time). They can be used as local timers to measure the time elapsed since a given event. Also assume that messages not delivered within a known time t_{msg} are lost, and that every node processes a received message within a known time t_{local}. Even with this more powerful system, the impossibility result still holds when arbitrary messages may be lost, by an argument similar to the one for the asynchronous model.

Guaranteeing two of the three properties is achievable by trivial algorithms. The more interesting problem is to guarantee two of the properties while also guaranteeing a weaker version of the third. Web caches are one example of available, partition-tolerant, and weakly consistent networks. More generally, for partially synchronous networks, we can design a (centralized) algorithm which is available, partition tolerant, and guarantees a return to consistency within some time limit if no message is lost for a certain time [2] (see also the related notion of eventual consistency discussed by Vogels). In large databases used for web services, network partitions are a fact, and therefore complete consistency and system availability cannot be achieved at the same time in general. If a system emphasizes consistency, it may not always be available to take a write. If it emphasizes availability, a read might not always return the most recently completed write. Depending on the application, one might decide to give priority to one property over the other.

References

  1. Brewer, E. A. 2000. Towards robust distributed systems (abstract). In Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (July 16-19, Portland, Oregon): 7.
  2. Gilbert, S., Lynch, N. 2002. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2).
  3. Lynch, N. 1996. Distributed Algorithms. Morgan Kaufmann.