Monday, November 29, 2010

An almost optimal algorithm for locating long runs of zeros in streaming data

The problem. We are given a large number N of data streams, where each data stream consists of n nonnegative integer values arriving over time. At each time step, before the new values arrive, we choose k of these streams, where k << N, and after the values arrive, we get a score equal to the total of the new data values from all k streams. There are various restrictions on how to choose these k streams, the details of which we do not go into here. The objective is to maximize the total score over the course of the n steps.

Our results. Although that is the intended objective, we developed an algorithm that achieves something quite different, in that it locates the streams with long runs of 0s. In fact it also picks out the boundaries of such runs almost optimally, e.g. it switches to another data stream precisely when the next data value is going to be large.

The algorithm. The algorithm can be described very simply as "madness in the author's head".

Evaluation of results. We run our algorithm on the English Premier League 2010/11 season data. This means N > 500, n = 38, k = 11 (or 15). At the time of writing, only 15 weeks have passed, but we already obtained some impressive results:


Figure 1: The algorithm selected this data stream in the weeks shown (in red rectangle). Note how it narrowly missed the two large numbers around the region.


Figure 2: The algorithm picked the precise moment to switch from the first stream to the second, avoiding the two 1s.

No comments: