Smoothing a discrete set of data

Paul Moore gustav at morpheus.demon.co.uk
Sat Sep 7 06:02:02 EDT 2002


This isn't so much a Python question as an algorithms question. I
couldn't find a newsgroup for discussing algorithms (is there a good
one?), so as I'm implementing the code in Python I thought I'd try
here - I hope that's OK.

I have a set of data, basically a histogram. The data is pretty
variable, and I'd like to "smooth" it to find trends. Actually, this
comes up a *lot* for me. Some examples: sampled IO rates on a machine,
where I want to look for trends (is IO higher overnight or during the
day, etc?), or fuel consumption for my car (do I use less fuel since I
had it serviced?).

Normally, what I do with things like this is draw a graph and try to
spot the trends "by eyeball". But it feels to me like I should be able
to write something that smooths the data out. I just don't see how
:-)

Take an example - here's a rough ASCII-art diagram:

  9     XX                                  XX
  8     XX XX                               XX
  7  XX XX XX                               XX XX
  6  XX XX XX                            XX XX XX
  5  XX XX XX                            XX XX XX
  4  XX XX XX                            XX XX XX
  3  XX XX XX XX          XX          XX XX XX XX
  2  XX XX XX XX XX    XX XX XX    XX XX XX XX XX
  1  XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX

This seems to me to have very clear peaks at the left and right, and a
shallow trough in the middle. I'd say that you could "smooth" this to
something that averaged 8 for the first 3 bars, then 2 for the next 9,
then 7 for the last 3. But try to make that precise...
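
To make that concrete, here's roughly what I mean in Python. The bar
heights are read off the diagram above (so treat them as approximate),
and the breakpoints at 3 and 12 are hand-picked by me - picking them
automatically is exactly the bit I don't know how to do:

data = [7, 9, 8, 3, 2, 1, 2, 3, 2, 1, 2, 3, 6, 9, 7]

def segment_averages(values, breakpoints):
    # Average the values within each segment; 'breakpoints' lists the
    # indices at which a new segment starts.
    edges = [0] + list(breakpoints) + [len(values)]
    return [sum(values[a:b]) / len(values[a:b])
            for a, b in zip(edges, edges[1:])]

print(segment_averages(data, [3, 12]))   # roughly [8.0, 2.1, 7.3]

The averaging is the trivial part; it's choosing where the steps go
that I'm stuck on.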

When I've tried to formalise this, I've always stumbled over the fact
that it's a trade-off exercise (there must be some sort of "tolerance"
parameter in the algorithm) - on the one hand, the original data is
"best" (in the sense that it matches exactly), whereas on the other
hand, a single horizontal line at the average is "best" (fewest
steps).
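
The only way I've found even to state that trade-off is to score a
proposed smoothing with something like the function below. The
squared-error measure and the flat per-step penalty are just my
guesses at a reasonable scheme, with the penalty playing the part of
that "tolerance" parameter:

def segmentation_cost(values, breakpoints, penalty):
    # Squared error of the data against the segment averages, plus a
    # fixed penalty for each step.  With penalty = 0 the raw data wins
    # (every bar gets its own step); with a huge penalty a single flat
    # line at the overall average wins.
    edges = [0] + list(breakpoints) + [len(values)]
    cost = 0.0
    for a, b in zip(edges, edges[1:]):
        segment = values[a:b]
        mean = sum(segment) / len(segment)
        cost += sum((v - mean) ** 2 for v in segment)
    return cost + penalty * (len(edges) - 1)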

My instinct is that you're looking for something that balances the
number of steps against the difference from the original data - which
is what that cost function is groping towards. But for the life of me
I can't see how to turn the intuition into an actual algorithm.
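
The nearest I've got is, for a given penalty, to search for the
smoothing with the lowest such cost - something like this
dynamic-programming sketch (my own ad hoc attempt, nothing
authoritative; the penalty value of 5.0 is plucked out of the air, and
it uses the data list and segment_averages() from earlier in this
message):

def best_segmentation(values, penalty):
    # For a given penalty, find the breakpoints that minimise
    # segmentation_cost().  Dynamic programming over prefixes of the
    # data; brute-ish, but fine for a handful of bars.
    n = len(values)
    best = [(0.0, [])] + [None] * n   # best[i] = (cost, breaks) for values[:i]
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):            # the final segment is values[j:i]
            segment = values[j:i]
            mean = sum(segment) / len(segment)
            fit = sum((v - mean) ** 2 for v in segment)
            prev_cost, prev_breaks = best[j]
            breaks = prev_breaks + ([j] if j else [])
            candidates.append((prev_cost + fit + penalty, breaks))
        best[i] = min(candidates, key=lambda c: c[0])
    return best[n]

cost, breaks = best_segmentation(data, penalty=5.0)
print(breaks)                           # -> [3, 12] for the data above
print(segment_averages(data, breaks))   # the three levels I described

For the example above that does pick out breaks at 3 and 12 - the
three blocks I described - but choosing the penalty still feels
arbitrary, and I've no idea whether this is a standard technique with
a proper name or just a hack.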

Can anyone help me? I can't believe that this problem has never come
up before, but I can't find any literature on it.

Thanks in advance,
Paul Moore.


