[Spambayes] RE: Spam vs time-of-day

Tim Peters tim.one@comcast.net
Mon Oct 28 19:29:30 2002


[Skip Montanaro]
> "*his*" refers to Bruce, right?

Right.

> My contention after plotting time buckets was the same: that spam was
> generally sent at a continuous rate.

No, your graph and mine both show that it falls off in the early-morning
hours.  Offline I did a chi-squared test against the hypothesis that the
spam was evenly distributed, and the probability that random data could be
so skewed was < 1e-18.  But ham falls off much more.
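
For anyone who wants to repeat that kind of check, here is a minimal sketch
of a chi-squared test against a uniform distribution.  It is not the code I
ran offline; the function name and the list-of-counts input are assumptions
made for illustration.  It returns only the statistic and the degrees of
freedom, and leaves the p-value to a table or a stats package.

```python
def chi_squared_uniform(counts):
    """Return (statistic, degrees_of_freedom) for the null hypothesis
    that every bucket is equally likely.

    `counts` is a hypothetical list of per-bucket message counts,
    e.g. 144 entries for 10-minute buckets over a day.
    """
    n_buckets = len(counts)
    total = sum(counts)
    if n_buckets < 2 or total == 0:
        raise ValueError("need at least two buckets and one observation")
    # Under the uniform hypothesis each bucket expects the same count.
    expected = total / n_buckets
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, n_buckets - 1
```

The p-value is the upper tail of the chi-squared distribution at that
statistic.  If SciPy is available, scipy.stats.chi2.sf(statistic, df)
computes it.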

> Ham, on the other hand, does have a strong diurnal pattern.

Very.

> I posted a gnuplot graph to that effect back at the end of September.
> That's what convinced me to try mining information from the Date: header.
> For completeness, I've attached my original graph.  I believe the x-axis
> is the 6-minute bucket offset, starting from midnight.

Your buckets span 10 minutes.  The comment in the code is confused about
this too.  That's why your graph and mine both have 144 points on the X axis
(24 * 6 = 144; you have six *buckets* per hour, and each spans 10 minutes).
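
The arithmetic can be sketched as below.  This is an illustration only, not
the checked-in code; the names are made up for the example.

```python
BUCKETS_PER_HOUR = 6
MINUTES_PER_BUCKET = 60 // BUCKETS_PER_HOUR   # 10 minutes per bucket
BUCKETS_PER_DAY = 24 * BUCKETS_PER_HOUR       # 144 points on the X axis


def bucket_index(hour, minute):
    """Map a time of day to its bucket offset from midnight (0..143)."""
    return hour * BUCKETS_PER_HOUR + minute // MINUTES_PER_BUCKET


def bucket_start(index):
    """Inverse mapping: the (hour, minute) at which a bucket begins."""
    hour, bucket_in_hour = divmod(index, BUCKETS_PER_HOUR)
    return hour, bucket_in_hour * MINUTES_PER_BUCKET
```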

> The large spike at 0 is an artifact of my simpleminded Date header
> scanning.  Invalid dates probably wound up with a value of 0.

And at that time, *every* Date header generated a dow:invalid token (as well
as the correct token, when possible).  That's been repaired since then.

> Buckets were calculated using local time.  That way I didn't penalize
> Anthony Baxter and other folks who happen not to live in the US.

I'm unsure what "were calculated using local time" means.  Does the
checked-in code do that or not?  I took what the checked-in code produced
at face value (after untangling the hour.bucket_number format into
hour.minute).
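
To make the two readings concrete, here is a rough sketch that computes a
bucket both ways from a Date: header: once from the clock time as written
by the sender, and once after converting to UTC.  It uses the standard
email.utils functions and is not what the tokenizer does; the function name
and return convention are assumptions for the example.

```python
import time
from email.utils import mktime_tz, parsedate_tz


def date_buckets(date_header, buckets_per_hour=6):
    """Return (sender_local_bucket, utc_bucket), or None if unparseable.

    Hypothetical helper: the first bucket uses the hour and minute as
    they appear in the header, the second uses the same instant in UTC.
    """
    parsed = parsedate_tz(date_header)
    if parsed is None:
        # Unparseable date: report that, rather than lumping it into 0.
        return None
    minutes_per_bucket = 60 // buckets_per_hour
    hour, minute = parsed[3], parsed[4]
    sender_local = hour * buckets_per_hour + minute // minutes_per_bucket
    # mktime_tz applies the header's zone offset to get a UTC timestamp.
    utc = time.gmtime(mktime_tz(parsed))
    utc_bucket = utc.tm_hour * buckets_per_hour + utc.tm_min // minutes_per_bucket
    return sender_local, utc_bucket
```

With the sender's clock time, a message written at 9am lands in the same
bucket wherever the sender lives; with UTC, it lands wherever that instant
falls on the UTC clock.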

I doubt that it matters, though.  Most c.l.py traffic in my corpus is sent
from the U.S., and in any case enabling these things didn't help my results
(the spamprobs were too mild to make a difference).