December 2002 comp.lang.* stats

Sat Jan 25 22:14:00 EST 2003

Erik Max Francis wrote:
> 
> Peter Hansen wrote:
> 
> > Spam is probably a problem best ignored.  It would probably
> > affect all those groups equally anyway.
> 
> Actually, that's one of the problems with his collapsing hierarchies
> into a single number.  To first order, spammers would probably post to
> every comp.* group with the same frequency.  So if a hierarchy contains
> six groups, the raw numbers will likely be overcounting spam by
> approximately a factor of six, as compared to a solitary newsgroup.

I would think that counting only unique posters would eliminate a lot
of this effect, as the same poster would be sending to each newsgroup.
Yes, many use random addresses... but don't they still send in bulk?
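
To make the factor-of-six argument concrete, here is a rough sketch
with made-up data. The Post record, the group names, the addresses and
both helper functions are hypothetical, invented for illustration; none
of this is taken from the actual stats script.

```python
from collections import namedtuple

# Hypothetical record type: one entry per article seen in a group.
Post = namedtuple("Post", ["group", "sender"])

def raw_count(posts, groups):
    """Count every article posted to any of the given groups."""
    return sum(1 for p in posts if p.group in groups)

def unique_poster_count(posts, groups):
    """Count distinct sender addresses across the given groups."""
    return len({p.sender for p in posts if p.group in groups})

# Made-up data: a six-group hierarchy, one regular poster per group,
# and one spammer who hits every group from the same address.
hierarchy = ["comp.lang." + c for c in "abcdef"]
regulars = [Post(g, "user%d@example.invalid" % i)
            for i, g in enumerate(hierarchy)]
spam = [Post(g, "spammer@example.invalid") for g in hierarchy]
posts = regulars + spam

solitary = ["comp.lang.a"]

# Raw spam articles: whole hierarchy versus a single group.
print(raw_count(spam, hierarchy), raw_count(spam, solitary))

# Distinct spam senders: whole hierarchy versus a single group.
print(unique_poster_count(spam, hierarchy),
      unique_poster_count(spam, solitary))

# The same spam run, but with a different forged address per article.
forged = [Post(g, "x%d@example.invalid" % i)
          for i, g in enumerate(hierarchy)]
print(unique_poster_count(forged, hierarchy))
```

In this toy case the raw count sees the spammer six times in the
hierarchy and once in the solitary group, while the unique-poster count
sees him once either way. With a forged address per article the
unique-poster count no longer helps, which is the random-address caveat
above.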

> To second order, there's probably an additional effect of newsgroups
> with names that sort lexicographically early getting more spam, since
> most spammers post their spam sequentially, and those that get forcibly
> stopped will be less likely to hit comp.lang.z than comp.lang.a.

I strongly doubt anyone gets stopped fast enough to prevent them from
spamming one comp.lang group shortly after they've hit another one.

In the end, my comment should really be taken as "spam is a small
enough issue, in my experience, to be ignored in the results as
mere noise".  I readily admit my experience is limited to c.l.p
and several other groups *not* in the c.l. hierarchy, so maybe 
some of those other groups get *much* more spam than c.l.p, but
I sort of doubt it.  Maybe someone will take the time to calculate
actual numbers to prove or disprove this point.  I wouldn't bother
though.

-Peter