[Spambayes] defaults vs. chi-square

Tim Peters tim.one@comcast.net
Mon, 14 Oct 2002 23:03:08 -0400


[T. Alexander Popiel, tracks down a source of his many "skip" tokens]
> ...
> It appears to be a systematic error when a mailing list manager
> appends plain text to what should be a base64 encoded segment.
> Bad MLM, no biscuit.  This confuses the MIME decoder. Bad MIME
> decoder, too!
>
> As a sample:
>
> """
> [headers]
> ...
> Content-Type: text/plain
> Content-Transfer-Encoding: base64
> ...
>
> DQpUck1lbG9kaSwgS/1y/WsgbGlua2xpIOdhbP3+bWF5YW4gdmUgYmlydGVrIG1wMyD8IGlu
> ZGlyaXJrZW4gYmlsZSBpbnNhbmxhcv0ga2FocmVkZW4gc/Z6ZGUgbXAzIHNpdGVsZXJpbmUg
> YWx0ZXJuYXRpZiANCm9sYXJhayBzaXpsZXIgaedpbiD2emVubGUgaGF6/XJsYW5t/f50/XIu
> IEhlciB5Yf50YW4gaGVyIGtlc2ltZGVuIG38emlrc2V2ZXJlIGhpdGFwIGVkZWJpbG1layBp
> 52luIHRhc2FybGFubf3+IDEzIEdCIA0KbP1rIGRldiBNcDMgbGlzdGVzaXlsZSBz/W79Zv1u
> ZGEgcmFraXBzaXogb2xhY2FrIP5la2lsZGUgZG9uYXT9bG39/iB2ZSBzaXogbfx6aWtzZXZl
> cmxlcmluIGhpem1ldGluZSBzdW51bG11/nR1ci4gDQpodHRwOi8vd3d3LnRybWVsb2RpLmNv
> bSBhZHJlc2luZGVraSBkZXYgYXL+aXZpbWl6ZGUgc2l6aSBiZWtsZXllbiBlbiBzZXZkafBp
> bml6IHNhbmF05/1sYXL9biBlbiBzZXZkafBpbml6IA0K/mFya/1sYXL9bv0gYmlya2HnIGRh
> a2lrYSBp52luZGUgYmlsZ2lzYXlhcv1u/XphIGluZGlyaW4gdmUga2V5aWZsZSBkaW5sZW1l
> eWUgYmH+bGF5/W4uIA0KDQrdeWkgRfBsZW5jZWxlci4uIA0KaHR0cDovL3d3dy50cm1lbG9k
> aS5jb20NCg0KDQoNCg0K
>
>
> --
> To UNSUBSCRIBE, email to debian-java-request@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact
> listmaster@lists.debian.org

Ouch.  That would do it, all right, here in tokenizer.py:

        for part in textparts(msg):
            # Decode, or take it as-is if decoding fails.
            try:
                text = part.get_payload(decode=True)
            except:
                yield "control: couldn't decode"
                text = part.get_payload(decode=False)

The base64 decoder will barf on that kind of msg, but you've got so many of
these in your ham that even the "couldn't decode" metatoken is taken as a
strong ham clue:

    prob("control: couldn't decode") = 0.0652174

I overlooked that in your msg before.

So, Barry, what can we do about this?  Filling the database with "skip"
tokens from raw base64 is a Bad Idea, and I assume the email pkg doesn't
know how to, e.g., "decode base64 up until it can't anymore, and then grab
the rest as plain text".  Heh -- just writing that made me want to puke.  We
have to do something better with this, though.
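
For the sake of argument, here is a rough sketch of the "decode base64 up until
it can't anymore, and then grab the rest as plain text" idea. It is not
anything the email package or tokenizer.py does: split_base64_prefix is a
hypothetical helper, and the rule for deciding where the base64 stops (first
non-blank line containing a character outside the base64 alphabet) is an
assumption. A one-word footer line such as "Thanks" would still be mistaken
for base64, so treat it as an illustration only.

```python
import base64
import binascii
import re

# Hypothetical helper; the "looks like a base64 line" test is an assumption.
_B64_LINE = re.compile(rb'[A-Za-z0-9+/]+={0,2}')

def split_base64_prefix(payload):
    """Split a bytes payload into (decoded_prefix, undecoded_rest).

    Leading lines that look like base64 are decoded; everything from the
    first line that does not look like base64 is returned untouched.
    """
    lines = payload.splitlines(keepends=True)

    # Count the leading lines that are blank or pure base64 alphabet.
    n = 0
    for line in lines:
        stripped = line.strip()
        if stripped and not _B64_LINE.fullmatch(stripped):
            break
        n += 1

    # Back off one line at a time until the prefix decodes cleanly.
    while n > 0:
        candidate = b"".join(line.strip() for line in lines[:n])
        try:
            decoded = base64.b64decode(candidate, validate=True)
        except binascii.Error:
            n -= 1
            continue
        return decoded, b"".join(lines[n:])

    # Nothing decodable at the front: hand back the payload as-is.
    return b"", payload
```

A caller could tokenize the decoded part as usual and then decide separately
whether the leftover footer text is worth tokenizing at all. Either way the raw
base64 lines never reach the tokenizer, so they can't turn into "skip" tokens.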