[spambayes-dev] fp: innocuous text in hidden html <input>
Doug Wyatt
doug at sonosphere.com
Fri Oct 3 10:40:09 EDT 2003
On Oct 3, 2003, at 10:16, Skip Montanaro wrote:
> Doug> I have to admit this is a clever spam technique -- I've taken a
> Doug> quick look in the archives and read through tokenizer.py and seen
> Doug> nothing about it. The trick is that the message has a number of
> Doug> <input> elements with type=hidden and value=something very hammy.
>
> I believe the tokenizer strips out all HTML tags, at least it makes a good
> effort to do so. It uses a fancy-schmancy regular expression Tim Peters
> wrote to make it fast, but I believe it's also limited in what it believes
> the maximum length of an HTML tag can be:
>
> # Cheap-ass gimmick to probabilistically find HTML/XML tags.
> # Note that <style and HTML comments are handled by crack_html_style()
> # and crack_html_comment() instead -- they can be very long, and long
> # minimal matches have a nasty habit of blowing the C stack.
> html_re = re.compile(r"""
>     <
>     (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
>     # guessing that other tags are usually "short"
>     [^>]{0,256} # search for the end '>', but don't run wild
>     >
> """, re.VERBOSE | re.DOTALL)
>
> It's not completely obvious, but it appears the <input> tag in your message
> contains over 300 characters, so it would be missed by the above regular
> expression. I don't know if it's time to try something different, boost the
> above 256 to something larger, or do nothing and rely on more training to
> squash that bug.
Thanks, Skip, I see now ...
Maybe the thing to do would be to continue using the cheap gimmick
(it's efficient), then make another pass looking for longer elements?
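
A minimal sketch of that two-pass idea, for illustration only. The first
pattern is the one quoted above; everything else (long_html_re,
LONG_TAG_LIMIT, strip_tags, and replacing tags with a single space) is a
hypothetical name or assumption, not something taken from tokenizer.py:

```python
import re

# First pass: the cheap pattern quoted above (tags up to 256 chars inside).
html_re = re.compile(r"""
    <
    (?![\s<>])    # don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
    [^>]{0,256}   # search for the end '>', but don't run wild
    >
""", re.VERBOSE | re.DOTALL)

# Second pass (assumption): same shape with a much larger, arbitrary bound,
# so it only has work to do on tags the first pass left behind.
LONG_TAG_LIMIT = 4096
long_html_re = re.compile(
    r"<(?![\s<>])[^>]{0,%d}>" % LONG_TAG_LIMIT, re.DOTALL)


def strip_tags(text):
    """Remove short tags cheaply, then sweep up any longer ones."""
    text = html_re.sub(" ", text)
    return long_html_re.sub(" ", text)
```

With a hidden <input> whose value runs past 256 characters,
html_re.search() finds nothing, while strip_tags() removes the whole tag
along with the hammy text inside it. The second bound is still finite, so
a sufficiently long tag would slip through again.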
> Do you still have that message so you can post it as an attachment in its
> entirety?
Sure ...
Doug
-------------- next part --------------
Skipped content of type multipart/appledouble