[Spambayes-checkins] spambayes/spambayes tokenizer.py,1.12,1.13
Tim Peters
tim_one at users.sourceforge.net
Fri Jun 27 19:51:59 EDT 2003
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv29852/spambayes
Modified Files:
tokenizer.py
Log Message:
A new stripper to squash yet another way of hiding content in HTML spam,
like

    Ere<frame><noframes>ywl55</noframes></frame>ctions

to hide Erections. I haven't seen this often yet, but (of course) it's
been effective so far.
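
For readers following along, here is a minimal standalone sketch of what
removing a <noframes> span looks like, using the same two regular expressions
this check-in passes to Stripper. It is not the project's Stripper class,
whose body is not part of this diff; the function name strip_noframes and the
choice to leave an unclosed <noframes> untouched are assumptions made for
illustration only.

```python
import re

# The same two patterns the check-in hands to Stripper.__init__.
find_start = re.compile(r"<\s*noframes\s*>").search
find_end = re.compile(r"</noframes\s*>").search


def strip_noframes(text):
    """Return text with each <noframes>...</noframes> span removed.

    Hypothetical helper for illustration; not part of tokenizer.py.
    """
    kept = []
    pos = 0
    while True:
        start = find_start(text, pos)
        if start is None:
            # No further opening tag: keep the remainder as-is.
            kept.append(text[pos:])
            break
        kept.append(text[pos:start.start()])
        end = find_end(text, start.end())
        if end is None:
            # Assumption: an unclosed <noframes> is left in place.
            kept.append(text[start.start():])
            break
        pos = end.end()
    return "".join(kept)


sample = "Ere<frame><noframes>ywl55</noframes></frame>ctions"
print(strip_noframes(sample))  # prints: Ere<frame></frame>ctions
```

Note that the hidden filler "ywl55" is gone, but the empty <frame></frame>
pair still splits the word; dealing with those leftover tags is outside what
this check-in shows. The patterns as written are also case-sensitive, so this
sketch would not match <NOFRAMES> unless the text were lowercased first.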
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** tokenizer.py 18 Jun 2003 15:08:59 -0000 1.12
--- tokenizer.py 28 Jun 2003 01:51:57 -0000 1.13
***************
*** 1011,1014 ****
--- 1011,1022 ----
  crack_html_comment = CommentStripper().analyze
+ # Nuke stuff between <noframes> </noframes> tags.
+ class NoframesStripper(Stripper):
+     def __init__(self):
+         Stripper.__init__(self,
+                           re.compile(r"<\s*noframes\s*>").search,
+                           re.compile(r"</noframes\s*>").search)
+
+ crack_noframes = NoframesStripper().analyze
  # Scan HTML for constructs often seen in viruses and worms.
***************
*** 1392,1396 ****
crack_urls,
crack_html_style,
! crack_html_comment):
text, tokens = cracker(text)
for t in tokens:
--- 1400,1405 ----
crack_urls,
crack_html_style,
! crack_html_comment,
! crack_noframes):
text, tokens = cracker(text)
for t in tokens: