[Spambayes-checkins] spambayes/spambayes tokenizer.py,1.12,1.13

Tim Peters tim_one at users.sourceforge.net
Fri Jun 27 19:51:59 EDT 2003


Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv29852/spambayes

Modified Files:
	tokenizer.py 
Log Message:
A new stripper to squash yet another way of hiding content in HTML spam,
like

    Ere<frame><noframes>ywl55</noframes></frame>ctions

to hide Erections.  I haven't seen this often yet, but (of course) it's
been effective so far.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** tokenizer.py	18 Jun 2003 15:08:59 -0000	1.12
--- tokenizer.py	28 Jun 2003 01:51:57 -0000	1.13
***************
*** 1011,1014 ****
--- 1011,1022 ----
  crack_html_comment = CommentStripper().analyze
  
+ # Nuke stuff between <noframes> </noframes> tags.
+ class NoframesStripper(Stripper):
+     def __init__(self):
+         Stripper.__init__(self,
+                           re.compile(r"<\s*noframes\s*>").search,
+                           re.compile(r"</noframes\s*>").search)
+ 
+ crack_noframes = NoframesStripper().analyze
  
  # Scan HTML for constructs often seen in viruses and worms.
***************
*** 1392,1396 ****
                              crack_urls,
                              crack_html_style,
!                             crack_html_comment):
                  text, tokens = cracker(text)
                  for t in tokens:
--- 1400,1405 ----
                              crack_urls,
                              crack_html_style,
!                             crack_html_comment,
!                             crack_noframes):
                  text, tokens = cracker(text)
                  for t in tokens:
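
To illustrate the effect of the new stripper on the example from the log message, here is a minimal standalone sketch. It reuses the two regular expressions from the diff, but the `Stripper` base class is not shown in this checkin, so the loop below is only an assumption about the behavior (retained pieces joined with no separator, an unclosed `<noframes>` left untouched). The names `strip_noframes` and `crude_tag_re` are hypothetical and are not part of tokenizer.py. Note that the patterns are case-sensitive as written, so the sketch assumes lowercase input.

```python
import re

# The same two patterns as in the diff; everything else is illustrative.
find_start = re.compile(r"<\s*noframes\s*>").search
find_end = re.compile(r"</noframes\s*>").search

# Hypothetical, deliberately crude tag remover, used only to show the
# end result; the real tokenizer's HTML handling is not shown here.
crude_tag_re = re.compile(r"<[^>]*>")


def strip_noframes(text):
    """Return text with every <noframes>...</noframes> span removed."""
    retained = []
    i = 0
    while True:
        m = find_start(text, i)
        if m is None:
            # No further opening tag: keep the remainder and stop.
            retained.append(text[i:])
            break
        # Keep everything before the opening tag.
        retained.append(text[i:m.start()])
        end = find_end(text, m.end())
        if end is None:
            # Assumption: without a closing tag, act as if the opening
            # tag were ordinary text and keep the rest unchanged.
            retained.append(text[m.start():])
            break
        # Resume scanning just past the closing tag.
        i = end.end()
    return "".join(retained)


sample = "Ere<frame><noframes>ywl55</noframes></frame>ctions"
stripped = strip_noframes(sample)
print(stripped)                        # Ere<frame></frame>ctions
print(crude_tag_re.sub("", stripped))  # Erections
```

Once the hidden filler is gone, removing the remaining tags rejoins the two halves of the word, which is what lets the tokenizer see it as a single token.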




