[Spambayes] more date field mining

Skip Montanaro skip@pobox.com
Tue, 1 Oct 2002 01:10:00 -0500


I have now modified the Tokenizer class thus:

    class Tokenizer:

        date_hms_re = re.compile(r' (?P<hour>[0-9][0-9]):'
                                 r'(?P<minute>[0-9][0-9]):'
                                 r'(?P<second>[0-9][0-9]) ')

        date_formats = ("%a, %d %b %Y %H:%M:%S (%Z)",
                        "%a, %d %b %Y %H:%M:%S %Z",
                        "%d %b %Y %H:%M:%S (%Z)",
                        "%d %b %Y %H:%M:%S %Z")

        ...

        def tokenize_headers(self, msg):
            # Special tagging of header lines and MIME metadata.

            ...

            if options.mine_date_headers:
                for header in msg.get_all("date", ()):
                    mat = self.date_hms_re.search(header)
                    # return the time in Date: headers arranged in
                    # ten-minute buckets (six per hour)
                    if mat is not None:
                        h = int(mat.group('hour'))
                        bucket = int(mat.group('minute')) // 10
                        yield 'time:%02d:%d' % (h, bucket)

                    # extract the day of the week
                    for fmt in self.date_formats:
                        try:
                            timetuple = time.strptime(header, fmt)
                        except ValueError:
                            pass
                        else:
                            yield 'dow:%d' % timetuple[6]
                            # stop at the first format that parses, so the
                            # for-else below fires only when none did
                            break
                    else:
                        yield 'dow:invalid'
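
For anyone who wants to poke at the date mining outside the tokenizer, here is a minimal standalone sketch. The function name date_tokens and the module-level constants are made up for illustration and are not part of the Spambayes code; the regex and format strings are copied from the class above. It breaks out of the format loop on the first successful parse, so 'dow:invalid' is produced only when no format matches. One thing worth knowing: on current CPython, %Z matches timezone names such as GMT or UTC, not numeric offsets, so a header ending in "-0500" should fall through to 'dow:invalid' with these formats.

```python
import re
import time

# Same pattern and formats as in the Tokenizer excerpt.
DATE_HMS_RE = re.compile(r' (?P<hour>[0-9][0-9]):'
                         r'(?P<minute>[0-9][0-9]):'
                         r'(?P<second>[0-9][0-9]) ')

DATE_FORMATS = ("%a, %d %b %Y %H:%M:%S (%Z)",
                "%a, %d %b %Y %H:%M:%S %Z",
                "%d %b %Y %H:%M:%S (%Z)",
                "%d %b %Y %H:%M:%S %Z")


def date_tokens(header):
    """Yield time-of-day and day-of-week tokens for one Date: header value.

    Hypothetical helper, not the Spambayes API.
    """
    mat = DATE_HMS_RE.search(header)
    if mat is not None:
        # Ten-minute buckets: minutes 00-09 -> 0, ..., 50-59 -> 5.
        hour = int(mat.group('hour'))
        bucket = int(mat.group('minute')) // 10
        yield 'time:%02d:%d' % (hour, bucket)

    for fmt in DATE_FORMATS:
        try:
            timetuple = time.strptime(header, fmt)
        except ValueError:
            continue
        # Index 6 is tm_wday, where 0 means Monday.
        yield 'dow:%d' % timetuple[6]
        break
    else:
        # Reached only if no format parsed the header.
        yield 'dow:invalid'


if __name__ == '__main__':
    for hdr in ("Tue, 01 Oct 2002 01:10:00 GMT",
                "Tue, 1 Oct 2002 01:10:00 -0500",
                "not a date"):
        print(hdr, '->', list(date_tokens(hdr)))
```

The day and month names are parsed with the current locale, so this assumes an English or C locale.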

Time of day and day of the week seem like they should be fairly independent
clues, so I should probably test them separately under two options.  Still,
here are my initial results using this coarser-grained scheme (one option
enabling both):

    cutoffs -> times
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    ...

    false positive percentages
        1.000  1.000  tied          
        1.500  1.500  tied          
        1.000  1.000  tied          
        1.000  1.500  lost   +50.00%
        1.000  1.000  tied          
        1.500  1.500  tied          
        3.500  3.500  tied          
        1.500  1.500  tied          
        1.500  1.500  tied          
        1.500  2.000  lost   +33.33%

    won   0 times
    tied  8 times
    lost  2 times

    total unique fp went from 30 to 32 lost    +6.67%
    mean fp % went from 1.5 to 1.6 lost    +6.67%

    false negative percentages
        0.500  0.500  tied          
        1.500  1.500  tied          
        0.500  0.500  tied          
        0.500  0.500  tied          
        2.000  2.000  tied          
        0.000  0.000  tied          
        1.000  1.500  lost   +50.00%
        1.000  1.000  tied          
        0.000  0.000  tied          
        1.500  1.500  tied          

    won   0 times
    tied  9 times
    lost  1 times

    total unique fn went from 17 to 18 lost    +5.88%
    mean fn % went from 0.85 to 0.9 lost    +5.88%

    ham mean                     ham sdev
      20.82   21.05   +1.10%        6.43    6.47   +0.62%
      21.86   22.00   +0.64%        6.63    6.61   -0.30%
      21.38   21.56   +0.84%        6.49    6.57   +1.23%
      21.96   22.13   +0.77%        6.26    6.27   +0.16%
      21.51   21.73   +1.02%        6.72    6.73   +0.15%
      21.66   21.88   +1.02%        6.98    7.01   +0.43%
      21.45   21.62   +0.79%        7.66    7.59   -0.91%
      21.74   21.93   +0.87%        6.69    6.67   -0.30%
      21.71   21.88   +0.78%        7.44    7.43   -0.13%
      21.87   22.01   +0.64%        5.93    5.93   +0.00%

    ham mean and sdev for all runs
      21.60   21.78   +0.83%        6.75    6.75   +0.00%

    spam mean                    spam sdev
      74.10   73.79   -0.42%       12.99   12.71   -2.16%
      72.47   72.11   -0.50%       13.92   13.63   -2.08%
      74.05   73.75   -0.41%       13.00   12.80   -1.54%
      74.00   73.68   -0.43%       12.27   12.03   -1.96%
      72.43   72.06   -0.51%       13.73   13.33   -2.91%
      72.68   72.35   -0.45%       13.27   13.04   -1.73%
      72.57   72.29   -0.39%       13.03   12.84   -1.46%
      71.50   71.26   -0.34%       12.12   11.95   -1.40%
      73.25   72.92   -0.45%       12.67   12.39   -2.21%
      73.02   72.73   -0.40%       12.44   12.24   -1.61%

    spam mean and sdev for all runs
      73.01   72.69   -0.44%       12.98   12.73   -1.93%

    ham/spam mean difference: 51.41 50.91 -0.50
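
To read the percentage columns above: they look like plain relative changes, (after - before) / before, while the final ham/spam line is an absolute difference of means. The sketch below only checks that reading against a few of the numbers in the output; it is an assumption about how the comparison tool computes them, not a copy of its code, and the helper name pct_change is made up.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


if __name__ == '__main__':
    # (before, after) pairs taken from the output above.
    for label, before, after in (("total unique fp", 30, 32),
                                 ("total unique fn", 17, 18),
                                 ("ham mean", 21.60, 21.78),
                                 ("spam mean", 73.01, 72.69),
                                 ("spam sdev", 12.98, 12.73)):
        print('%-16s %+.2f%%' % (label, pct_change(before, after)))

    # The last line of the output is a difference, not a ratio.
    print('mean difference  %.2f -> %.2f' % (73.01 - 21.60, 72.69 - 21.78))
```

Read this way, the new tokens nudged the ham mean up and the spam mean down, narrowing the gap between the two by 0.50.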

I'll try it with a finer-grained set of options tomorrow after a little
snooze.

Skip