[Spambayes] splitndirs bug [need help]

Tim Peters tim@zope.com
Fri, 4 Oct 2002 12:56:02 -0400


[Richie Hindle]
> ...
> This is because mboxutils.py is opening the mailbox file in text mode,
> but the Python mailbox library uses tell() and seek() to navigate
> around the file, which is no good with text-mode files on Windows.

Not quite.  seek and tell are fine with text-mode Windows files, provided
you stick to what C guarantees about them with text mode files:  you can
seek to a position previously returned by tell(), but that's essentially
*all* that's defined.  In particular, trying to do arithmetic on text-mode
tell() results has no meaning, and Stephen found code doing

> a call exists to "self.fp.read(length)".  Now length is defined from
> self.stop - self.pos.

*That* makes no sense for text-mode files on Windows.  (BTW, good detective
work, Stephen!)
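
A minimal sketch of why that arithmetic breaks, not taken from the mailbox
module itself.  It assumes Python 3, where text mode translates \r\n to \n on
every platform, much as Windows text mode does; byte_and_text_lengths is a
hypothetical helper named only for this illustration.  The point is that the
number of bytes between two file positions need not match the number of
characters a text-mode read() hands back, so a length computed as
stop - pos is only meaningful in binary mode.

```python
import os
import tempfile


def byte_and_text_lengths(data: bytes):
    """Return (bytes stored on disk, characters seen by a text-mode read)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as fp:
            fp.write(data)
        # Binary mode: what you read is exactly what is on disk.
        with open(path, "rb") as fp:
            raw = fp.read()
        # Text mode (default newline handling): \r\n is collapsed to \n.
        with open(path, "r", encoding="latin-1") as fp:
            text = fp.read()
        return len(raw), len(text)
    finally:
        os.remove(path)


print(byte_and_text_lengths(b"From a\r\nbody\r\n"))  # (14, 12)
```

The two counts differ by one per \r\n pair, which is why a byte-offset
difference is the wrong length to pass to a text-mode read().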

> I've patched my mboxutils.py by changing the third-to-last line of
> mboxutils.py from:
>
>         fp = open(name)
>
> to
>
>         fp = open(name, "rb")
>
> and that seemed to fix it.

Yes!  Please check that in.  Besides the seek/tell business, opening a mail
archive in text mode under Windows is likely to truncate the data
prematurely if the archive contains any binary junk (the first instance of
chr(26), i.e. Ctrl-Z, is taken to mean EOF in Windows text mode).
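
To check whether a given archive is exposed to this, something along these
lines could be used.  It is only a sketch in Python 3: first_ctrl_z_offset is
a hypothetical helper, not part of spambayes or the mailbox module, and it
does not try to reproduce the truncation itself.  It just reads in binary
mode and reports where the first chr(26) byte sits, if there is one.

```python
import os
import tempfile


def first_ctrl_z_offset(path, chunk_size=65536):
    """Return the byte offset of the first chr(26) in the file, or -1."""
    offset = 0
    with open(path, "rb") as fp:  # binary mode: no byte is treated as EOF
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                return -1
            i = chunk.find(b"\x1a")
            if i != -1:
                return offset + i
            offset += len(chunk)


# Small self-contained demo on a throwaway file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"header\r\n\x1arest of the archive")
demo_offset = first_ctrl_z_offset(demo_path, chunk_size=4)
os.remove(demo_path)
print(demo_offset)  # 8
```

A result of -1 means the file contains no chr(26) at all.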

> I've been meaning to commit this, but I need to work out whether
> reading the '\r\n' line endings will break anything (Tim?)

I can't say for sure, but if it does I'll fix it.  Offhand, the only pieces
that *might* be vulnerable are regular expressions assuming plain \n line
endings, but it's unlikely they would fall into a trap here.  I normalized
all line endings to plain \n in my data, BTW:  before that, all my spam had
\r\n, and all my ham plain \n, and when experimenting with character n-grams
the mere fact of different line endings proved to be a killer strong clue!
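
One way to do that kind of normalization on bytes read in binary mode is
sketched below, in Python 3.  normalize_line_endings is a hypothetical name,
and this is not necessarily the code used on the data described above.
Treating a lone \r as a line ending is an extra assumption; drop the second
replace() if only \r\n should be converted.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Convert \\r\\n (and, by assumption, lone \\r) to plain \\n."""
    # Order matters: handle \r\n first so it does not become two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


print(normalize_line_endings(b"Subject: hi\r\n\r\nbody\n"))
# b'Subject: hi\n\nbody\n'
```

Running both corpora through the same function keeps the line-ending style
from leaking into the features as a clue.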

Bottom lines:  all mail files should always be opened in binary mode, and
spambayes code should never be sensitive to line endings.