[Mailman-Developers] Splitting long header lines and the "bug demonstration"

Fri, 28 Jun 2002 19:36:11 -0400

Earlier this month (seems like years ago ;), I said:

    BAW> I know what is causing
    BAW> http://mail.python.org/pipermail-21/mailman-developers/2002-June/012093.html

    BAW> It's the email package, and specifically it's Generator's
    BAW> behavior when splitting long lines,
    BAW> i.e. Generator._split_header().

    BAW> I actually think that method is terminally broken because it
    BAW> has an ASCII bias.  I doubt it would even work for long lines
    BAW> of encoded text.

    BAW> The good news is that we have a perfectly fine line splitter
    BAW> that understands encoded headers and does the RFC-correct
    BAW> thing.  It's called Header.encode().

    BAW> The bad news is that Header.encode() isn't parameter aware
    BAW> and making _split_header() call it will cause some messages
    BAW> to not generate idempotently.  If your code is expecting the
    BAW> splitting done by _split_header(), your code will break.

    BAW> Interestingly enough, only 4 unit tests fail when I make this
    BAW> change.

I've worked on moving the logic of Generator._split_header() into
Header.encode(), where it is used when the charset is None or
'us-ascii'.  In that case, we split on the highest syntactic boundary
we know about, which means semicolons first and then spaces.  I don't
handle the other syntax found in other types of structured headers.
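
To make the "highest syntactic boundary" idea concrete, here is a rough
standalone sketch.  It is not the code going into the email package;
the function name split_ascii(), its maxlen default, and the splitchars
tuple are all made up for illustration.  It tries semicolons first,
falls back to spaces for any piece that is still too long, and leaves a
piece alone if it contains neither.

```python
def split_ascii(value, maxlen=76, splitchars=(";", " ")):
    """Split value into lines, preferring the highest boundary first.

    Hypothetical helper for illustration only; not the email package API.
    Returns a list of lines with surrounding whitespace stripped.
    """
    value = value.strip()
    if len(value) <= maxlen or not splitchars:
        # Short enough, or no boundaries left to try: leave it unsplit.
        return [value]
    sep, lower = splitchars[0], splitchars[1:]
    if sep not in value:
        # This boundary does not occur; try the next lower one.
        return split_ascii(value, maxlen, lower)
    tokens = value.split(sep)
    lines = []
    current = tokens[0]
    for token in tokens[1:]:
        candidate = current + sep + token
        # Leave room for a visible separator (";") at the end of the line.
        if len(candidate.strip()) + len(sep.strip()) <= maxlen:
            current = candidate
        else:
            lines.append(current + sep.strip())
            current = token
    lines.append(current)
    # Any line that is still too long gets split on the lower boundaries.
    result = []
    for line in lines:
        result.extend(split_ascii(line, maxlen, lower))
    return result
```

A caller would join the returned lines with a newline plus a space to
produce the folded header.  Note that a long run with no semicolons or
spaces comes back as a single over-long line.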

Even this might not be completely correct since Header (still) doesn't
know about parameters.  In the long term, it probably makes sense to
move parameter parsing from the Message class to the Header class, use
that for syntactic splitting, and change Message to use Header
instances.  But that's more work than I have time for now.

It may make sense to use the "ascii splitter" on other types of
character sets, but I'm not going to think about that right now. :)

Last time, Steve Turnbull commented:

    SJT> If by "split" you mean RFC 2822 2.2.3 header folding, you
    SJT> can't split "spooge" like that, can you?  There's no
    SJT> whitespace between the `o's.

And now, if a line has no semicolons or spaces, it doesn't get split.
This could cause us to violate RFC 2822's hard maximum of 998
characters in a line, but I'm not going to worry about this.  Some day
we might want a verification stage, or raise an exception, etc.
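
For what it's worth, a verification stage could be as small as the
sketch below.  Nothing like this exists in the package today; the
function name verify_folded_header() and the choice of ValueError are
only assumptions to show the shape of the check.  The 998 figure is the
per-line limit from RFC 2822, not counting the CRLF.

```python
RFC2822_HARD_LIMIT = 998  # characters per line, excluding the CRLF


def verify_folded_header(folded, limit=RFC2822_HARD_LIMIT):
    """Raise ValueError if any line of an already folded header is too long.

    Hypothetical check for illustration; not part of the email package.
    Returns the input unchanged when every line is within the limit.
    """
    for lineno, line in enumerate(folded.splitlines(), 1):
        if len(line) > limit:
            raise ValueError(
                "header line %d is %d characters long (limit %d)"
                % (lineno, len(line), limit)
            )
    return folded
```

Whether to raise, warn, or force a split somewhere is an open design
question; raising is just the simplest thing to show here.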

I think I have updated test cases to cover all this, but I haven't yet
tested it in MM2.1, so I may find that I'm missing something.

Finally, some of the Japanese test cases broke.  Since there's no way
for me to know whether they're still correct, I simply changed the
tests.  But I'd appreciate it if someone with Japanese and RFC 2047
expertise could take a closer look.

Commits going into cvs momentarily.
-Barry