mailbox misbehavior with non-ASCII

Fri Jul 29 20:53:09 EDT 2022

On 2022-07-29 at 23:24:57 +0000,
Peter Pearson <pkpearson at nowhere.invalid> wrote:

> The following code produces a nonsense result with the input 
> described below:
> 
> import mailbox
> box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
> x = box.values()[0]
> h = x.get("X-DSPAM-Factors")
> print(type(h))
> # <class 'email.header.Header'>
> 
> The output is the desired "str" when the message file contains this:
> 
> To: recipient at example.com
> Message-ID: <123>
> Date: Sun, 24 Jul 2022 15:31:19 +0000
> Subject: Blah blah
> From: from at from.com
> X-DSPAM-Factors: a'b
> 
> xxx
> 
> ... but if the apostrophe in "a'b" is replaced with a
> RIGHT SINGLE QUOTATION MARK, the returned h is of type 
> "email.header.Header", and seems to contain inscrutable garbage.
> 
> I realize that one should not put non-ASCII characters in
> message headers, but of course I didn't put it there, it
> just showed up, pretty much beyond my control.  And I realize
> that when software is given input that breaks the rules, one
> cannot expect optimal results, but I'd think an exception
> would be the right answer.

Be strict in what you send, but generous in what you receive.

I agree that email headers are supposed to be ASCII (RFCs 822, 2822, and
now 5322 all say that), but always throwing an exception seems a little
harsh, and arguably (I'm not arguing for or against) breaks backwards
compatibility.  At least let the exception contain, in its own
attribute, the inscrutable garbage after the space after the colon and
before the next CR/LF pair.
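
For what it's worth, the raw bytes do seem to be recoverable from the
Header object as things stand.  Here is a minimal sketch, not a
definitive fix.  It assumes the stray bytes are UTF-8 (a guess; the
message doesn't declare it), and it parses a made-up sample with
email.message_from_bytes rather than going through your Maildir.  The
header_text helper and its fallback parameter are my own invention, not
part of the library.

```python
import email
from email.header import Header, decode_header

# Hypothetical sample modelled on the message quoted above, with a raw
# UTF-8 RIGHT SINGLE QUOTATION MARK (bytes E2 80 99) in the header.
raw = (
    b"To: recipient@example.com\n"
    b"Subject: Blah blah\n"
    b"X-DSPAM-Factors: a\xe2\x80\x99b\n"
    b"\n"
    b"xxx\n"
)

msg = email.message_from_bytes(raw)  # uses the default compat32 policy
h = msg.get("X-DSPAM-Factors")       # a Header here, not a str


def header_text(value, fallback="utf-8"):
    """Return a header value as str, decoding raw 8-bit bytes leniently.

    'fallback' is an assumed encoding for bytes of undeclared charset.
    """
    if not isinstance(value, Header):
        return value  # plain ASCII headers already come back as str
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            codec = fallback if charset in (None, "unknown-8bit") else charset
            chunk = chunk.decode(codec, errors="replace")
        parts.append(chunk)
    return "".join(parts)


text = header_text(h)
print(type(h).__name__, repr(text))
```

Decoding with errors="replace" is the generous-in-what-you-receive
choice: bytes that aren't valid in the guessed encoding turn into
U+FFFD instead of raising.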

> Is this worth a bug report?

If nothing else, the documentation could specify or disclaim the
existing behavior.