Another 2 to 3 mail encoding problem

Thu Aug 27 04:34:47 EDT 2020

Peter J. Holzer <hjp-python at hjp.at> wrote:
> The problem is that the message contains a '\ufeff' character (byte
> order mark) where email/generator.py expects only ASCII characters.
> 
> I see two possible reasons for this:
> 
>  * The mbox writing code assumes that all messages with non-ascii
>    characters are QP or base64 encoded, and some higher layer uses 8bit
>    instead.
> 
>  * A MIME part is declared as charset=us-ascii but actually contains
>    non-ASCII characters.
> 
> Both reasons are weird.
> 
> The first would be an unreasonable assumption (8bit encoding has been
> common since the mid-1990s), but even if the code made that assumption,
> one would expect that other code from the same library honors it.
> 
> The second shouldn't be possible: If a message is mis-declared (that
> happens) one would expect that the error happens during parsing, not
> when trying to serialize the already parsed message. 
> 
> But then you haven't shown where msg comes from. How do you parse the
> message to get "msg"?
> 
> Can you construct a minimal test message which triggers the bug?
> 
Yes, simply sending myself an E-Mail with (for example) accented
characters triggers the error.

I'm pretty certain my system (and E-Mail in and out, and Usenet news)
handles these correctly as UTF-8.  E.g.:

    àéçł

It's *only* when I switch the mail delivery to Python 3 that the error
appears.
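
For anyone wanting to reproduce this, here is a minimal sketch of one way
the difference can arise. It assumes the delivery script parses the
incoming message as text (str) rather than as bytes; the actual script is
not shown in this thread, so that is only a guess. The addresses and the
test message are made up for illustration.

```python
import email

# Hypothetical test message: a UTF-8 body sent with 8bit transfer encoding.
body = "àéçł\n"
raw = (
    b"From: sender@example.com\n"
    b"To: recipient@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + body.encode("utf-8")

# Parsing the raw bytes lets the email package write the 8bit body
# back out unchanged when the message is serialized to bytes.
round_tripped = email.message_from_bytes(raw).as_bytes()

# Parsing an already-decoded str leaves real non-ASCII characters in the
# payload; serializing to bytes may then fail in email/generator.py.
try:
    email.message_from_string(raw.decode("utf-8")).as_bytes()
    string_error = None
except UnicodeEncodeError as exc:
    string_error = exc

print("body survived bytes round trip:", body.encode("utf-8") in round_tripped)
print("error from str-parsed message:", string_error)
```

If that is the cause, reading the message from sys.stdin.buffer and
parsing it with email.message_from_bytes() would be the thing to try.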

-- 
Chris Green