Another 2 to 3 mail encoding problem

Barry barry at barrys-emacs.org
Thu Aug 27 12:16:51 EDT 2020



> On 27 Aug 2020, at 10:40, Chris Green <cl at isbd.net> wrote:
> 
> Karsten Hilbert <Karsten.Hilbert at gmx.net> wrote:
>>> Terry Reedy <tjreedy at udel.edu> wrote:
>>>>> On 8/26/2020 11:10 AM, Chris Green wrote:
>>>>> 
>>>>>> I have a simple[ish] local mbox mail delivery module as follows:-
>>>>> ...
>>>>>> It has run faultlessly for many years under Python 2.  I've now
>>>>>> changed the calling program to Python 3 and while it handles most
>>>>>> E-Mail OK I have just got the following error:-
>>>>>> 
>>>>>>     Traceback (most recent call last):
>>>>>>       File "/home/chris/.mutt/bin/filter.py", line 102, in <module>
>>>>>>         mailLib.deliverMboxMsg(dest, msg, log)
>>>>> ...
>>>>>>       File "/usr/lib/python3.8/email/generator.py", line 406, in write
>>>>>>         self._fp.write(s.encode('ascii', 'surrogateescape'))
>>>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 4: ordinal not in range(128)

I would guess the fix is to do s.encode('utf-8').
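
A minimal sketch of the difference between the two codecs, reusing the character from the traceback. The sample string and the try_encode helper are made up for illustration. The failing call itself sits inside the standard library's email/generator.py, so this only shows why the ascii codec refuses the text and what utf-8 produces instead; it is not a drop-in patch.

```python
# Hypothetical sample: '\ufeff' sits at index 4, as in the traceback.
s = "abc \ufeff text"


def try_encode(text, codec):
    """Return the encoded bytes, or the UnicodeEncodeError if encoding fails."""
    try:
        # Same error handler as the failing call in email/generator.py.
        return text.encode(codec, "surrogateescape")
    except UnicodeEncodeError as exc:
        return exc


ascii_result = try_encode(s, "ascii")   # an exception object, not bytes
utf8_result = try_encode(s, "utf-8")    # bytes containing the UTF-8 form

print(type(ascii_result).__name__)
print(utf8_result)
```

The surrogateescape handler only helps with lone surrogates that came from undecodable bytes; a real code point above 127 still fails under the ascii codec.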

You might need to add a header to the email/MIME part to say that you are using utf-8.
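
As a rough illustration of such a header, the sketch below builds a message with the standard library's email.message.EmailMessage and declares the charset explicitly. The subject and body are invented, and this assumes the message is constructed in Python rather than read from an existing mailbox, which may not match the delivery code in question.

```python
from email.message import EmailMessage

# Hypothetical message containing the kind of characters from the thread.
msg = EmailMessage()
msg["Subject"] = "Accent test"
msg.set_content("It\u2019s caf\u00e9 time", charset="utf-8")

# The charset is recorded in the Content-Type header of the part.
print(msg["Content-Type"])
print(msg.get_content_charset())

# Serialising to bytes keeps the UTF-8 encoded body intact.
raw = msg.as_bytes()
```

With the charset declared, a receiving program has what it needs to decode the body bytes as UTF-8.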

If you do that, does your code work?

Barry


>>>>> 
>>>>> '\ufeff' is the Unicode byte-order mark.  It should not be present in an
>>>>> ascii-only 3.x string and would not normally be present in general
>>>>> unicode except in messages like this that talk about it.  Read about it,
>>>>> for instance, at
>>>>> https://en.wikipedia.org/wiki/Byte_order_mark
>>>>> 
>>>>> I would catch the error and print part or all of string s to see what is
>>>>> going on with this particular message.  Does it have other non-ascii chars?
>>>>> 
>>> I can provoke the error simply by sending myself an E-Mail with
>>> accented characters in it.  I'm pretty sure my Linux system is set up
>>> correctly for UTF8 characters, I certainly seem to be able to send and
>>> receive these to others and I even get to see messages in other
>>> scripts such as Arabic, Chinese, etc.
>>> 
>>> The code above works perfectly in Python 2 delivering messages with
>>> accented (and other extended) characters with no problems at all.
>>> Sending myself E-Mails with accented characters works OK with the code
>>> running under Python 2.
>>> 
>>> While an E-Mail body possibly *shouldn't* have non-ASCII characters in
>>> it one must be able to handle them without errors.  In fact haven't
>>> the RFCs changed such that the message body should be 8-bit clean?
>>> Anyway I think the Python 3 mail handling libraries need to be able to
>>> pass extended characters through without errors.
>> 
>> Well, '\ufeff' is not a *character* at all in much of any
>> sense of that word in unicode.
>> 
>> It's a marker. Whatever puts it into the stream is wrong. I guess the
>> best one can (and should) do is to catch the exception and dump
>> the offending stream somewhere binary-capable and pass on a notice. What
>> you are receiving there very much isn't a (well-formed) e-mail message.
>> 
>> I would then attempt to backwards-crawl the delivery chain to
>> find out where it came from.
>> 
> The error seems to occur with any non-7-bit-ASCII, e.g. my accented
> characters gave:-
> 
>   File "/usr/lib/python3.8/email/generator.py", line 406, in write
>     self._fp.write(s.encode('ascii', 'surrogateescape'))
> UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 34: ordinal not in range(128)
> 
> It just happened that the first example was an escape.
> 
> -- 
> Chris Green
> ·
> -- 
> https://mail.python.org/mailman/listinfo/python-list


