Problem with accented characters in mailbox.Maildir()

Mon May 8 17:02:18 EDT 2023

Peter J. Holzer wrote:
> On 2023-05-06 16:27:04 +0200, jak wrote:
>> Chris Green wrote:
>>> Chris Green <cl at isbd.net> wrote:
>>>> A bit more information, msg.get("subject", "unknown") does return a
>>>> string, as follows:-
>>>>
>>>>       Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
> [...]
>>> ... and of course I now see the issue!  The Subject: with utf-8
>>> characters in it gets spaces changed to underscores.  So searching for
>>> '(Waterways Continental Europe)' fails.
>>>
>>> I'll either need to test for both versions of the string or I'll need
>>> to change underscores to spaces in the Subject: returned by msg.get().
> 
> You need to decode the Subject properly. Unfortunately the Python email
> module doesn't do that for you automatically. But it does provide the
> necessary tools. Don't roll your own unless you've read and understood
> the relevant RFCs.
> 
>>
>> This is probably what you need:
>>
>> import email.header
>>
>> raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='
>>
>> subj = email.header.decode_header(raw_subj)[0]
>>
>> subj[0].decode(subj[1])
>>
>> 'aka Marne à la Saône (Waterways Continental Europe)'
> 
> You are on the right track, but that works only because the example
> consists of a single encoded word. This is not always the case (and
> indeed not what the RFC recommends).
> 
> email.header.decode_header returns a *list* of chunks and you have to
> process and concatenate all of them.
> 
> Here is a snippet from a mail to html converter I wrote a few years ago:
> 
> import email.header
> 
> def decode_rfc2047(s):
>      if s is None:
>          return None
>      r = ""
>      # decode_header() returns a list of (data, charset) chunks
>      for chunk in email.header.decode_header(s):
>          if chunk[1]:
>              try:
>                  r += chunk[0].decode(chunk[1])
>              except (LookupError, UnicodeDecodeError):
>                  # unknown or wrong charset: fall back to windows-1252
>                  r += chunk[0].decode("windows-1252")
>          elif isinstance(chunk[0], bytes):
>              r += chunk[0].decode('us-ascii')
>          else:
>              r += chunk[0]
>      return r
> 
> (this is maybe a bit more forgiving than the OP needs, but I had to deal
> with malformed mails)
> 
> I do have to say that Python is extraordinarily clumsy in this regard.
> 
>          hp
> 

Thanks for the reply. In fact, I gave that answer because I did
not understand what the OP wanted to achieve. In addition, the
OP opened a second thread on a similar topic, in which I gave a
more correct answer (subject: "What do these '=?utf-8?' sequences
mean in python?", date: "Sat, 6 May 2023 14:50:40 UTC").
I was interested in this thread because a few years ago I wrote a
program in C that emailed an application's log file whenever the
application crashed. I encoded the attachment as base64, but at
the time I did not know about RFC 2047, which governs how the
subject is encoded. In addition, while investigating the OP's
needs, I discovered that the 'Q' encoding is not the only one
used to compose the subject. I found an example in a thread from
some days ago where the subject contained Arabic text (sender:
"Uhrda education <Fatmaelhlwany9 at gmail.com>", date: "Wed, 03
May 2023 00:18:14 UTC"). This is the raw version of the subject:

=?UTF-8?B?2LTZh9in2K/YqSDYo9iu2LXYp9im2Yog2K7Yr9mF2Kkg2LnZhdmE2KfYoSDZhdi52KrZhQ==?=

=?UTF-8?B?2K8gI9in2YjZhtmE2KfZitmGINio2LHYs9mI2YUg2YXYrtmB2LbYqSDYrtmE2KfZhCDYtNmH2LEg2YU=?=

=?UTF-8?B?2KfZitmIMjAyMyDZhNmE2KfYs9iq2YHYs9in2LEg2YjYp9iq2LMgLyAwMDIwMTAwOTMwNjExMQ==?=

As you can see, the encoding letter in each encoded word is not
'Q' as in the OP's message but 'B', and the text is encoded as
base64. This made me think that a library could not leave to the
programmer the burden of handling all these cases, so I
investigated further and discovered that the library provides a
decoding function as well as the encoding one, which makes our
labors unnecessary:

----------
from email.header import decode_header, make_header

subject = make_header(decode_header(raw_subject))
----------

This line of code correctly decodes the OP's subject and also
the one with the Arabic text.
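
For anyone who wants to try it, here is a minimal sketch of that
approach. The helper name decode_subject is my own invention, not
part of the library, and the base64 subject is built on the fly
from a sample string rather than taken from a real mail. Note that
make_header() returns a Header object, so str() is needed to get
a plain string:

```python
import base64
from email.header import decode_header, make_header


def decode_subject(raw_subject):
    """Hypothetical helper: decode an RFC 2047 header value to str."""
    # decode_header() splits the value into (data, charset) chunks;
    # make_header() reassembles them into a Header object.
    return str(make_header(decode_header(raw_subject)))


# 'Q' encoded subject from the OP's message
q_subject = ("=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_"
             "(Waterways_Continental_Europe)?=")
print(decode_subject(q_subject))
# aka Marne à la Saône (Waterways Continental Europe)

# 'B' (base64) encoded subject, built here only for illustration
text = "Marne à la Saône"
payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
b_subject = "=?utf-8?B?" + payload + "?="
print(decode_subject(b_subject) == text)
```

The same call handles both encodings, so the caller does not need
to look at the 'Q' or 'B' letter at all.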

Kind regards.