Problem with accented characters in mailbox.Maildir()

Mon May 8 14:36:18 EDT 2023

On 2023-05-06 16:27:04 +0200, jak wrote:
> Chris Green wrote:
> > Chris Green <cl at isbd.net> wrote:
> > > A bit more information, msg.get("subject", "unknown") does return a
> > > string, as follows:-
> > > 
> > >      Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
[...]
> > ... and of course I now see the issue!  The Subject: with utf-8
> > characters in it gets spaces changed to underscores.  So searching for
> > '(Waterways Continental Europe)' fails.
> > 
> > I'll either need to test for both versions of the string or I'll need
> > to change underscores to spaces in the Subject: returned by msg.get().

You need to decode the Subject properly. Unfortunately the Python email
module doesn't do that for you automatically. But it does provide the
necessary tools. Don't roll your own unless you've read and understood
the relevant RFCs.
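
One of those tools, not mentioned elsewhere in this thread, is the newer
policy framework. The sketch below assumes the message can be parsed with
email.policy.default instead of the legacy default policy; raw_msg is a
made-up minimal message built around the Subject quoted above. Whether
this fits depends on how the messages are loaded from the Maildir.

```python
import email
import email.policy

# Hypothetical minimal message using the Subject line from this thread.
raw_msg = (
    "Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_"
    "(Waterways_Continental_Europe)?=\n"
    "\n"
    "body\n"
)

# Legacy (compat32) policy: the header comes back still RFC 2047 encoded.
legacy = email.message_from_string(raw_msg)
legacy_subject = legacy["Subject"]

# Modern policy: the header object is already decoded text.
modern = email.message_from_string(raw_msg, policy=email.policy.default)
modern_subject = str(modern["Subject"])

print(legacy_subject)
print(modern_subject)
```

With the decoded form, a plain substring search for
'(Waterways Continental Europe)' should match again.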

> 
> This is probably what you need:
> 
> import email.header
> 
> raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='
> 
> subj = email.header.decode_header(raw_subj)[0]
> 
> subj[0].decode(subj[1])
> 
> 'aka Marne à la Saône (Waterways Continental Europe)'

You are on the right track, but that works only because the example
consists of a single encoded word. This is not always the case (and
indeed not what the RFC recommends).

email.header.decode_header returns a *list* of chunks and you have to
process and concatenate all of them.
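
To illustrate, here is a small sketch with a made-up header (not from the
original mail) that mixes plain text with two encoded words in different
charsets. Taking only element [0] would lose most of it. Decoding the
unencoded chunks as ASCII is an assumption that holds for this well-formed
example but not necessarily for real-world mail.

```python
import email.header

# Hypothetical header: plain text mixed with two encoded words
# in different charsets.
raw = "Re: =?utf-8?Q?Sa=C3=B4ne?= and =?iso-8859-1?Q?Rh=F4ne?= trip"

# decode_header returns a list of (data, charset) pairs.
chunks = email.header.decode_header(raw)
for data, charset in chunks:
    print(repr(data), charset)

# Decode every chunk and join them. Unencoded chunks have charset None;
# treating them as ASCII is an assumption made for this example.
decoded = "".join(
    data.decode(charset or "ascii") if isinstance(data, bytes) else data
    for data, charset in chunks
)
print(decoded)
```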

Here is a snippet from a mail to html converter I wrote a few years ago:

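import email.header

def decode_rfc2047(s):
    """Decode an RFC 2047 encoded header value into a str."""
    if s is None:
        return None
    r = ""
    # decode_header yields (data, charset) pairs; decode and join them all.
    for data, charset in email.header.decode_header(s):
        if charset:
            try:
                r += data.decode(charset)
            except (LookupError, UnicodeDecodeError):
                # Unknown charset or mislabelled bytes: guess windows-1252.
                r += data.decode("windows-1252")
        elif isinstance(data, bytes):
            # Unencoded part of a header that also contains encoded words.
            r += data.decode("us-ascii")
        else:
            # A header without any encoded words comes back as a plain str.
            r += data
    return r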

(this is maybe a bit more forgiving than the OP needs, but I had to deal
with malformed mails)
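
For well-formed mail, a shorter route is probably enough: the standard
library's email.header.make_header accepts the list returned by
decode_header, and str() of the result gives the decoded text. The helper
name decode_with_make_header is made up for this sketch. Note that it does
not include the windows-1252 fallback from the function above, so an
unknown or mislabelled charset will raise instead of being guessed.

```python
import email.header

def decode_with_make_header(raw):
    """Hypothetical helper: decode a raw header value via make_header()."""
    chunks = email.header.decode_header(raw)
    return str(email.header.make_header(chunks))

# The Subject value quoted earlier in this thread.
raw_subj = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
subject = decode_with_make_header(raw_subj)
print(subject)
```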

I do have to say that Python is extraordinarily clumsy in this regard.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"