Problem with accented characters in mailbox.Maildir()

Tue May 9 18:27:20 EDT 2023

On 2023-05-08 23:02:18 +0200, jak wrote:
> Peter J. Holzer ha scritto:
> > On 2023-05-06 16:27:04 +0200, jak wrote:
> > > Chris Green ha scritto:
> > > > Chris Green <cl at isbd.net> wrote:
> > > > > A bit more information, msg.get("subject", "unknown") does return a
> > > > > string, as follows:-
> > > > > 
> > > > >       Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
> > [...]
> > > > ... and of course I now see the issue!  The Subject: with utf-8
> > > > characters in it gets spaces changed to underscores.  So searching for
> > > > '(Waterways Continental Europe)' fails.
> > > > 
> > > > I'll either need to test for both versions of the string or I'll need
> > > > to change underscores to spaces in the Subject: returned by msg.get().
[...]
> > > 
> > > subj = email.header.decode_header(raw_subj)[0]
> > > 
> > > subj[0].decode(subj[1])
[...]
> > email.header.decode_header returns a *list* of chunks and you have to
> > process and concatenate all of them.
> > 
> > Here is a snippet from a mail to html converter I wrote a few years ago:
> > 
> > def decode_rfc2047(s):
> >      if s is None:
> >          return None
> >      r = ""
> >      for chunk in email.header.decode_header(s):
[...]
> >                  r += chunk[0].decode(chunk[1])
[...]
> >      return r
[...]
> > 
> > I do have to say that Python is extraordinarily clumsy in this regard.
> 
> Thanks for the reply. In fact, I gave that answer because I did
> not understand what the OP wanted to achieve. In addition, the
> OP opened a second thread on the similar topic in which I gave a
> more correct answer (subject: "What do these '=?utf-8?' sequences
> mean in python?", date: "Sat, 6 May 2023 14:50:40 UTC").

Right. I saw that after writing my reply. I should have read all
messages, not just that thread before replying.

> the OP, I discovered that the MAME is not the only format used
> to compose the subject.

Not sure what "MAME" is. If it's a typo for MIME, then the base64
variant of RFC 2047 is just as much a part of it as the quoted-printable
variant.

> This made me think that a library could not delegate to the programmer
> the burden of managing all these exceptions,

email.header.decode_header handles both variants, but it produces bytes
sequences which still have to be decoded to get a Python string.

> then I have further investigated to discover that the library also
> provides the conversion function beyond that of coding and this makes
> our labors vain:
> 
> ----------
> from email.header import decode_header, make_header
> 
> subject = make_header(decode_header( raw_subject )))
> ----------

Yup. I somehow missed that. That's a lot more convenient than calling
decode in a loop (or generator expression). Depending on what you want
to do with the subject you may have wrap that in a call to str(), but
it's still a one-liner.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <https://mail.python.org/pipermail/python-list/attachments/20230510/613ebbf2/attachment.sig>