How do I decode unicode characters in the subject using email.message_from_string()?

rdmurray at bitdance.com rdmurray at bitdance.com
Wed Feb 25 08:39:58 EST 2009


John Machin <sjmachin at lexicon.net> wrote:
> On Feb 25, 11:07 am, "Roy H. Han" <starsareblueandfara... at gmail.com>
> wrote:
> > Dear python-list,
> >
> > I'm having some trouble decoding an email header using the standard
> > imaplib.IMAP4 class and email.message_from_string method.
> >
> > In particular, email.message_from_string() does not seem to properly
> > decode unicode characters in the subject.
> >
> > How do I decode unicode characters in the subject?
> 
> You don't. You can't. You decode str objects into unicode objects. You
> encode unicode objects into str objects. If your input is not a str
> object, you have a problem.

I can't speak for the OP, but I had a similar (and possibly
identical-in-intent) question.  Suppose you have a Subject line that
looks like this:

    Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

How do you get the email module to decode that into unicode?  The same
question applies to the other header lines.  The answer is that it isn't
easy: I had to read and reread the docs and experiment for a while
to figure it out.  I understand there's going to be a sprint on the
email module at PyCon, so maybe some of this will get improved then.

Here's the final version of my test program.  The third-to-last line is
one I thought ought to work, given that Header has a __unicode__ method.
The final line is the one that did work (note the kludge to turn None
into 'ascii'...IMO 'ascii' is what decode_header _should_ be returning,
and this code shows why!)

-------------------------------------------------------------------
from email import message_from_string
from email.header import Header, decode_header

x = message_from_string("""\
To: test
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

this is a test.
""")

print x
print "--------------------"
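for key, header in x.items():
    # Header values come back as raw (still encoded) str objects.
    print key, 'type', type(header)
    # Attempt 1: wrapping in Header does not decode the encoded words.
    print key+":", unicode(Header(header)).decode('utf-8')
    # decode_header returns (string, charset) pairs; charset is None for plain text.
    print key+":", decode_header(header)
    # What worked: decode each part, treating a None charset as 'ascii'.
    print key+":", ''.join([s.decode(t or 'ascii') for (s, t) in decode_header(header)]).encode('utf-8')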
-------------------------------------------------------------------


    From nobody Wed Feb 25 08:35:29 2009
    To: test
    Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=
            =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

    this is a test.

    --------------------
    To type <type 'str'>
    To: test
    To: [('test', None)]
    To: test
    Subject type <type 'str'>
    Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
    Subject: [("'u' Obselete type", None), ("-- it is identical to 'd'. (7)", 'iso-8859-1')]
    Subject: 'u' Obselete type-- it is identical to 'd'. (7)
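
The decode-and-join idiom from the last line of the program can be wrapped
in a small helper.  This is only a sketch: the name header_to_unicode and
the default_charset parameter are made up for illustration and are not part
of the email module.  It treats a charset of None as 'ascii', as in the
kludge above.  The isinstance check is an assumption about newer Python
versions, where decode_header may hand back already-decoded text for a
header that contains no encoded words.

```python
from email.header import decode_header


def header_to_unicode(raw, default_charset='ascii'):
    """Return a raw header value as decoded text (hypothetical helper)."""
    parts = []
    for text, charset in decode_header(raw):
        # Byte strings still need decoding; a charset of None means the
        # part was plain text, so fall back to default_charset.
        if isinstance(text, bytes):
            text = text.decode(charset or default_charset, 'replace')
        parts.append(text)
    return ''.join(parts)


subject = ("'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?="
           "   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=")
decoded = header_to_unicode(subject)
```

The exact whitespace between the plain part and the encoded part of the
result may differ between Python versions, so don't rely on it.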


--RDM



