How do I decode unicode characters in the subject using email.message_from_string()?

Wed Feb 25 12:44:18 EST 2009

Steve Holden <steve at holdenweb.com> wrote:
> rdmurray at bitdance.com wrote:
> > Steve Holden <steve at holdenweb.com> wrote:
> >> >>> from email.header import decode_header
> >> >>> print decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
> >> [('Inteum C/SR User Tip:  Quick Access to Recently Opened Inteum C/SR Records', 'us-ascii')]
> > 
> > It is interesting that decode_header does what I would consider to be
> > the right thing (from a pragmatic standpoint) with that particular bit
> > of Microsoft not-quite-standards-compliant brain-damage; but, removing
> > the tab is not in fact standards compliant if I'm reading the RFC
> > correctly.
> > 
> You'd need to quote me chapter and verse on that. I understood that the
> tab simply indicated continuation, but it's a *long* time since I read
> the RFCs.

RFC 2822 mentions tab only to say that it is a valid whitespace
character.  Header folding (insertion of <cr><lf>) can occur in most
places where whitespace appears, and is defined in section 2.2.3
as follows:

   Each header field is logically a single line of characters comprising
   the field name, the colon, and the field body.  For convenience
   however, and to deal with the 998/78 character limitations per line,
   the field body portion of a header field can be split into a multiple
   line representation; this is called "folding".  The general rule is
   that wherever this standard allows for folding white space (not
   simply WSP characters), a CRLF may be inserted before any WSP.  For
   example, the header field:

           Subject: This is a test

   can be represented as:

           Subject: This
            is a test

   [irrelevant note elided]

   The process of moving from this folded multiple-line representation
   of a header field to its single line representation is called
   "unfolding". Unfolding is accomplished by simply removing any CRLF
   that is immediately followed by WSP.  Each header field should be
   treated in its unfolded form for further syntactic and semantic
   evaluation.

So unfolding removes only the CRLF; the whitespace character that
followed it (here, the tab) is supposed to be left in place.
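
To make the rule concrete, here is a minimal sketch of unfolding as I
read the section quoted above.  The name unfold() is my own, not
something from the email package, and the second sample is a shortened
version of the Subject value from earlier in the thread.

```python
import re

# A CRLF immediately followed by WSP (space or tab).  The lookahead
# means only the CRLF is matched; the WSP itself is left in the string.
_FOLD = re.compile(r"\r\n(?=[ \t])")

def unfold(header_value):
    """Unfold per RFC 2822 section 2.2.3: drop each CRLF that precedes WSP."""
    return _FOLD.sub("", header_value)

# The RFC's own example: the space after the CRLF survives.
print(repr(unfold("Subject: This\r\n is a test")))

# Shortened form of the header from this thread: the tab survives too.
print(repr(unfold("=?us-ascii?Q?Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")))
```

The second call returns the two encoded words still separated by the
tab, which is the whitespace that decode_header dropped in Steve's
example.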

--David