How do I decode unicode characters in the subject using email.message_from_string()?
rdmurray at bitdance.com
Wed Feb 25 12:44:18 EST 2009
Steve Holden <steve at holdenweb.com> wrote:
> rdmurray at bitdance.com wrote:
> > Steve Holden <steve at holdenweb.com> wrote:
> >> >>> from email.header import decode_header
> >> >>> print decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
> >> [('Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records', 'us-ascii')]
> >
> > It is interesting that decode_header does what I would consider to be
> > the right thing (from a pragmatic standpoint) with that particular bit
> > of Microsoft not-quite-standards-compliant brain-damage; but, removing
> > the tab is not in fact standards compliant if I'm reading the RFC
> > correctly.
> >
> You'd need to quote me chapter and verse on that. I understood that the
> tab simply indicated continuation, but it's a *long* time since I read
> the RFCs.
Tab is not mentioned in RFC 2822 except to say that it is a valid
whitespace character. Header folding (insertion of <cr><lf>) can
occur most places whitespace appears, and is defined in section
2.2.3 thusly:

   Each header field is logically a single line of characters comprising
   the field name, the colon, and the field body.  For convenience
   however, and to deal with the 998/78 character limitations per line,
   the field body portion of a header field can be split into a multiple
   line representation; this is called "folding".  The general rule is
   that wherever this standard allows for folding white space (not
   simply WSP characters), a CRLF may be inserted before any WSP.  For
   example, the header field:

           Subject: This is a test

   can be represented as:

           Subject: This
            is a test

   [irrelevant note elided]

   The process of moving from this folded multiple-line representation
   of a header field to its single line representation is called
   "unfolding".  Unfolding is accomplished by simply removing any CRLF
   that is immediately followed by WSP.  Each header field should be
   treated in its unfolded form for further syntactic and semantic
   evaluation.

So, the whitespace characters are supposed to be left unchanged
after unfolding.
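
As a rough illustration of the unfolding rule quoted above, here is a
minimal sketch. The function name unfold() is made up for this example,
and this is not a claim about what the email package does internally; it
only removes each CRLF that is immediately followed by a space or tab and
leaves that whitespace character in place.

```python
import re

def unfold(raw_header):
    """Unfold a header as described in RFC 2822 section 2.2.3 (sketch).

    Hypothetical helper: deletes every CRLF that is immediately followed
    by WSP (space or tab). The WSP itself is kept, because the lookahead
    does not consume it.
    """
    return re.sub(r"\r\n(?=[ \t])", "", raw_header)

# The RFC's own example: a fold before a space.
folded_rfc = "Subject: This\r\n is a test"
unfolded_rfc = unfold(folded_rfc)

# The header from earlier in this thread: a fold before a tab.
folded_tab = ("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?="
              "\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
unfolded_tab = unfold(folded_tab)

print(repr(unfolded_rfc))
print(repr(unfolded_tab))
```

With this sketch the first case gives back "Subject: This is a test", and
in the second case the tab is still sitting between the two encoded words
after unfolding.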
--David