Handling some isolated iso-8859-1 characters

Tue Jun 3 22:10:38 EDT 2008

On Jun 4, 2:38 am, Daniel Mahoney <d... at catfolks.net> wrote:
> I'm working on an app that's processing Usenet messages. I'm making a
> connection to my NNTP feed and grabbing the headers for the groups I'm
> interested in, saving the info to disk, and doing some post-processing.
> I'm finding a few bizarre characters and I'm not sure how to handle them
> pythonically.
>
> One of the lines I'm finding this problem with contains:
> 137050  Cleo and I have an anouncement!   "Mlle. =?iso-8859-1?Q?Ana=EFs?="
> <n... at aol.com>  Sun, 21 Nov 2004 16:21:50 -0500
> <lmzdkqmqt2fj.54wmpv3zmvvx.... at 40tude.net>              4478    69 Xref:
> sn-us rec.pets.cats.community:137050
>
> The interesting patch is the string that reads "=?iso-8859-1?Q?Ana=EFs?=".
> An HTML rendering of what this string should look like would be "Anaïs".
>
> What I'm doing now is a brute-force substitution from the version in the
> file to the HTML version. That's ugly. What's a better way to translate
> that string? Or is my problem that I'm grabbing the headers from the NNTP
> server incorrectly?

>>> from email.Header import decode_header
>>> decode_header("=?iso-8859-1?Q?Ana=EFs?=")
[('Ana\xefs', 'iso-8859-1')]
>>> (s, e), = decode_header("=?iso-8859-1?Q?Ana=EFs?=")
>>> s
'Ana\xefs'
>>> e
'iso-8859-1'
>>> s.decode(e)
u'Ana\xefs'
>>> import unicodedata
>>> import htmlentitydefs
>>> for c in s.decode(e):
...     print ord(c), unicodedata.name(c)
...
65 LATIN CAPITAL LETTER A
110 LATIN SMALL LETTER N
97 LATIN SMALL LETTER A
239 LATIN SMALL LETTER I WITH DIAERESIS
115 LATIN SMALL LETTER S
>>> htmlentitydefs.codepoint2name[239]
'iuml'
>>>
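
The "=?iso-8859-1?Q?...?=" form is an RFC 2047 encoded-word, so the headers
are arriving from the server as expected and only need decoding. The session
above is Python 2. As a rough sketch of the same steps on Python 3, where the
modules are named email.header and html.entities, something like the
following should work. The helper names decode_mime_header and
to_html_entities are made up for illustration, and falling back to ASCII for
chunks with no declared charset is an assumption, not something the original
session does.

```python
from email.header import decode_header
from html.entities import codepoint2name


def decode_mime_header(raw):
    """Decode RFC 2047 encoded-words in a header value into a str.

    Hypothetical helper: chunks with no declared charset are assumed
    to be ASCII, and undecodable bytes are replaced rather than raised.
    """
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # decode_header returns str unchanged when nothing was encoded
            parts.append(chunk)
    return "".join(parts)


def to_html_entities(text):
    """Replace non-ASCII characters with HTML entities.

    Uses a named entity when one exists, otherwise a numeric reference.
    ASCII characters, including & and <, are passed through untouched.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append("&#%d;" % cp)
    return "".join(out)


name = decode_mime_header("=?iso-8859-1?Q?Ana=EFs?=")
html_name = to_html_entities(name)
print(name)
print(html_name)
```

If the text is going into a real HTML page, the markup characters still need
escaping separately (html.escape does that), since this sketch only touches
non-ASCII characters.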