Handling some isolated iso-8859-1 characters

Tue Jun 3 22:12:00 EDT 2008

On Tue, 03 Jun 2008 15:38:09 -0300, Daniel Mahoney <dan at catfolks.net>
wrote:

> I'm working on an app that's processing Usenet messages. I'm making a
> connection to my NNTP feed and grabbing the headers for the groups I'm
> interested in, saving the info to disk, and doing some post-processing.
> I'm finding a few bizarre characters and I'm not sure how to handle them
> pythonically.
>
> One of the lines I'm finding this problem with contains:
> 137050  Cleo and I have an anouncement!   "Mlle.  
> =?iso-8859-1?Q?Ana=EFs?="
> <not at aol.com>  Sun, 21 Nov 2004 16:21:50 -0500
> <lmzdkqmqt2fj.54wmpv3zmvvx.dlg at 40tude.net>              4478    69 Xref:
> sn-us rec.pets.cats.community:137050
>
> The interesting part is the string that reads
> "=?iso-8859-1?Q?Ana=EFs?=".
> An HTML rendering of what this string should look like would be "Anaïs".
>
> What I'm doing now is a brute-force substitution from the version in the
> file to the HTML version. That's ugly. What's a better way to translate
> that string? Or is my problem that I'm grabbing the headers from the NNTP
> server incorrectly?

No, it's not you; those headers are encoded following RFC 2047
<http://www.faqs.org/ftp/rfc/rfc2047.txt>.
Python already supports that format: use the decode_header function in
the email.header module; see
<http://docs.python.org/lib/module-email.header.html>
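
A minimal sketch of how that could look, assuming a current Python 3
interpreter (the original question predates it). The helper name
decode_rfc2047 is made up for illustration; only decode_header comes from
the standard library. It falls back to ASCII when a part declares no
charset, which is an assumption about the feed rather than something the
RFC guarantees.

```python
from email.header import decode_header


def decode_rfc2047(raw):
    """Return `raw` with any RFC 2047 encoded-words decoded to text.

    Hypothetical helper: decode_header splits the value into
    (value, charset) pairs, and each pair is decoded here.
    """
    parts = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            # Parts without a declared charset are assumed to be ASCII;
            # undecodable bytes are replaced instead of raising.
            value = value.decode(charset or "ascii", errors="replace")
        parts.append(value)
    return "".join(parts)


# The encoded word from the quoted header line.
name = decode_rfc2047("=?iso-8859-1?Q?Ana=EFs?=")  # "Anaïs"
```

Text without encoded words passes through unchanged, so the helper can be
applied to every header field before saving it to disk, and the HTML
escaping can then be done separately on the decoded text.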

-- 
Gabriel Genellina