character encoding conversion

"Martin v. Löwis" martin at v.loewis.de
Mon Dec 13 17:59:51 EST 2004


Christian Ergh wrote:
> Once more, indention should be correct now, and the 128 is gone too. So, 
> something like this?

Yes, something like this. The tricky part is, of course, the
fragments which you didn't implement.

Also, it might be possible to do this in a for loop, e.g.

for encoding in (pageencoding, xmlencoding, htmlmetaencoding,
                 "UTF-8", "Latin-1-no-controls", "cp1252", "Latin-1"):
    try:
        # Decode the raw bytes; the first encoding that succeeds wins.
        data = data.decode(encoding)
        break
    except UnicodeError:
        pass

You then just need to add the Latin-1-no-controls codec, or you need
to special-case this in the loop.
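
The special-casing variant might look roughly like the sketch below
(Python 3). It assumes that "Latin-1-no-controls" means "Latin-1, but
reject the data if it contains any of the C1 control characters
0x80-0x9F"; that reading is an assumption, as is the helper name
decode_with_fallbacks. Candidates that are empty or unknown to Python
are skipped rather than treated as fatal.

```python
def decode_with_fallbacks(data, candidates):
    """Return (text, encoding) for the first candidate that decodes data."""
    for encoding in candidates:
        if not encoding:
            continue  # e.g. the page declared no charset at all
        try:
            if encoding == "Latin-1-no-controls":
                # Hypothetical pseudo-codec: Latin-1 without C1 controls.
                text = data.decode("latin-1")
                if any(0x80 <= ord(ch) <= 0x9F for ch in text):
                    continue
            else:
                text = data.decode(encoding)
        except (UnicodeError, LookupError):
            continue  # wrong guess, or a codec name Python does not know
        return text, encoding
    raise UnicodeError("none of the candidate encodings matched")
```

Since plain Latin-1 accepts every byte sequence, keeping it as the last
candidate means the final raise is only reached when the list has no
such catch-all.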

> # if it is not in the pagecode, how do i get the encoding of the page?
> pageencoding = '???'

You need to remember the HTTP connection that you got the HTML file
from. The webserver may have sent a Content-Type header.
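
If you have the raw header value at hand, one way to pull the charset
out of it is the standard library's email.message.Message class, which
already knows how to parse MIME parameters. This is only a sketch; the
function name charset_from_content_type is made up for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not header_value:
        return None  # the server sent no Content-Type header
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and strips any quotes.
    return msg.get_content_charset()
```

For example, charset_from_content_type("text/html; charset=ISO-8859-1")
gives "iso-8859-1", and a header without a charset parameter gives None,
in which case the loop has to fall back to the other candidates.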

> xmlencoding  = 'whatever i parsed out of the file'
> htmlmetaencoding = 'whatever i parsed out of the metatag'

Depending on the library you use, these aren't that trivial, either.
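
Without a parsing library, a rough approximation is to search the first
few kilobytes of the raw bytes with regular expressions. The sketch
below assumes the document is in an ASCII-compatible encoding (so it
will not work for UTF-16, for instance) and does not try to cope with
every way a declaration can be written; both function names and the
byte limits are invented for illustration.

```python
import re

# <?xml version="1.0" encoding="..."?> at the very start of the data
_XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')
# <meta ... charset=...>, covering both the http-equiv and short forms
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)


def xml_declared_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    match = _XML_DECL.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def meta_declared_encoding(data):
    """Return the charset named in an HTML meta tag, or None."""
    match = _META_CHARSET.search(data[:4096])
    return match.group(1).decode("ascii") if match else None
```

Either function may return None, so the loop should skip empty
candidates instead of passing them to decode().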

Regards,
Martin


