character encoding conversion
"Martin v. Löwis"
martin at v.loewis.de
Mon Dec 13 17:59:51 EST 2004
Christian Ergh wrote:
> Once more, indention should be correct now, and the 128 is gone too. So,
> something like this?
Yes, something like this. The tricky part is, of course, the
fragments which you didn't implement.
Also, it might be possible to do this in a for loop, e.g.
    for encoding in (pageencoding, xmlencoding, htmlmetaencoding,
                     "UTF-8", "Latin-1-no-controls", "cp1252", "Latin-1"):
        try:
            # decode the raw bytes; stop at the first encoding that fits
            data = data.decode(encoding)
            break
        except UnicodeError:
            pass
You then just need to add the Latin-1-no-controls codec, or you need
to special-case this in the loop.
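
A minimal sketch of the special-casing variant, assuming Python 3 and that data is a bytes object. The helper names decode_latin1_no_controls and guess_decode are hypothetical, and "Latin-1-no-controls" is treated as a marker string, not a registered codec. The idea assumed here is that bytes in the range 0x80-0x9F are control characters in Latin-1, so their presence suggests the data is really cp1252.

```python
import re

# Bytes 0x80-0x9F are C1 control characters in Latin-1; real Latin-1 text
# rarely contains them, while cp1252 uses that range for printable characters.
_C1_CONTROLS = re.compile(b"[\x80-\x9f]")


def decode_latin1_no_controls(data):
    """Hypothetical helper: decode as Latin-1 but reject C1 control bytes."""
    if _C1_CONTROLS.search(data):
        raise UnicodeError("C1 control bytes found; probably not Latin-1")
    return data.decode("latin-1")


def guess_decode(data, candidates):
    """Return (text, encoding) for the first candidate that decodes data."""
    for encoding in candidates:
        if not encoding:
            continue  # this source did not yield an encoding name
        try:
            if encoding == "Latin-1-no-controls":
                return decode_latin1_no_controls(data), encoding
            return data.decode(encoding), encoding
        except (UnicodeError, LookupError):
            # LookupError: the declared name is not a codec Python knows
            pass
    raise UnicodeError("no candidate encoding matched")
```

Since plain "Latin-1" accepts every byte sequence, keeping it last in the list means the loop always ends with some result; the earlier entries only decide how good the guess is.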
> # if it is not in the pagecode, how do i get the encoding of the page?
> pageencoding = '???'
You need to remember the HTTP connection that you got the HTML file
from. The webserver may have sent a Content-Type header whose charset
parameter names the encoding.
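
As a rough sketch of pulling the encoding out of such a header, assuming Python 3 and that you kept the raw header value as a string: the function name charset_from_content_type is hypothetical, and the parsing is delegated to the standard library's email.message.Message.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not header_value:
        return None  # the server sent no Content-Type header at all
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the name and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()
```

If the page was fetched with urllib.request, the response's headers object is itself derived from email.message.Message, so calling get_content_charset() on it directly should give the same result.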
> xmlencoding = 'whatever i parsed out of the file'
> htmlmetaencoding = 'whatever i parsed out of the metatag'
Depending on the library you use, these aren't that trivial, either.
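
For illustration only, a regex-based sketch of sniffing both declarations from the raw bytes, assuming Python 3. The name sniff_declared_encodings and the 4096-byte limit are assumptions. This is not a real parser: it only works for ASCII-compatible encodings, ignores a leading byte order mark, and can be fooled by comments or scripts, which is part of why this step is not trivial.

```python
import re

# encoding="..." inside an XML declaration at the very start of the data
_XML_DECL = re.compile(
    rb"^\s*<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']")
# charset=... inside a <meta> tag, either as its own attribute or
# embedded in content="text/html; charset=..."
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._-]+)", re.IGNORECASE)


def sniff_declared_encodings(data, limit=4096):
    """Return (xmlencoding, htmlmetaencoding); each is a str or None."""
    head = data[:limit]  # declarations are expected near the top
    xml = _XML_DECL.search(head)
    meta = _META_CHARSET.search(head)
    xmlencoding = xml.group(1).decode("ascii") if xml else None
    htmlmetaencoding = meta.group(1).decode("ascii") if meta else None
    return xmlencoding, htmlmetaencoding
```

Either value may be None or name a codec Python does not know, so whatever loop consumes them should tolerate both cases.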
Regards,
Martin