help with (x)html / xml encoding...

Fri Mar 21 05:55:05 EST 2003

Steven Taschuk wrote:
> You should also check the data in urlopen(foo).info() for a
> Content-Type header; the value of that header is supposed
> to take precedence over either of the above.

hello,

thanks, but i'm still confused. here is an example:

example.html :
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
é : Ã©<br />
à : Ã <br />
</body>
</html>

and in python i do :
>>> import urllib
>>> sock = urllib.urlopen('http://192.168.0.1/example.html')
>>> sock.info().getencoding()
'7bit'

so getencoding() says my file example.html is encoded as '7bit' (which seemed
ok to me, since some utf-8 characters take two bytes, like Ã© for é), but to
"understand" what the file contains i still have to check the
<meta http-equiv="content-type"> tag myself...
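
as far as i can tell, getencoding() reports the Content-Transfer-Encoding
header (falling back to '7bit' when there is none), while the charset is a
separate parameter of Content-Type. here is a minimal sketch of that
difference. it is written for python 3's email.message.Message rather than the
urllib used above, and the describe() helper and the header values are made up
for illustration, not taken from my server:

```python
from email.message import Message

def describe(headers):
    """Return (transfer_encoding, charset) for a list of (name, value) headers.

    Hypothetical helper: it only mimics how a message object answers the
    two questions, it does not talk to any server.
    """
    msg = Message()
    for name, value in headers:
        msg[name] = value
    # Transfer encoding: how the bytes were wrapped for transport.
    # '7bit' is the conventional default when the header is absent.
    transfer = msg.get("Content-Transfer-Encoding", "7bit")
    # Charset: which character encoding the text itself uses (or None).
    charset = msg.get_content_charset()
    return transfer, charset

with_charset = describe([("Content-Type", "text/html; charset=UTF-8")])
without_charset = describe([("Content-Type", "text/html")])
print(with_charset)
print(without_charset)
```

if that reading is right, '7bit' says nothing about utf-8 versus iso-8859-1,
and a server that sends a bare "text/html" gives no charset at all.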

>>> data = sock.read()
>>> sock.close()
>>> print unicode(data, "utf8").encode("iso-8859-1")  # or .encode("cp850")
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
é : é<br />
à : à<br />
</body>
</html>
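
to convince myself why "Ã©" turns back into "é", here is the same round trip
on a single character, as a small python 3 sketch (the session above is
python 2; the variable names are just for illustration):

```python
# The two bytes that UTF-8 uses for the character é (U+00E9).
raw = b"\xc3\xa9"

# Read as ISO-8859-1, each byte becomes its own character: Ã followed by ©.
as_latin1 = raw.decode("iso-8859-1")

# Read as UTF-8, the two bytes form one character: é.
text = raw.decode("utf-8")

# Re-encoding that one character for a single-byte target charset.
latin1_bytes = text.encode("iso-8859-1")
cp850_bytes = text.encode("cp850")

# ascii() keeps the output printable on any console.
print(ascii(as_latin1), ascii(text), latin1_bytes, cp850_bytes)
```

so the file is fine; the "two chars" are just one utf-8 character displayed
by something that assumes one byte per character.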

so in the end i never use the '7bit' value returned by
sock.info().getencoding() at all.

am i missing something ???
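
what i think i should be doing instead, following the precedence described in
the quoted reply, is roughly the sketch below (python 3). pick_charset(), the
META_CHARSET regex and the iso-8859-1 default are my own assumptions for
illustration, not an existing library api, and the regex is far from a real
html parser:

```python
import re
from email.message import Message

# Rough pattern for charset=... inside a <meta> tag (illustrative only).
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def pick_charset(content_type, body, default="iso-8859-1"):
    """Guess the charset of an HTML page.

    content_type: value of the HTTP Content-Type header, or None.
    body: the raw bytes of the page.
    default: assumed fallback when neither source names a charset.
    """
    # 1. The HTTP header takes precedence when it names a charset.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # 2. Otherwise look for a <meta ... charset=...> declaration.
    match = META_CHARSET.search(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing declared: fall back to the assumed default.
    return default

page = b'<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
print(pick_charset("text/html", page))
print(pick_charset("text/html; charset=ISO-8859-1", page))
```

with that, data.decode(pick_charset(header, data)) would replace the
hard-coded "utf8" in my session above.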

thanks,