help with (x)html / xml encoding...
lt
glt2010pas.de.spam at yahoo.fr
Fri Mar 21 05:55:05 EST 2003
Steven Taschuk wrote:
> You should also check the data in urlopen(foo).info() for a
> Content-Type header; the value of that header is supposed
> to take precedence over either of the above.
Hello,
Thanks, but I'm still confused. Here is an example.
example.html:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
é : é<br />
à : Ã <br />
</body>
</html>
And in Python I do:
>>> import urllib
>>> sock = urllib.urlopen('http://192.168.0.1/example.html')
>>> sock.info().getencoding()
'7bit'
That is, my file example.html is reported as being encoded in '7bit' (which
seems OK to me, since some UTF-8 characters need two chars, like é), but if I
need to "understand" what it contains I still have to check the content-type
<meta> tag...
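For reference, getencoding() is documented as returning the value of the Content-Transfer-Encoding header (with '7bit' as the default when the header is absent), while the character set lives in the charset parameter of Content-Type. Below is a minimal sketch of that distinction, assuming Python 3's email package rather than the Python 2 urllib used in this post; the header values are typed in by hand for illustration and are not what the server at 192.168.0.1 actually sent.

```python
from email import message_from_string

# Hypothetical response headers, written out by hand for illustration.
raw_headers = (
    "Content-Type: text/html; charset=UTF-8\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
)
headers = message_from_string(raw_headers)

# How the body is wrapped for transport (says nothing about characters).
transfer_encoding = headers.get("Content-Transfer-Encoding", "7bit")

# The character set needed to decode the body into text.
charset = headers.get_content_charset()

print(transfer_encoding)
print(charset)
```

If the Content-Type header carries no charset parameter, get_content_charset() returns None, and the declaration inside the document is the next place to look.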
>>> data = sock.read()
>>> sock.close()
>>> unicode(data, "utf8").encode("cp850")
>>> print unicode(data, "utf8").encode("iso-8859-1")
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
é : é<br />
à : à<br />
</body>
</html>
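The decode and re-encode step above can be sketched without hard-coding "utf8", by reading the charset from the page itself. This is only a rough illustration in Python 3: sniff_meta_charset is a hypothetical helper, the iso-8859-1 fallback is an assumption, the byte string stands in for sock.read(), and a charset in the HTTP Content-Type header, when present, should still take precedence over the one found here.

```python
import re

# Hypothetical helper pattern: finds charset=... near the top of the page.
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)

def sniff_meta_charset(data, default="iso-8859-1"):
    """Return the charset declared in the first 1024 bytes, else the default."""
    match = META_CHARSET.search(data[:1024])
    return match.group(1).decode("ascii").lower() if match else default

# Stand-in for sock.read(): a small page stored as UTF-8 bytes.
data = (
    "<html>\n<head>\n"
    '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n'
    "</head>\n<body>\n\u00e9 : \u00e0<br />\n</body>\n</html>\n"
).encode("utf-8")

charset = sniff_meta_charset(data)   # declared charset, lower-cased
text = data.decode(charset)          # bytes -> text
latin1 = text.encode("iso-8859-1", errors="replace")  # text -> terminal encoding
```

Here errors="replace" substitutes "?" for any character the target encoding cannot represent, instead of raising an exception.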
So in the end I never use the '7bit' encoding returned by
sock.info().getencoding().
Am I missing something?
Thanks,