how to detect the character encoding in a web page ?
Kwpolska
kwpolska at gmail.com
Mon Dec 24 07:16:16 EST 2012
On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
<kurt.alfred.mueller at gmail.com> wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $
And it sucks, because it guesses from byte statistics (magic) instead
of reading the HTML tags. The RIGHT thing to do for websites is to
detect the meta charset definition, which is either
<meta http-equiv="content-type" content="text/html; charset=utf-8">
or
<meta charset="utf-8">
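The second one is for HTML5 websites. Tag and attribute names are
case-insensitive, so you may need to lowercase them before matching,
and either form may carry the useless ` /` (XHTML-style self-closing)
at the end. But if somebody is using HTML5, you are almost certain to
get UTF-8.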
In today’s world, the proper assumption to make is “UTF-8 or GTFO”,
because nobody in their right mind would use anything else today.
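For illustration, here is a rough sketch of the meta-tag approach
described above, using only the standard library. The names
MetaCharsetParser and sniff_meta_charset are made up for this example,
and looking only at the first 4096 bytes is an assumption, not a rule.
html.parser already lowercases tag and attribute names and copes with
the trailing ` /`, so neither needs special handling here. If no
declaration is found it falls back to UTF-8.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remembers the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        # Self-closing tags (<meta ... />) also end up here by default.
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
            return
        if (attrs.get("http-equiv") or "").lower() == "content-type":
            # Old form: content="text/html; charset=utf-8"
            for part in (attrs.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().strip("\"'").lower()
                    break


def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset declared in the HTML bytes, or `default`."""
    # Latin-1 maps every byte to a character, so this decode cannot
    # fail, and the ASCII-only markup we care about survives intact.
    head = raw[:4096].decode("latin-1")
    parser = MetaCharsetParser()
    parser.feed(head)
    return parser.charset or default
```

Usage would be something like sniff_meta_charset(page_bytes), then
page_bytes.decode() with whatever name comes back. Note that a charset
given in the HTTP Content-Type header takes precedence over the meta
tag, and this sketch does not look at headers at all.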
--
Kwpolska <http://kwpolska.tk>
stop html mail | always bottom-post
www.asciiribbon.org | www.netmeister.org/news/learn2quote.html
GPG KEY: 5EAAEA16