how to detect the character encoding in a web page ?

Kwpolska kwpolska at gmail.com
Mon Dec 24 07:16:16 EST 2012


On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
<kurt.alfred.mueller at gmail.com> wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $

And it sucks, because it relies on statistical guessing instead of
reading the HTML tags.  The RIGHT thing to do for websites is to detect
the meta charset definition, which is

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

or

    <meta charset="utf-8">

The second one is for HTML5 websites.  For both, you have to match
case-insensitively and allow for the useless ` /` at the end.  But if
somebody is using HTML5, you are pretty much guaranteed to get UTF-8.
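
A rough sketch of that lookup, in case it helps.  The name
sniff_meta_charset and the 4096-byte scan window are my own choices for
illustration, not anything from chardet or the standard library, and a
regex is only an approximation of what a real HTML parser would do:

```python
import re

# Matches both <meta http-equiv=... content="...; charset=X"> and
# <meta charset="X">, in any letter case, with or without a trailing " /".
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(html_bytes, default="utf-8"):
    """Return the charset named in a <meta> tag (lowercased), else default.

    Hypothetical helper: it only looks at the first 4096 bytes, on the
    assumption that the declaration sits near the top of the document.
    """
    match = META_CHARSET_RE.search(html_bytes[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

You would feed it the raw bytes of the page (before any decoding) and
then use the returned name to decode them.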

In today’s world, the proper assumption to make is “UTF-8 or GTFO”,
because nobody in their right mind would use something else today.
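
In code, that assumption could look roughly like this.  The function
name decode_utf8_or_fail is made up for the example; the point is just
a strict decode that raises instead of guessing:

```python
def decode_utf8_or_fail(page_bytes):
    """Decode page bytes as UTF-8 and raise UnicodeDecodeError otherwise."""
    # "utf-8-sig" behaves like UTF-8 but also drops a leading byte order
    # mark if the page has one.
    return page_bytes.decode("utf-8-sig", errors="strict")
```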

-- 
Kwpolska <http://kwpolska.tk>
stop html mail      | always bottom-post
www.asciiribbon.org | www.netmeister.org/news/learn2quote.html
GPG KEY: 5EAAEA16
