How to detect the character encoding in a web page?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Dec 24 08:50:39 EST 2012


On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:

> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
> <kurt.alfred.mueller at gmail.com> wrote:
>> $ wget -q -O - http://python.org/ | chardetect.py
>> stdin: ISO-8859-2 with confidence 0.803579722043
>> $
> 
> And it sucks, because it uses magic instead of reading the HTML tags. The
> RIGHT thing to do for websites is detect the meta charset definition,
> which is
> 
>     <meta http-equiv="content-type" content="text/html; charset=utf-8">
> 
> or
> 
>     <meta charset="utf-8">
> 
> The second one is for HTML5 websites, and both may need case-insensitive
> matching and may carry the useless ` /` at the end.  But if somebody is
> using HTML5, you are pretty much guaranteed to get UTF-8.
> 
> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
> Because nobody in their right mind would use something else today.
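
For reference, reading the declaration described above might look
roughly like the sketch below. The function name declared_charset and
the 4096-byte scan limit are my own assumptions, not anything from
chardet or the standard library, and a regular expression is only an
approximation of real HTML parsing (a "charset=" buried inside some
other meta tag's content would fool it).

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="content-type" content="text/html; charset=...">.
# re.IGNORECASE handles <META CHARSET=...>; a trailing " /" is simply ignored.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def declared_charset(raw, limit=4096):
    """Return the charset named in a <meta> tag, lower-cased, or None.

    raw is the page as bytes; only the first `limit` bytes are scanned
    (hypothetical cut-off chosen for this sketch).
    """
    match = _META_CHARSET.search(raw[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

The result is only what the page claims about itself; whether the
bytes actually decode with that encoding is a separate question.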

Alas, there are many, many, many, MANY websites that are created by 
people who are *not* in their right mind. To say nothing of 15-year-old 
websites that use a legacy encoding. To support those, you may need 
to guess the encoding, and for that, chardetect.py is the solution.
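
A rough sketch of how the two approaches can be combined: trust a
declared encoding (or UTF-8) when the bytes decode cleanly, and only
fall back to guessing when they don't. The name decode_page and its
"declared" parameter are made up for illustration; chardet.detect() is
the library function behind chardetect.py, and chardet is a third-party
package, so the sketch skips that step when it is not installed.

```python
def decode_page(raw, declared=None):
    """Return (text, encoding_used) for the bytes of a web page.

    declared is whatever encoding the page or server claimed, if any.
    """
    # 1. Try the declared encoding, then UTF-8, accepting only a clean decode.
    for name in (declared, "utf-8"):
        if not name:
            continue
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            pass  # unknown codec name, or the bytes do not fit it

    # 2. Otherwise guess statistically, if chardet happens to be installed.
    try:
        import chardet
    except ImportError:
        chardet = None
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]
        if guess:
            try:
                return raw.decode(guess), guess
            except (LookupError, UnicodeDecodeError):
                pass

    # 3. Last resort: Latin-1 maps every byte to a character, so it cannot fail.
    return raw.decode("latin-1"), "latin-1"
```

Note that a clean decode does not prove the declaration was right: a
single-byte legacy encoding will happily decode almost anything, so a
wrong declaration can still produce mojibake without raising an error.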


-- 
Steven


