How to detect the character encoding in a web page?

Alister alister.ware at ntlworld.com
Mon Dec 24 11:27:03 EST 2012


On Mon, 24 Dec 2012 13:50:39 +0000, Steven D'Aprano wrote:

> On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
> 
>> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
>> <kurt.alfred.mueller at gmail.com> wrote:
>>> $ wget -q -O - http://python.org/ | chardetect.py
>>> stdin: ISO-8859-2 with confidence 0.803579722043
>>> $
>> 
>> And it sucks, because it relies on magic rather than on reading the
>> HTML tags. The RIGHT thing to do for websites is to detect the meta
>> charset declaration, which is
>> 
>>     <meta http-equiv="content-type" content="text/html; charset=utf-8">
>> 
>> or
>> 
>>     <meta charset="utf-8">
>> 
>> The second form is used by HTML5 websites, and parsing both may require
>> case conversion and stripping the useless ` /` at the end.  But if
>> somebody is using HTML5, you are pretty much guaranteed to get UTF-8.
>> 
>> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
>> Because nobody in their right mind would use anything else today.
> 
> Alas, there are many, many, many, MANY websites that are created by
> people who are *not* in their right mind. To say nothing of 15-year-old
> websites that use a legacy encoding. And to support those, you may need
> to guess the encoding, and for that, chardetect.py is the solution.
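
For what it's worth, pulling that meta declaration out doesn't need
anything beyond the standard library. A rough sketch (Python 3, untested;
the class and function names here are my own invention):

from html.parser import HTMLParser

class CharsetSniffer(HTMLParser):
    """Collect the first charset declared in a <meta> tag."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type"
            #                   content="text/html; charset=utf-8">
            for part in (attrs.get("content") or "").split(";"):
                part = part.strip().lower()
                if part.startswith("charset="):
                    self.charset = part[len("charset="):]

def sniff_meta_charset(html_bytes):
    # Decoding as Latin-1 never raises, and the declaration itself is
    # required to be ASCII-compatible, so this is safe for sniffing.
    sniffer = CharsetSniffer()
    sniffer.feed(html_bytes.decode("latin-1"))
    return sniffer.charset  # None if no declaration was found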

Indeed, given the poor quality of most websites, it is not possible to be 
100% accurate for all sites.

Personally, I would start by checking the doctype and then the metadata, 
as these should be quick and correct; I would then use chardetect only if 
they fail to produce a result.
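
Putting it together, something like this (again a rough, untested sketch;
sniff_meta_charset is the helper sketched above, chardet is the third-party
package that powers chardetect.py, and I check the HTTP Content-Type header
first, since the server's own declaration is the cheapest check of all):

import urllib.request
import chardet  # third-party; the library behind chardetect.py

def guess_encoding(url):
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # 1. Trust the charset in the HTTP Content-Type header, if any.
        declared = response.headers.get_content_charset()
    if declared:
        return declared, "HTTP header"
    # 2. Otherwise look for a <meta> declaration in the document itself.
    meta = sniff_meta_charset(raw)
    if meta:
        return meta, "meta tag"
    # 3. Last resort: statistical guessing.
    guess = chardet.detect(raw)
    return guess["encoding"], "chardet, %.2f confidence" % guess["confidence"]

print(guess_encoding("http://python.org/"))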


-- 
I have found little that is good about human beings.  In my experience
most of them are trash.
		-- Sigmund Freud


