how to detect the character encoding in a web page?

Roy Smith roy at panix.com
Mon Dec 24 11:46:03 EST 2012


In article <rn%Bs.693798$nB6.605938 at fx21.am4>,
 Alister <alister.ware at ntlworld.com> wrote:

> Indeed due to the poor quality of most websites it is not possible to be 
> 100% accurate for all sites.
> 
> personally I would start by checking the doctype & then the metadata 
> as these should be quick & correct; I then use chardet only if these 
> fail to provide any result.

I agree that checking the metadata is the right thing to do.  But, I 
wouldn't go so far as to assume it will always be correct.  There's a 
lot of crap out there with perfectly formed metadata which just happens 
to be wrong.
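
For concreteness, the order Alister describes might look something 
like this (a rough sketch, not anybody's production code; it assumes 
the requests and chardet packages, and guess_encoding is just an 
illustrative name):

    import re
    import requests
    import chardet

    def guess_encoding(url):
        resp = requests.get(url)
        # 1. Charset declared in the HTTP Content-Type header,
        #    e.g. "text/html; charset=utf-8"
        m = re.search(r'charset=([\w-]+)',
                      resp.headers.get('content-type', ''))
        if m:
            return m.group(1)
        # 2. <meta charset="..."> (or the older http-equiv form),
        #    looked for in the first couple of KB of the document
        m = re.search(br'<meta[^>]+charset=["\']?([\w-]+)',
                      resp.content[:2048], re.IGNORECASE)
        if m:
            return m.group(1).decode('ascii')
        # 3. Only then fall back to statistical detection
        return chardet.detect(resp.content)['encoding']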

Although it pains me greatly to quote Ronald Reagan as a source of 
wisdom, I have to admit he got it right with "Trust, but verify".  It's 
the only way to survive in the Unicode world.  Write defensive code.  
Wrap try blocks around calls that might raise exceptions if the external 
data is borked w/r/t what the metadata claims it should be.
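
Something along these lines (decode_defensively is just an 
illustrative name; chardet is again assumed as the fallback):

    import chardet

    def decode_defensively(raw_bytes, claimed_encoding):
        try:
            return raw_bytes.decode(claimed_encoding)
        except (UnicodeDecodeError, LookupError):
            # The metadata lied, or named a codec Python doesn't
            # know.  Fall back to detection, and to a lossy decode
            # as a last resort so we always get *something* back.
            detected = chardet.detect(raw_bytes)['encoding'] or 'utf-8'
            return raw_bytes.decode(detected, errors='replace')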


