how to detect the character encoding in a web page ?

Albert van der Horst albert at spenarnc.xs4all.nl
Mon Jan 14 07:50:23 EST 2013


In article <roy-DF05DA.11460324122012 at news.panix.com>,
Roy Smith  <roy at panix.com> wrote:
>In article <rn%Bs.693798$nB6.605938 at fx21.am4>,
> Alister <alister.ware at ntlworld.com> wrote:
>
>> Indeed due to the poor quality of most websites it is not possible to be
>> 100% accurate for all sites.
>>
>> personally I would start by checking the doc type & then the meta data as
>> these should be quick & correct, I then use chardectect only if these
>> fail to provide any result.
>
>I agree that checking the metadata is the right thing to do.  But, I
>wouldn't go so far as to assume it will always be correct.  There's a
>lot of crap out there with perfectly formed metadata which just happens
>to be wrong.
>
>Although it pains me greatly to quote Ronald Reagan as a source of
>wisdom, I have to admit he got it right with "Trust, but verify".  It's

Not surprisingly, as an actor, Reagan was as good as his script.
This one he got from Stalin.

>the only way to survive in the unicode world.  Write defensive code.
>Wrap try blocks around calls that might raise exceptions if the external
>data is borked w/r/t what the metadata claims it should be.

The way to go, of course.

Groetjes Albert
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert at spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst




More information about the Python-list mailing list