latin1 and cp1252 inconsistent?

Nobody nobody at nowhere.com
Sat Nov 17 14:15:15 EST 2012


On Sat, 17 Nov 2012 08:56:46 -0800, buck wrote:

>> Given that the only differences between the two are for code points
>> which are in the C1 range (0x80-0x9F), which should never occur in HTML,
>> parsing ISO-8859-1 as Windows-1252 should be harmless.
> 
> "should" is a wish. The reality is that documents (and especially URLs)
> exist that can be decoded with latin1, but will backtrace with cp1252.

In which case, they're probably neither ISO-8859-1 nor Windows-1252, but
some other (unknown) encoding which has acquired the ISO-8859-1 label
"by default".
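
For reference, the discrepancy buck describes can be reproduced with a
short script. This is only a sketch, assuming Python 3 and its bundled
"latin1" and "cp1252" codecs; the helper name undecodable_in is made up
for illustration. It lists the single-byte values each codec refuses to
decode. With the stock codecs, latin1 accepts all 256 values, while
cp1252 rejects the five it leaves unassigned (0x81, 0x8D, 0x8F, 0x90,
0x9D).

```python
# Sketch (Python 3): find the byte values a codec refuses to decode.
# `undecodable_in` is a hypothetical helper, not a standard-library function.

def undecodable_in(encoding):
    """Return the single-byte values that `encoding` cannot decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad

latin1_gaps = undecodable_in("latin1")
cp1252_gaps = undecodable_in("cp1252")

print("latin1 rejects:", [hex(b) for b in latin1_gaps])
print("cp1252 rejects:", [hex(b) for b in cp1252_gaps])
```

Any input containing one of those five bytes decodes as latin1 (to a C1
control character) but raises UnicodeDecodeError under cp1252.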

In that situation, if you still need to know the encoding, you have to
fall back on heuristics such as those employed by the chardet library.
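
As a rough illustration of what "heuristics" can mean here, the sketch
below tries a list of candidate encodings in order and keeps the first
one that decodes without error. This is not chardet's algorithm, which
does statistical guessing; it is a much cruder stand-in using only the
standard library. The function name guess_decode and the default
candidate list are assumptions chosen for the example.

```python
# Sketch (Python 3): first-successful-decode heuristic.
# `guess_decode` and its default candidates are hypothetical, not part of
# chardet or the standard library.

def guess_decode(data, candidates=("utf-8", "cp1252", "latin1")):
    """Return (text, encoding) for the first candidate that decodes `data`."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes; try the next one
    raise ValueError("no candidate encoding could decode the data")

text, used = guess_decode(b"caf\xe9")
print(used, repr(text))
```

The order matters: utf-8 goes first because it is the strictest, and
latin1 goes last because it never fails, so it acts as a catch-all. A
successful decode only means the bytes were acceptable to that codec,
not that the guess is right. If chardet is installed,
chardet.detect(data) returns a dict that includes 'encoding' and
'confidence' entries, which is usually a better starting point.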




More information about the Python-list mailing list