latin1 and cp1252 inconsistent?

buck at yelp.com
Fri Nov 16 16:44:03 EST 2012


Latin1 has a block of 32 undefined characters (bytes 0x80 through 0x9F).
Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D.
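
For anyone who wants to check that list against their own interpreter, here is a minimal sketch, assuming Python 3 and its built-in "cp1252" and "latin1" codecs. The variable names are just for illustration.

```python
# Collect every byte in 0x80..0x9F that the cp1252 codec refuses to decode.
undefined = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

# latin1, by contrast, maps each of these bytes to the code point of the same number.
latin1_ok = all(bytes([v]).decode("latin1") == chr(v) for v in range(0x80, 0xA0))

print([hex(v) for v in undefined])
print(latin1_ok)
```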

The byte 0x81 decoded with latin1 gives the Unicode code point U+0081.
Decoding the same byte with windows-1252 yields a stack trace with `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>`
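
A short reproduction of the two behaviours side by side, written as a sketch for Python 3 (bytes literals and `bytes.decode`), not necessarily the exact session that produced the trace above:

```python
raw = b"\x81"

# latin1 accepts the byte and returns the code point with the same number.
latin1_result = raw.decode("latin1")

# cp1252 rejects the same byte; capture the error instead of letting it propagate.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

print(repr(latin1_result))
print(cp1252_error)
```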

This seems inconsistent to me, given that this byte is equally undefined in the two standards.

Also, the HTML5 standard says:

When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0


The current implementation of windows-1252 isn't usable for this purpose (as a replacement for latin1), since it will throw an error in cases where latin1 would succeed.
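
One possible workaround, sketched here under the assumption of Python 3, is to register a custom error handler that falls back to latin1 for just the bytes cp1252 leaves undefined. The handler name "latin1_fallback" and the helper `decode_web_latin1` are made up for this example; `codecs.register_error` is the standard library hook. This is only an illustration of the idea, not a claim about how the codec ought to be fixed.

```python
import codecs


def latin1_fallback(exc):
    """Error handler: decode the offending bytes as latin1 and resume after them."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return bad.decode("latin1"), exc.end
    # Encoding and translating errors are not handled by this sketch.
    raise exc


# Hypothetical handler name, chosen for this example only.
codecs.register_error("latin1_fallback", latin1_fallback)


def decode_web_latin1(data):
    """Decode bytes as cp1252, mapping the five undefined bytes as latin1 would."""
    return data.decode("cp1252", errors="latin1_fallback")


text = decode_web_latin1(b"\x80\x81abc\x9d")
print(repr(text))
```

With this in place, bytes that cp1252 defines (such as 0x80, the euro sign) decode as usual, while 0x81 and the other four undefined bytes come through as the code points of the same number instead of raising.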


