latin1 and cp1252 inconsistent?

buck at yelp.com
Sat Nov 17 11:56:46 EST 2012


On Friday, November 16, 2012 4:33:14 PM UTC-8, Nobody wrote:
> On Fri, 16 Nov 2012 13:44:03 -0800, buck wrote:
> IOW: Microsoft's "embrace, extend, extinguish" strategy has been too
> successful and now we have to deal with it. If HTML content is tagged as
> using ISO-8859-1, it's more likely that it's actually Windows-1252 content
> generated by someone who doesn't know the difference.

Yes, that's exactly what it says.

> Given that the only differences between the two are for code points which
> are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
> ISO-8859-1 as Windows-1252 should be harmless.

"Should" is a wish. The reality is that documents (and especially URLs) exist that decode cleanly as latin1 but raise a UnicodeDecodeError when decoded as cp1252. I see this as a sign that a small refactoring of cp1252 is in order. The proposal is to change the "UNDEFINED" entries in the cp1252 mapping to "<control>" entries, as is done here:

http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt

and here:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
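To make the mismatch concrete, here is a minimal sketch (Python 3 syntax) that lists the byte values each codec refuses to decode. The helper name `undecodable_bytes` is made up for illustration, and the result reflects the cp1252 table shipped with CPython, not any other implementation.

```python
def undecodable_bytes(encoding):
    """Return the byte values (0-255) that the given codec cannot decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


cp1252_gaps = undecodable_bytes("cp1252")
latin1_gaps = undecodable_bytes("latin1")

# The bytes listed for cp1252 are the "UNDEFINED" entries in question.
print("cp1252:", [hex(b) for b in cp1252_gaps])
print("latin1:", [hex(b) for b in latin1_gaps])
```

With CPython's codecs, latin1 has no gaps, while cp1252 rejects exactly 0x81, 0x8D, 0x8F, 0x90 and 0x9D. Any document or URL containing one of those five bytes is the kind of input described above.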

This proposal is in line with the Unicode standard (http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf), which says:

> There are 65 code points set aside in the Unicode Standard for compatibility with the C0
> and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code
> points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit
> controls 0x00 to 0x1F (C0 controls), 0x7F (delete), and 0x80 to 0x9F (C1 controls), 
> respectively ... There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically
> equal to its corresponding Unicode code point.

IOW: Bytes in the C0/C1 range with no otherwise-defined semantics are "control codes", which decode to the Unicode code point of equal value.

This is exactly the section that allows latin1 to decode 0x81 to U+0081, even though ISO-8859-1 explicitly does not define semantics for that byte (section 6.2 of ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf).
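As a rough illustration of what the proposed behaviour would look like, the sketch below emulates it with a custom error handler (Python 3 syntax) instead of changing the codec itself. The handler name "c1-fallback" and the function `c1_fallback` are made up for this example; they are not part of Python or of any proposal text.

```python
import codecs


def c1_fallback(exc):
    """Map each undecodable byte to the code point of equal value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Resume decoding right after the offending byte(s).
    return "".join(chr(b) for b in bad), exc.end


# "c1-fallback" is a hypothetical handler name chosen for this sketch.
codecs.register_error("c1-fallback", c1_fallback)

# latin1 already maps every byte to the code point of equal value.
latin1_is_identity = all(
    ord(bytes([b]).decode("latin1")) == b for b in range(256)
)

# 0xE9 and 0x80 are defined in cp1252; 0x81 is one of the UNDEFINED bytes.
sample = b"caf\xe9 \x81 \x80"
decoded = sample.decode("cp1252", errors="c1-fallback")

print(latin1_is_identity)
print(ascii(decoded))
```

Here 0x80 still decodes to the euro sign (U+20AC) as cp1252 defines it, and only the undefined byte 0x81 falls through to U+0081, which is the same result latin1 gives for that byte.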


