latin1 and cp1252 inconsistent?

Sat Nov 17 13:08:49 EST 2012

On Sat, Nov 17, 2012 at 9:56 AM,  <buck at yelp.com> wrote:
> "should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1, but will backtrace with cp1252. I see this as a sign that a small refactorization of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:
>
> http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt
>
> and here:
>
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

The README for the "BestFit" document states:

"""
These tables include "best fit" behavior which is not present in the
other files. Examples of best fit
are converting fullwidth letters to their counterparts when converting
to single byte code pages, and
mapping the Infinity character to the number 8.
"""

This does not sound like appropriate behavior for a generalized
conversion scheme.  It is also noted that the "BestFit" document is
not authoritative at:

http://www.iana.org/assignments/charset-reg/windows-1252

> This is in line with the unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
>
>> There are 65 code points set aside in the Unicode Standard for compatibility with the C0
>> and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code
>> points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit
>> controls 0x00 to 0x1F (C0 controls), 0x7F (delete), and 0x80 to 0x9F (C1 controls),
>> respectively ... There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
>> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically
>> equal to its corresponding Unicode code point.
>
> IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the unicode-point of equal value.
>
> This is exactly the section which allows latin1 to decode 0x81 to U+81, even though ISO-8859-1 explicitly does not define semantics for that byte (6.2 ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)

But Latin-1 explicitly defers to to the control codes for those
characters.  CP-1252 does not; the reason those characters are left
undefined is to allow for future expansion, such as when Microsoft
added the Euro sign at 0x80.

Since we're talking about conversion from bytes to Unicode, I think
the most authoritative source we could possibly reference would be the
official ISO 10646 conversion tables for the character sets in
question.  I understand those are to be found here:

http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

and here:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Note that the ISO-8859-1 mapping defines the C0 and C1 codes, whereas
the cp1252 mapping leaves those five codes undefined.  This would seem
to indicate that Python is correctly decoding CP-1252 according to the
Unicode standard.