latin1 and cp1252 inconsistent?

Ian Kelly ian.g.kelly at gmail.com
Fri Nov 16 19:20:24 EST 2012


On Fri, Nov 16, 2012 at 4:27 PM,  <buck at yelp.com> wrote:
> They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf
>
> """ The shaded positions in the code table correspond
>     to bit combinations that do not represent graphic
>     characters. Their use is outside the scope of
>     ISO/IEC 8859; it is specified in other International
>     Standards, for example ISO/IEC 6429.
> """

It gets murkier than that.  I don't want to spend time hunting down
the relevant documents, so I'll just quote from Wikipedia:

"""
In 1992, the IANA registered the character map ISO_8859-1:1987, more
commonly known by its preferred MIME name of ISO-8859-1 (note the
extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on
the Internet. This map assigns the C0 and C1 control characters to the
unassigned code values thus provides for 256 characters via every
possible 8-bit value.
"""

http://en.wikipedia.org/wiki/ISO/IEC_8859-1#History
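
Python's 'latin-1' codec appears to follow the IANA-style mapping described in that quote. A quick check, as a sketch only (the variable names are just for illustration):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decode them all with Python's latin-1 codec; no error handler is needed.
decoded = all_bytes.decode('latin-1')

# Each byte N should come out as the code point with the same number,
# including the C0 (0x00-0x1F) and C1 (0x80-0x9F) control ranges.
identity = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# Encoding the result again should give back the original bytes.
round_trip = decoded.encode('latin-1') == all_bytes

print(identity, round_trip)
```

If both values are true, every 8-bit value is assigned and the codec is fully reversible.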

>> You can use a non-strict error handling scheme to prevent the error.
>> >>> b'hello \x81 world'.decode('cp1252', 'replace')
>> 'hello \ufffd world'
>
> This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.

Well, what characters would you have these bytes decode to,
considering that they're undefined?  If the string is really CP-1252,
then the presence of undefined characters in the document does not
signify "data".  They're just junk bytes, possibly indicative of data
corruption.  If on the other hand the string is really Latin-1, and
you *know* that it is Latin-1, then you should probably forget the
aliasing recommendation and just decode it as Latin-1.
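
To see exactly which bytes are at issue, here is a minimal sketch that asks each codec which single-byte values it refuses under strict decoding. The helper name undefined_bytes is made up for this example; it is not part of any library.

```python
def undefined_bytes(encoding):
    """Return the byte values that a strict decode with `encoding` rejects."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad

# Bytes with no character assigned in each codec.
cp1252_holes = undefined_bytes('cp1252')
latin1_holes = undefined_bytes('latin-1')

print([hex(v) for v in cp1252_holes])
print(latin1_holes)

# Decoding as Latin-1 keeps the data reversible, even for a byte
# that CP-1252 leaves undefined.
data = b'hello \x81 world'
restored = data.decode('latin-1').encode('latin-1')
print(restored == data)
```

With CPython's bundled codecs, the CP-1252 list should contain only 0x81, 0x8d, 0x8f, 0x90 and 0x9d, and the Latin-1 list should be empty.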

Apparently this Latin-1 -> CP-1252 encoding aliasing is already
commonly performed by modern user agents.  What do IE and Firefox do
when a page is declared as Latin-1 but contains bytes that are
undefined in CP-1252?
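
One conceivable behaviour for such aliasing is to decode as CP-1252 and fall back to the same-numbered Latin-1 control character for the undefined bytes. I have not verified that any browser actually does this; the sketch below only shows that the idea can be expressed with a custom error handler. The handler name 'c1fallback' and the function c1_fallback are hypothetical, chosen for this example.

```python
import codecs

# The byte values that Python's cp1252 codec leaves undefined.
HOLES = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def c1_fallback(exc):
    """Hypothetical handler: treat cp1252's undefined bytes as Latin-1."""
    if isinstance(exc, UnicodeDecodeError):
        # Map each undecodable byte to the code point with the same number.
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Reverse direction: only the five hole code points are accepted.
        bad = exc.object[exc.start:exc.end]
        if all(ord(ch) in HOLES for ch in bad):
            return bytes(ord(ch) for ch in bad), exc.end
    # Anything else is a genuine error.
    raise exc

codecs.register_error('c1fallback', c1_fallback)

raw = b'hello \x81 world \x80'
text = raw.decode('cp1252', 'c1fallback')
back = text.encode('cp1252', 'c1fallback')

print(ascii(text))
print(back == raw)
```

Unlike 'replace', this keeps the decode reversible: 0x80 still becomes the euro sign as CP-1252 specifies, while 0x81 survives as U+0081 and encodes back to the same byte.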


