latin1 and cp1252 inconsistent?

buck at yelp.com buck at yelp.com
Fri Nov 16 18:27:54 EST 2012


On Friday, November 16, 2012 2:34:32 PM UTC-8, Ian wrote:
> On Fri, Nov 16, 2012 at 2:44 PM,  <buck> wrote:
> 
> > Latin1 has a block of 32 undefined characters.
> 
> 
> These characters are not undefined.  0x80-0x9f are the C1 control
> codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
> their Unicode mappings are well defined.

They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf

""" The shaded positions in the code table correspond
    to bit combinations that do not represent graphic
    characters. Their use is outside the scope of
    ISO/IEC 8859; it is specified in other International
    Standards, for example ISO/IEC 6429.


However it's reasonable for 0x81 to decode to U+81 because the unicode standard says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

""" The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.


> You can use a non-strict error handling scheme to prevent the error.
> >>> b'hello \x81 world'.decode('cp1252', 'replace')
> 'hello \ufffd world'

This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.



More information about the Python-list mailing list