Encoding conundrum

Wed Nov 21 08:02:43 EST 2012

On 11/21/2012 06:24 AM, danielk wrote:
> On Tuesday, November 20, 2012 6:03:47 PM UTC-5, Ian wrote:
>>> <snip>
>>
>> In Linux, your terminal encoding is probably either UTF-8 or Latin-1,
>> and either way it has no problems encoding that data for output.  In a
>> Windows cmd terminal, the default terminal encoding is cp437, which
>> can't support two of the three characters you mentioned above.
> It may not be able to encode those two characters but it is able to
> decode them.  That seems rather inconsistent (and contradictory) to me.

You encode characters (code points), but you never decode them.  You
decode bytes.  In some cases and in some encodings, the numeric value
(ord) of the two happens to be the same, e.g. for ASCII characters.  Or
to pick latin1, where each of the first 256 code points maps to the byte
with the same value.
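
As a rough sketch of that distinction, using only Python 3 built-ins
(the variable names are mine, purely for illustration):

```python
# Encoding goes from str (code points) to bytes; decoding goes the other way.
text = "A"
raw = text.encode("ascii")           # str -> bytes
assert raw == b"A"
assert raw.decode("ascii") == text   # bytes -> str

# For ASCII characters the code point and the byte value coincide.
assert ord(text) == raw[0] == 65

# latin1 maps each of the first 256 code points to the byte of the same value.
latin1_matches = all(chr(n).encode("latin-1") == bytes([n]) for n in range(256))
print(latin1_matches)  # True
```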

But to pick utf8 for example, which I use almost exclusively on Linux,
the character chr(255) is a lowercase y with a diaeresis accent.

>>> import unicodedata
>>> chr(255)
'ÿ'
>>> unicodedata.name(chr(255))
'LATIN SMALL LETTER Y WITH DIAERESIS'

>>> chr(255).encode()
b'\xc3\xbf'
>>> len(chr(255).encode())
2

It takes 2 bytes to encode that character.  (Since there are 1112064
encodable code points, nearly all of them take more than one byte to
encode in utf-8; the size ranges from 1 to 4 bytes.)  But naturally, the
first byte of those 2 cannot be one that's valid by itself as an encoded
character, or it'd be impossible to pick apart (decode) a byte string
starting with that one.
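
Here is a small sketch of where that count and those sizes come from.
The sample characters are an arbitrary choice of mine, one from each
length class:

```python
# Encodable code points: U+0000..U+10FFFF minus the surrogate range
# U+D800..U+DFFF, which UTF-8 does not encode.
encodable = 0x110000 - (0xE000 - 0xD800)
print(encodable)  # 1112064

# One sample per length class: 'A', 'ÿ', the euro sign, and an emoji.
samples = ["A", "\u00ff", "\u20ac", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```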

So, there is no character which can be encoded to a single byte 0xc3.
The same goes for every other byte value of 0x80 or above, for example
253 (0xfd).  In other words:

>>> bytes([253])
b'\xfd'
>>> bytes([253]).decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfd in position 0:
invalid start byte
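
The same check can be run over every single byte value.  This is only a
sketch; decodes_as_utf8 is a helper I made up for the purpose, not
something from the standard library:

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a complete, valid UTF-8 byte string."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(decodes_as_utf8(b"\xc3"))      # False: a lead byte with nothing after it
print(decodes_as_utf8(b"\xc3\xbf"))  # True: together they are chr(255)
print(decodes_as_utf8(b"\xfd"))      # False: never valid in UTF-8

# Only the byte values 0x00..0x7f are complete characters on their own.
single_ok = [b for b in range(256) if decodes_as_utf8(bytes([b]))]
print(single_ok == list(range(0x80)))  # True
```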

http://encyclopedia.thefreedictionary.com/UTF-8

has a description of the encoding rules.  Note they're really just
arithmetic, rather than arbitrary.  Ranges of characters encode to
various numbers of bytes.  The main rules are that characters below 0x80
are unchanged, and no valid character encoding is a prefix to any other
valid character encoding.
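
To show how mechanical the arithmetic is, here is a hand-rolled encoder
for a single code point.  It is a sketch for comparison only, and
utf8_encode is a hypothetical name of mine; real code should just call
str.encode().

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 by hand (illustrative helper only)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:        # 1 byte:  0xxxxxxx (unchanged)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(255))  # b'\xc3\xbf'

# Spot-check against the built-in codec.
for cp in (0x41, 0xFF, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

Every continuation byte starts with the bits 10, and no lead byte does,
which is why no complete encoding can be a prefix of another.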

Contrast that with cp437, where the particular 256 valid characters were
chosen based only on their usefulness, and many of them have code points
above 255.  Consequently, there must be many characters with code points
below 256 which cannot be encoded.
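
A quick sketch using Python's built-in cp437 codec bears this out.  The
helper cp437_can_encode and the particular characters picked are my own
choices for illustration:

```python
def cp437_can_encode(ch: str) -> bool:
    """Return True if ch has a byte value in cp437 (illustrative helper)."""
    try:
        ch.encode("cp437")
    except UnicodeEncodeError:
        return False
    return True

# chr(255) is in cp437, but not at byte value 255.
print(chr(255).encode("cp437"))  # b'\x98'

# Byte 0xB0 is a shading block whose code point is far above 255.
shade = bytes([0xB0]).decode("cp437")
print(hex(ord(shade)))  # 0x2591

# The copyright sign, code point 0xA9, has no cp437 byte at all.
print(cp437_can_encode("\u00a9"))  # False

# Every cp437 character above 255 displaces one code point below 256.
table = bytes(range(256)).decode("cp437")
above_255 = [ch for ch in table if ord(ch) > 255]
missing_below_256 = [n for n in range(256) if not cp437_can_encode(chr(n))]
print(len(above_255) == len(missing_below_256))
```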

-- 

DaveA