[Python-Dev] PEP 383 update: utf8b is now the error handler

"Martin v. Löwis" martin at v.loewis.de
Thu May 7 08:10:16 CEST 2009


> By the way, what are the ASCII characters that are not supported by Shift-JIS?
> Not many I suppose? (if I read the Wikipedia entry correctly, it's only the
> backslash and the tilde).

The problem with this encoding is that bytes below 128 can appear as the
second byte of a two-byte sequence:

py> "\x81@".decode("shift-jis")
u'\u3000'
py> "\x81A".decode("shift-jis")
u'\u3001'
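
The same check can be written for Python 3, where byte strings need the
b prefix. This is only a small sketch; the names pair, trail_is_ascii and
decoded are chosen here for illustration and do not come from the thread.

```python
# Python 3 sketch: the trail byte of a Shift-JIS pair can lie in the ASCII range.
pair = b"\x81@"                  # lead byte 0x81, trail byte 0x40 ("@" in ASCII)
trail_is_ascii = pair[1] < 128   # indexing bytes yields an int in Python 3
decoded = pair.decode("shift_jis")

print(trail_is_ascii)            # whether the second byte is below 128
print(ascii(decoded))            # the single character the pair decodes to
```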

So on decoding, it may be the second byte (i.e. the ASCII byte) that
causes a problem:

py> "\x81/".decode("shift-jis")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 0-1: illegal multibyte sequence

For the shift-jis codec, that's actually not a problem, though:

py> b"\x81/".decode("shift-jis","utf8b")
'\udc81/'

so the utf8b error handler will escape the first of the two bytes,
and then pass the second byte to the codec again, which then decodes
it as ASCII.
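
The escape-and-resume behaviour described above can be sketched with a
hand-written error handler. Note that "utf8b" was the working name in this
thread; released Python versions (3.1 and later) ship the handler as
"surrogateescape". The function escape_high_bytes and the registered name
"escape_high_bytes_demo" are hypothetical and exist only for this
illustration; this is not the actual implementation.

```python
import codecs

def escape_high_bytes(exc):
    """Hypothetical handler: map each undecodable byte >= 0x80 to U+DC00 + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    escaped = []
    pos = exc.start
    # Stop at the first ASCII byte so the codec gets to decode it normally.
    while pos < exc.end and exc.object[pos] >= 0x80:
        escaped.append(chr(0xDC00 + exc.object[pos]))
        pos += 1
    if not escaped:
        raise exc  # an ASCII byte is never escaped
    # Return the replacement text and the position where decoding resumes.
    return "".join(escaped), pos

codecs.register_error("escape_high_bytes_demo", escape_high_bytes)

data = b"\x81/"
by_sketch = data.decode("shift_jis", "escape_high_bytes_demo")
by_builtin = data.decode("shift_jis", "surrogateescape")

print(ascii(by_sketch))
print(ascii(by_builtin))
```

Both calls should give the same string: the lead byte 0x81 becomes the lone
surrogate U+DC81, and the "/" is handed back to the codec and decoded as
plain ASCII.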

Regards,
Martin

