[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Wed Apr 22 17:00:51 EDT 2009

>> The python-escape codec is only used/meaningful if the env encoding
>> is not UTF-8. For any other encoding, it is assumed that no character
>> actually maps to the private-use characters.
> 
> Which should be true for any encoding from the pre-unicode era, but not
> for UTF-16/32 and variants.

Right. However, these can't appear as environment/file system encodings,
because they use null bytes.
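
A quick illustration of the null-byte point, as a minimal sketch (the
file name is made up for the example). C-level file system and
environment APIs treat a zero byte as the string terminator, and the
wide encodings produce zero bytes even for plain ASCII names:

```python
# Hypothetical file name, used only for illustration.
name = "a.txt"

utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")
utf32 = name.encode("utf-32-le")

# UTF-8 never emits a zero byte for a non-NUL character;
# UTF-16 and UTF-32 pad every ASCII character with zero bytes.
has_nul = {
    "utf-8": b"\x00" in utf8,
    "utf-16-le": b"\x00" in utf16,
    "utf-32-le": b"\x00" in utf32,
}
print(has_nul)
```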

>> Why would it become specific? It can work the same way for any encoding:
>> take U+F01xx, and generate the byte xx.
> 
> If any error callback emits bytes these byte sequences must be legal in
> the target encoding, which depends on the target encoding itself.

No. The whole process started with data having an *invalid* encoding
in the source encoding (which, after the roundtrip, is now the
target encoding). So the python-escape error handler deliberately
produces byte sequences that are invalid in the environment encoding
(hence the additional permission of having it produce bytes instead
of characters).
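
To make the round trip concrete, here is a rough sketch of such a
handler using codecs.register_error. The handler name
"python-escape-sketch", the function name and the sample bytes are
assumptions made for this example, not the actual implementation; only
the U+F01xx mapping is taken from the discussion above. It uses ASCII
as a stand-in for a non-UTF-8 environment encoding.

```python
import codecs

PUA_BASE = 0xF0100  # U+F01xx, the range discussed in this thread


def python_escape(exc):
    """Illustrative handler: undecodable byte xx <-> U+F01xx."""
    if isinstance(exc, UnicodeDecodeError):
        # Decoding: map each offending byte xx to the character U+F01xx.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encoding: map U+F01xx back to the raw byte xx; anything else
        # that the codec cannot encode is still a genuine error.
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            code = ord(ch)
            if not PUA_BASE <= code <= PUA_BASE + 0xFF:
                raise exc
            out.append(code - PUA_BASE)
        return bytes(out), exc.end
    raise exc


# Hypothetical name, chosen to avoid clashing with any real handler.
codecs.register_error("python-escape-sketch", python_escape)

raw = b"caf\xe9.txt"  # \xe9 is not valid ASCII
name = raw.decode("ascii", "python-escape-sketch")
back = name.encode("ascii", "python-escape-sketch")
print(ascii(name), back == raw)
```

Note that the bytes coming out of the encoder are, by construction,
not valid in the target encoding: decoding them strictly fails again.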

> However for the normal use of this error handler this might be
> irrelevant, because those filenames that get encoded were constructed in
> such a way that reencoding them regenerates the original byte sequence.

Exactly so. The error handler is not of much use outside this specific
scenario.

>> utf-8b is a new codec. However, the utf-8b codec is only used if the
>> env encoding would otherwise be utf-8. For utf-8b, the error handler
>> is indeed unnecessary.
> 
> Wouldn't it make more sense to be consistent how non-decodable bytes get
> decoded? I.e. should the utf-8b codec decode those bytes to PUA
> characters too (and refuse to encode them, so the error handler outputs
> them)?

Unfortunately, that won't work. If the original encoding is UTF-8, and
uses PUA characters, then, on re-encoding, it's not possible to tell
whether to encode as a PUA character, or as an invalid byte.

This was my original proposal a year ago, and people immediately
suggested that it is not at all acceptable if there is the slightest
chance of information loss. Hence the current PEP.
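
The ambiguity can be shown in a few lines. This is only a sketch: the
handler name "pua-escape-sketch" and the sample bytes are assumptions
for the example. It applies the U+F01xx mapping to UTF-8 input and
shows two different byte strings decoding to the same text, so the
encoder cannot know which one to reproduce.

```python
import codecs

PUA_BASE = 0xF0100  # U+F01xx, the range discussed in this thread


def pua_escape_decode(exc):
    """Illustrative decode-side handler: byte xx -> U+F01xx."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    raise exc


# Hypothetical name, chosen to avoid clashing with any real handler.
codecs.register_error("pua-escape-sketch", pua_escape_decode)

invalid = b"\xe9"                       # a lone byte, not valid UTF-8
genuine = "\U000f01e9".encode("utf-8")  # a real PUA character, validly encoded

from_invalid = invalid.decode("utf-8", "pua-escape-sketch")
from_genuine = genuine.decode("utf-8", "pua-escape-sketch")

# Different bytes, identical decoded text: the round trip is lossy.
collide = from_invalid == from_genuine
print(invalid != genuine, collide)
```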

>>> I thought the error handler would be used for decoding.
>> It's used in both directions: for decoding, it converts \xXX to
>> U+F01XX. For encoding, U+F01XX will trigger an error, which is then
>> handled by the handler to produce \xXX.
> 
> But only for non-UTF8 encodings?

Right. For ease of use, the implementation will specify the error
handler regardless of the encoding, and applications will be advised
to do the same. For utf-8b, the error handler will never be invoked,
since every input can always be converted.

Regards,
Martin