[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Tue Apr 28 22:16:34 CEST 2009


On approximately 4/28/2009 11:55 AM, came the following characters from 
the keyboard of MRAB:
> I've been thinking of "python-escape" only in terms of UTF-8, the only
> encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
> decodable.


UTF-8 is only mentioned in the sense of having special handling for 
re-encoding; all the other locales/encodings are implicit.  But I also 
went down that path to some extent.


> But if you're talking about using it with other encodings, eg
> shift-jisx0213, then I'd suggest the following:
> 
> 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
> half surrogates U+DC00 to U+DCFF.


This makes 256 different escape codes.


> 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
> are treated as though they are undecodable bytes.


This provides escaping for the 256 different escape codes, which is 
lacking from the PEP.


> 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
> are encoded to bytes 0x00 to 0xFF.


This reverses the escaping.


> 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
> be produced by decoding raise an exception.


This is confusing.  Did you mean "excluding" instead of "including"?
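
To make rules 1 through 4 concrete, here is a rough sketch of how I read them, taking rule 4 literally ("including"). The function names are mine, not MRAB's or the PEP's. It assumes a stateless encoding such as UTF-8, and it approximates "can be produced by decoding" by checking whether the single byte is undecodable on its own. Rule 2 is not implemented separately: it relies on Python 3's strict UTF-8 decoder already rejecting encoded surrogates.

```python
def surrogate_decode(data: bytes, encoding: str = "utf-8") -> str:
    """Rule 1: map each undecodable byte 0xPQ to the half surrogate U+DCPQ."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode(encoding))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into data[pos:].
            out.append(data[pos:pos + exc.start].decode(encoding))
            bad = data[pos + exc.start:pos + exc.end]
            out.extend(chr(0xDC00 + b) for b in bad)
            pos += exc.end
    return "".join(out)


def surrogate_encode(text: str, encoding: str = "utf-8") -> bytes:
    """Rules 3 and 4: reverse the escaping, reject anything else."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC00 <= cp <= 0xDCFF:
            raw = bytes([cp - 0xDC00])
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                out += raw  # rule 3: this byte could have been escaped
            else:
                # rule 4: the byte is decodable, so decoding could never
                # have produced this half surrogate
                raise ValueError("U+%04X cannot result from decoding" % cp)
        else:
            # Strict encoding; other lone surrogates raise UnicodeEncodeError.
            out += ch.encode(encoding)
    return bytes(out)
```

Under this reading, surrogate_encode(surrogate_decode(raw)) gives back raw for any byte string, while a str containing U+DC41 is rejected when the encoding is UTF-8, because byte 0x41 is always decodable there.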


> I think I've covered all the possibilities. :-)


You might have.  Seems like there could be a simpler scheme, though...

1. Define an escape codepoint.  It could be U+003F or U+DC00 or U+F817 
or pretty much any defined Unicode codepoint outside the range U+0100 to 
U+01FF (see rule 3 for why).  Only one escape codepoint is needed, which 
is easier for humans to comprehend.

2. When the escape codepoint is decoded from the byte stream for a bytes 
interface or found in a str on the str interface, double it.

3. When an undecodable byte 0xPQ is found, decode to the escape 
codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.

4. When encoding, a sequence of two escape codepoints would be encoded 
as one escape codepoint, and a sequence of the escape codepoint followed 
by codepoint U+01PQ would be encoded as byte 0xPQ.  An escape codepoint 
followed by anything else (neither another escape codepoint nor a 
codepoint in the range U+0100 to U+01FF) would raise an exception.

5. Provide functions that will perform the same decoding and encoding as 
would be done by the system calls, for both bytes and str interfaces.
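
A minimal sketch of rules 1 through 4 for the bytes interface might look like the following. The names ESC, escape_decode and escape_encode are made up for illustration, U+003F is used as the escape codepoint, and the code assumes a stateless encoding such as UTF-8 where characters can be encoded one at a time. It is not meant as a definitive implementation.

```python
ESC = "?"  # U+003F, used here as the single escape codepoint (rule 1)


def escape_decode(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, doubling ESC (rule 2) and escaping bad bytes (rule 3)."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode(encoding).replace(ESC, ESC + ESC))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into data[pos:].
            good = data[pos:pos + exc.start].decode(encoding)
            out.append(good.replace(ESC, ESC + ESC))
            bad = data[pos + exc.start:pos + exc.end]
            # Undecodable byte 0xPQ becomes ESC followed by U+01PQ.
            out.extend(ESC + chr(0x100 + b) for b in bad)
            pos += exc.end
    return "".join(out)


def escape_encode(text: str, encoding: str = "utf-8") -> bytes:
    """Reverse escape_decode (rule 4); malformed escapes raise ValueError."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESC:
            out += ch.encode(encoding)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("escape codepoint at end of string")
        nxt = text[i + 1]
        if nxt == ESC:
            out += ESC.encode(encoding)      # ESC ESC -> one ESC
        elif 0x100 <= ord(nxt) <= 0x1FF:
            out.append(ord(nxt) - 0x100)     # ESC U+01PQ -> byte 0xPQ
        else:
            raise ValueError("invalid escape sequence at index %d" % i)
        i += 2
    return bytes(out)
```

With these definitions, escape_decode(b"a?b\xff") gives "a??b?" followed by U+01FF, and escape_encode turns that back into the original four bytes.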


This differs from my previous proposal in three ways:

A. Doesn't put a marker at the beginning of the string (which I said 
wasn't necessary even then).

B. Allows for a choice of escape codepoint; the previous proposal 
suggested a specific one.  The final solution would still have only a 
single one, as an implementation choice rather than a user choice.

C. Uses the range U+0100 to U+01FF for the escape codes, rather than 
U+0000 to U+00FF.  This avoids introducing the NULL character and escape 
characters into the decoded str representation, yet still uses 
characters for which glyphs are commonly available, are non-combining, 
and are easily distinguishable one from another.

Rationale:

The use of codepoints with visible glyphs makes the escaped string 
friendlier to display systems, and to people.  I still recommend using 
U+003F as the escape codepoint, or at any rate one with a typically 
visible glyph available.  This avoids what I consider to be an annoyance 
with the PEP: the codepoints it uses are not easily displayed, so 
undecodable names could easily result in long strings of 
indistinguishable substitution characters.

It, like MRAB's proposal, also avoids data puns, which is a major 
problem with the PEP.  I consider this proposal to be easier to 
understand than MRAB's proposal, or the PEP, because of the single 
escape codepoint and the use of visible characters.

This proposal, like my initial one, also decodes and encodes (just the 
escape codes) values on the str interfaces.  This is necessary to avoid 
data puns on systems that provide both types of interfaces.

This proposal could be used for programs that use str values, and 
migrates easily to a solution built around an object that abstracts 
system interfaces that have two forms.


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

