[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tue Apr 28 23:01:44 CEST 2009

Glenn Linderman wrote:
> On approximately 4/28/2009 11:55 AM, came the following characters from 
> the keyboard of MRAB:
>> I've been thinking of "python-escape" only in terms of UTF-8, the only
>> encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
>> decodable.
> 
> 
> UTF-8 is only mentioned in the sense of having special handling for 
> re-encoding; all the other locales/encodings are implicit.  But I also 
> went down that path to some extent.
> 
> 
>> But if you're talking about using it with other encodings, eg
>> shift-jisx0213, then I'd suggest the following:
>>
>> 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
>> half surrogates U+DC00 to U+DCFF.
> 
> 
> This makes 256 different escape codes.
> 
> 
Speaking personally, I wouldn't call them 'escape codes'. I'd use the term
'escape code' to mean a character that changes the interpretation of the
next character(s).

>> 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
>> are treated as though they are undecodable bytes.
> 
> 
> This provides escaping for the 256 different escape codes, which is 
> lacking from the PEP.
> 
> 
>> 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
>> are encoded to bytes 0x00 to 0xFF.
> 
> 
> This reverses the escaping.
> 
> 
>> 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
>> be produced by decoding raise an exception.
> 
> 
> This is confusing.  Did you mean "excluding" instead of "including"?
> 
Perhaps I should've said "Any codepoint which can't be produced by
decoding should raise an exception".

For example, decoding with UTF-8b will never produce U+DC00, therefore
attempting to encode U+DC00 should raise an exception and not produce
0x00.
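
To make that concrete, here is a minimal sketch. It assumes a Python 3 in
which the handler is available under the name "surrogateescape"; that name
is an assumption relative to this thread, which calls the handler
"python-escape". The helper can_encode() is hypothetical and exists only
for the illustration.

```python
# Sketch only: "surrogateescape" is the handler name assumed here; the
# thread itself refers to the handler as "python-escape".

def can_encode(text: str) -> bool:
    """Hypothetical helper: True if text encodes to UTF-8 with the handler."""
    try:
        text.encode("utf-8", "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True

# Rules 1 and 3: an undecodable byte becomes a lone half surrogate
# (U+DC00 + byte value) and encodes back to the same byte.
text = b"abc\xff".decode("utf-8", "surrogateescape")
assert text == "abc\udcff"
assert text.encode("utf-8", "surrogateescape") == b"abc\xff"

# Rule 2: bytes that would have decoded to a half surrogate are themselves
# treated as undecodable, so the original bytes still round-trip.
raw = b"\xed\xb2\x80"  # what a lax UTF-8 encoder would emit for U+DC80
escaped = raw.decode("utf-8", "surrogateescape")
assert escaped.encode("utf-8", "surrogateescape") == raw

# Rule 4: decoding UTF-8 can never produce U+DC00 (byte 0x00 is always
# decodable), so encoding it is an error rather than yielding 0x00.
assert can_encode("\udcff")
assert not can_encode("\udc00")
```

The last two assertions are the point of rule 4: only half surrogates that
decoding could actually have produced are accepted on the way back out.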

> 
>> I think I've covered all the possibilities. :-)
> 
> 
> You might have.  Seems like there could be a simpler scheme, though...
> 
> 1. Define an escape codepoint.  It could be U+003F or U+DC00 or U+F817 
> or pretty much any defined Unicode codepoint outside the range U+0100 to 
> U+01FF (see rule 3 for why).  Only one escape codepoint is needed, which 
> is easier for humans to comprehend.
> 
> 2. When the escape codepoint is decoded from the byte stream for a bytes 
> interface or found in a str on the str interface, double it.
> 
> 3. When an undecodable byte 0xPQ is found, decode to the escape 
> codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
> 
> 4. When encoding, a sequence of two escape codepoints would be encoded 
> as one escape codepoint, and a sequence of the escape codepoint followed 
> by codepoint U+01PQ would be encoded as byte 0xPQ.  Escape codepoints 
> not followed by the escape codepoint, or by a codepoint in the range 
> U+0100 to U+01FF would raise an exception.
> 
> 5. Provide functions that will perform the same decoding and encoding as 
> would be done by the system calls, for both bytes and str interfaces.
> 
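
Rules 1 to 4 of the scheme above can be sketched roughly as follows. This
is an illustration under stated assumptions, not a definitive
implementation: the names ESCAPE, escape_decode and escape_encode are
hypothetical, U+003F is used as the escape codepoint, only the bytes
interface is covered, and undecodable bytes are located by borrowing
Python 3's "surrogateescape" handler, which restricts the sketch to
undecodable bytes in the range 0x80 to 0xFF.

```python
ESCAPE = "?"  # U+003F; the scheme leaves this as an implementation choice


def escape_decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw, doubling ESCAPE and writing byte 0xPQ as ESCAPE + U+01PQ."""
    out = []
    # Assumption: surrogateescape is used only to find the undecodable bytes.
    for ch in raw.decode(encoding, "surrogateescape"):
        code = ord(ch)
        if ch == ESCAPE:
            out.append(ESCAPE * 2)                      # rule 2: double it
        elif 0xDC80 <= code <= 0xDCFF:
            out.append(ESCAPE + chr(code - 0xDC00 + 0x100))  # rule 3
        else:
            out.append(ch)
    return "".join(out)


def escape_encode(text: str, encoding: str = "utf-8") -> bytes:
    """Reverse escape_decode; raise ValueError on a malformed escape (rule 4)."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESCAPE:
            # Per-character encoding is fine for stateless codecs like UTF-8.
            out += ch.encode(encoding)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("escape codepoint at end of string")
        nxt = text[i + 1]
        if nxt == ESCAPE:
            out += ESCAPE.encode(encoding)   # doubled escape -> one escape
        elif 0x100 <= ord(nxt) <= 0x1FF:
            out.append(ord(nxt) - 0x100)     # ESCAPE + U+01PQ -> byte 0xPQ
        else:
            raise ValueError("escape codepoint followed by %r" % nxt)
        i += 2
    return bytes(out)


name = escape_decode(b"a?b\xff")
assert name == "a??b?\u01ff"
assert escape_encode(name) == b"a?b\xff"
```

A genuine U+01FF in a name is not preceded by the escape codepoint, so it
is encoded normally and cannot be confused with an escaped byte.
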
> 
> This differs from my previous proposal in three ways:
> 
> A. Doesn't put a marker at the beginning of the string (which I said 
> wasn't necessary even then).
> 
> B. Allows for a choice of escape codepoint; the previous proposal 
> suggested a specific one.  The final solution will still have only a 
> single one: an implementation choice, not a user choice.
> 
> C. Uses the range U+0100 to U+01FF for the escape codes, rather than 
> U+0000 to U+00FF.  This avoids introducing the NULL character and escape 
> characters into the decoded str representation, yet still uses 
> characters for which glyphs are commonly available, are non-combining, 
> and are easily distinguishable one from another.
> 
> Rationale:
> 
> The use of codepoints with visible glyphs makes the escaped string 
> friendlier to display systems, and to people.  I still recommend using 
> U+003F as the escape codepoint, but certainly one with a typically 
> visible glyph available.  This avoids what I consider to be an annoyance 
> with the PEP, that the codepoints used are not ones that are easily 
> displayed, so undecodable names could easily result in long strings of 
> indistinguishable substitution characters.
> 
Perhaps the escape character should be U+005C. ;-)

> It, like MRAB's proposal, also avoids data puns, which is a major 
> problem with the PEP.  I consider this proposal to be easier to 
> understand than MRAB's proposal, or the PEP, because of the single 
> escape codepoint and the use of visible characters.
> 
> This proposal, like my initial one, also decodes and encodes (just the 
> escape codes) values on the str interfaces.  This is necessary to avoid 
> data puns on systems that provide both types of interfaces.
> 
> This proposal could be used for programs that use str values, and easily 
> migrates to a solution that provides an object that provides an 
> abstraction for system interfaces that have two forms.
>