[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
MRAB
google at mrabarnett.plus.com
Tue Apr 28 23:01:44 CEST 2009
Glenn Linderman wrote:
> On approximately 4/28/2009 11:55 AM, came the following characters from
> the keyboard of MRAB:
>> I've been thinking of "python-escape" only in terms of UTF-8, the only
>> encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
>> decodable.
>
>
> UTF-8 is only mentioned in the sense of having special handling for
> re-encoding; all the other locales/encodings are implicit. But I also
> went down that path to some extent.
>
>
>> But if you're talking about using it with other encodings, eg
>> shift-jisx0213, then I'd suggest the following:
>>
>> 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
>> half surrogates U+DC00 to U+DCFF.
>
>
> This makes 256 different escape codes.
>
>
Speaking personally, I wouldn't call them 'escape codes'. I'd use the
term 'escape code' to mean a character that changes the interpretation
of the next character(s).
>> 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
>> are treated as though they are undecodable bytes.
>
>
> This provides escaping for the 256 different escape codes, which is
> lacking from the PEP.
>
>
>> 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
>> are encoded to bytes 0x00 to 0xFF.
>
>
> This reverses the escaping.
>
>
>> 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
>> be produced by decoding raise an exception.
>
>
> This is confusing. Did you mean "excluding" instead of "including"?
>
Perhaps I should've said "Any codepoint which can't be produced by
decoding should raise an exception".
For example, decoding with UTF-8b will never produce U+DC00, therefore
attempting to encode U+DC00 should raise an exception and not produce
0x00.
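
A minimal sketch of that behaviour, assuming Python 3.1 or later, where
(to the best of my knowledge) the handler discussed here as
"python-escape" is available under the name "surrogateescape". The
helper can_encode is hypothetical and exists only for illustration.

```python
# Assumption: Python 3.1+, where the PEP 383 handler is "surrogateescape".

# Rule 1 (for UTF-8): an undecodable byte 0xPQ decodes to U+DCPQ.
raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8
text = raw.decode("utf-8", "surrogateescape")

# Rule 3: the half surrogate encodes back to the original byte.
round_tripped = text.encode("utf-8", "surrogateescape")


def can_encode(s: str) -> bool:
    """Hypothetical helper: True if s encodes under the handler."""
    try:
        s.encode("utf-8", "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


# Rule 2: bytes that look like a UTF-8 encoding of U+DC80 are not
# decoded to U+DC80; they are treated as undecodable bytes instead.
pun = b"\xed\xb2\x80"
pun_text = pun.decode("utf-8", "surrogateescape")
```

Here text is "caf\udce9" and it encodes back to the original bytes,
whereas can_encode("\udc00") is False: U+DC00 can never come out of
the decoder, so the encoder refuses it (rule 4).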
>
>> I think I've covered all the possibilities. :-)
>
>
> You might have. Seems like there could be a simpler scheme, though...
>
> 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817
> or pretty much any defined Unicode codepoint outside the range U+0100 to
> U+01FF (see rule 3 for why). Only one escape codepoint is needed, which
> is easier for humans to comprehend.
>
> 2. When the escape codepoint is decoded from the byte stream for a bytes
> interface or found in a str on the str interface, double it.
>
> 3. When an undecodable byte 0xPQ is found, decode to the escape
> codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
>
> 4. When encoding, a sequence of two escape codepoints would be encoded
> as one escape codepoint, and a sequence of the escape codepoint followed
> by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints
> not followed by the escape codepoint, or by a codepoint in the range
> U+0100 to U+01FF would raise an exception.
>
> 5. Provide functions that will perform the same decoding and encoding as
> would be done by the system calls, for both bytes and str interfaces.
>
>
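Rules 1 to 4 quoted above can be sketched in Python as follows. This is
only an illustration under stated assumptions: the names ESC,
decode_escaped and encode_escaped are hypothetical, U+003F is picked as
the escape codepoint, only the bytes interface is covered (the str
interface half of rule 2 is left out), and the codec is assumed to
report the position of the first undecodable byte, as UTF-8 does.

```python
ESC = "?"  # rule 1: assumed escape codepoint (U+003F)


def decode_escaped(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw, escaping undecodable bytes instead of failing."""
    out = []
    pos = 0
    while pos < len(raw):
        chunk = raw[pos:]
        try:
            good, bad = chunk.decode(encoding), None
            pos = len(raw)
        except UnicodeDecodeError as err:
            # Everything before err.start decoded cleanly.
            good = chunk[:err.start].decode(encoding)
            bad = chunk[err.start]
            pos += err.start + 1
        # Rule 2: double every escape codepoint that was really decoded.
        out.append(good.replace(ESC, ESC + ESC))
        if bad is not None:
            # Rule 3: byte 0xPQ becomes ESC followed by U+01PQ.
            out.append(ESC + chr(0x100 + bad))
    return "".join(out)


def encode_escaped(text: str, encoding: str = "utf-8") -> bytes:
    """Reverse decode_escaped (rule 4)."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESC:
            out += ch.encode(encoding)
            i += 1
            continue
        nxt = text[i + 1:i + 2]  # empty string if ESC is the last char
        if nxt == ESC:
            out += ESC.encode(encoding)        # doubled escape -> one ESC
        elif nxt and 0x100 <= ord(nxt) <= 0x1FF:
            out.append(ord(nxt) - 0x100)       # ESC + U+01PQ -> byte 0xPQ
        else:
            raise ValueError("dangling escape codepoint at index %d" % i)
        i += 2
    return bytes(out)
```

With these definitions, decode_escaped(b"a?b\xffc") gives
"a??b?\u01ffc", and encode_escaped turns that back into the original
bytes. A lone "?" followed by anything else raises ValueError.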
> This differs from my previous proposal in three ways:
>
> A. Doesn't put a marker at the beginning of the string (which I said
> wasn't necessary even then).
>
> B. Allows for a choice of escape codepoint; the previous proposal
> suggested a specific one. The final solution will still have only a
> single one, chosen by the implementation rather than by the user.
>
> C. Uses the range U+0100 to U+01FF for the escape codes, rather than
> U+0000 to U+00FF. This avoids introducing the NULL character and escape
> characters into the decoded str representation, yet still uses
> characters for which glyphs are commonly available, are non-combining,
> and are easily distinguishable one from another.
>
> Rationale:
>
> The use of codepoints with visible glyphs makes the escaped string
> friendlier to display systems, and to people. I still recommend using
> U+003F as the escape codepoint, but certainly one with a typically
> visible glyph available. This avoids what I consider to be an annoyance
> with the PEP, that the codepoints used are not ones that are easily
> displayed, so undecodable names could easily result in long strings of
> indistinguishable substitution characters.
>
Perhaps the escape character should be U+005C. ;-)
> It, like MRAB's proposal, also avoids data puns, which is a major
> problem with the PEP. I consider this proposal to be easier to
> understand than MRAB's proposal, or the PEP, because of the single
> escape codepoint and the use of visible characters.
>
> This proposal, like my initial one, also decodes and encodes (just the
> escape codes) values on the str interfaces. This is necessary to avoid
> data puns on systems that provide both types of interfaces.
>
> This proposal could be used for programs that use str values, and it
> migrates easily to a solution in which an object provides an
> abstraction for system interfaces that have two forms.
>