[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Wed Apr 29 11:09:09 CEST 2009

Glenn Linderman wrote:

> 
> If there is going to be a required transformation from de novo strings 
> to funny-encoded strings, then why not make one that people can actually 
> see and compare and decode from the displayable form, by using 
> displayable characters instead of lone surrogates?
> 

The problem with your "escape character" scheme is that the meaning of an 
escape sequence is lost when a string is sliced, which is a very common operation.
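
To make the slicing problem concrete, here is a minimal sketch. The "%XX" 
escape form and the escape_decode helper are assumptions made up for this 
illustration, not the scheme proposed in the thread. The "surrogateescape" 
error handler is the one PEP 383 describes. With a multi-character escape, a 
slice can cut the sequence in half; with one lone surrogate per undecodable 
byte, any slice still encodes back to the original bytes.

```python
# Hypothetical displayable escape scheme: an undecodable byte such as 0xE9
# becomes the three visible characters "%E9". (Illustration only.)
def escape_decode(raw: bytes) -> str:
    out = []
    for b in raw:
        out.append(chr(b) if b < 0x80 else "%{:02X}".format(b))
    return "".join(out)

raw = b"caf\xe9.txt"  # not valid UTF-8: 0xE9 is followed by "."

# Escape scheme: slicing can split the three-character escape sequence.
visible = escape_decode(raw)   # 'caf%E9.txt'
piece = visible[:5]            # 'caf%E', no longer a complete escape

# PEP 383 scheme: the bad byte maps to exactly one code point (U+DCE9),
# so every slice remains independently encodable.
surrogate = raw.decode("utf-8", "surrogateescape")
head, tail = surrogate[:4], surrogate[4:]
roundtrip = (head.encode("utf-8", "surrogateescape")
             + tail.encode("utf-8", "surrogateescape"))
```

Here roundtrip equals raw, while piece no longer says which byte it stood for.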

>>
>> I thought half-surrogates were illegal in well formed Unicode. I confess
>> to being weak in this area. By "legitimate" above I meant things like
>> half-surrogates which, like quarks, should not occur alone?
>>   
> 
> "Illegal" just means violating the accepted rules.  In this case, the 
> accepted rules are those enforced by the file system (at the bytes or 
> str API levels), and by Python (for the str manipulations).  None of 
> those rules outlaw lone surrogates.  [...]
> 

Python could just as well *specify* that lone surrogates are illegal, since 
Unicode leaves their meaning undefined. If this rule is respected throughout 
the language, there is no ambiguity. It might be unrealistic on Windows, though.

The rule could even be specified only for strings that represent filesystem 
paths. Sure, they are the same type as other strings, but the programmer usually 
knows whether a given string is intended to be a path or not.
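
As a rough sketch of what such a rule could look like in practice, the check 
below rejects lone surrogates in a string the programmer intends to use as a 
newly created path. The names has_lone_surrogate and check_new_path are 
hypothetical, invented for this example. It relies on the fact that a Python 3 
str is a sequence of code points, so any code point in U+D800..U+DFFF is a lone 
surrogate. The strict UTF-8 codec already refuses to encode such code points.

```python
def has_lone_surrogate(text: str) -> bool:
    """Return True if any code point lies in the surrogate range."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

def check_new_path(path: str) -> str:
    """Hypothetical validator for a string known to be a de novo path."""
    if has_lone_surrogate(path):
        raise ValueError("path contains a lone surrogate: %r" % path)
    return path

# The strict UTF-8 codec rejects lone surrogates on its own.
try:
    "caf\udce9".encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

clean = check_new_path("caf\u00e9.txt")  # accepted: no surrogates
```

Under this assumption, a lone surrogate in a path string could only have come 
from decoding a filename, which is what removes the ambiguity.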

Baptiste