[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Thu Apr 30 17:13:43 EDT 2009

On 30 Apr 2009, at 21:06, Martin v. Löwis wrote:

>>>> How do I get a printable Unicode version of these path strings if
>>>> they contain non-Unicode data?
>>>
>>> Define "printable". One way would be to use a regular expression,
>>> replacing all codes in a certain range with a question mark.
>>
>> What I mean by printable is that the string must be valid unicode
>> that I can print to a UTF-8 console or place as text in a UTF-8
>> web page.
>>
>> I think your PEP gives me a string that will not encode to the
>> valid UTF-8 that the world outside Python expects. Did I get this
>> point wrong?
>
> You are right. However, if your *only* requirement is that it should
> be printable, then this is fairly underspecified. One way to get
> a printable string would be this function
>
> def printable_string(unprintable):
>  return ""

Ha ha! Indeed this works, but I would have to turn enough of the
string into a reasonable hint at the name of the file so that the user
has some chance of knowing what is being reported.
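
A minimal sketch of such a hint, assuming the PEP escapes each
undecodable byte as a lone surrogate in the range U+DC80..U+DCFF. The
function name is borrowed from the joke above, and replacing with a
question mark is just one possible choice:

```python
import re

# Assumption: undecodable bytes 0x80..0xFF show up in the decoded
# string as lone surrogates U+DC80..U+DCFF.
_ESCAPED_BYTE = re.compile(r"[\udc80-\udcff]")


def printable_string(unprintable):
    """Replace every escaped byte with '?', keeping the rest of the name."""
    return _ESCAPED_BYTE.sub("?", unprintable)


# A Latin-1 encoded name ("cafe" with e-acute) seen on a UTF-8 system.
raw_name = b"caf\xe9.txt"
decoded = raw_name.decode("utf-8", "surrogateescape")
hint = printable_string(decoded)
```

Here `hint` is "caf?.txt", which encodes cleanly to UTF-8 and still
tells the user roughly which file is meant.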

>
>
> This will always return a printable version of the input string...
>
>> In our application we are running fedora with the assumption that the
>> filenames are UTF-8. When Windows systems FTP files to our system
>> the files are in CP-1251(?) and not valid UTF-8.
>
> That would be a bug in your FTP server, no? If you want all file names
> to be UTF-8, then your FTP server should arrange for that.

Not a bug; it's the lack of a feature. We use ProFTPd, which has just
implemented what is required. I forget the exact details (they are at
work), but when the FTP client asks the server for its FEAT list, the
server can say to use UTF-8. Supporting that in the server was
apparently non-trivial.

>
>
>> Having an algorithm that says "if it's a string, no problem; if it's
>> bytes, deal with the exceptions" seems simple.
>>
>> How do I do this detection with the PEP proposal?
>> Do I end up using the byte interface and doing the utf-8 decode
>> myself?
>
> No, you should encode using the "strict" error handler, with the
> locale encoding. If the file name encodes successfully, it's correct,
> otherwise, it's broken.

O.k. I understand.
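
A minimal sketch of that check, assuming the file name came from a
"surrogateescape" decode as the PEP proposes. `is_clean_filename` is a
hypothetical helper name, and `sys.getfilesystemencoding()` is used
here as a stand-in for the locale encoding:

```python
import sys


def is_clean_filename(name, encoding=None):
    """Return True if name encodes with the "strict" error handler."""
    if encoding is None:
        # Assumption: the file system encoding matches the locale encoding.
        encoding = sys.getfilesystemencoding()
    try:
        name.encode(encoding, "strict")
    except UnicodeEncodeError:
        # The name contains characters the encoding cannot represent,
        # such as bytes escaped during decoding; treat it as broken.
        return False
    return True


good = b"caf\xc3\xa9.txt".decode("utf-8", "surrogateescape")
broken = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
good_ok = is_clean_filename(good, "utf-8")
broken_ok = is_clean_filename(broken, "utf-8")
```

With these inputs `good_ok` is True and `broken_ok` is False, so only
the broken names need the special handling.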

Barry