[Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue

Wed Oct 1 00:33:50 CEST 2008

On Tue, Sep 30, 2008 at 3:21 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>>> My concern still is that it brings the bytes type into the status of
>>> another character string type, which is really bad, and will require
>>> further modifications to Python for the lifetime of 3.x.
>>
>> I'd like to understand why this is "really bad". I thought it was by
>> design that the str and bytes types behave pretty similarly. You can
>> use both as dict keys.
>
> If they have to behave pretty similarly, they have to be supported in
> all APIs that deal with text.

I don't see how you get from "pretty similarly" to "all APIs". :-)

> For example, people will demand that
> printing bytes should just copy them onto the stream (rather than
> invoking repr()), and writing them onto a text stream should work the
> same way. GUI libraries should support them, as should the XML
> libraries, and so on.
>
> Where will you stop, and tell people that bytes are just not supposed
> to do this or that?

Printing a bytes object already works, and displays its repr(), which
is guaranteed to be pure ASCII (unlike the repr() of a unicode str
object in Py3k). All the others you mention will cause breakage as
they should -- these errors exist to force the programmer to think
about encodings or conversions. I don't see that as a big burden
because the only way there could be bytes here in the first place is
when the user explicitly requested bytes. A program that only ever
passes text strings to the os module is only ever going to get text
strings back.
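
A minimal sketch of the behaviour described above, assuming a current
Python 3. The temporary directory and the file name "a.txt" are made up
for illustration; os.fsencode() is a helper from later Python 3 releases
and is used here only to obtain a bytes path.

```python
import os
import tempfile

data = "caf\u00e9".encode("utf-8")   # bytes the caller explicitly asked for

# print() on a bytes object shows its repr(), which stays within ASCII.
print(data)
assert repr(data).isascii()

# Mixing bytes into text is an error rather than a silent conversion.
try:
    "name: " + data
except TypeError:
    mixing_failed = True
else:
    mixing_failed = False

# The os module answers in the type it was asked in.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    str_names = os.listdir(tmp)                  # str in, str out
    bytes_names = os.listdir(os.fsencode(tmp))   # bytes in, bytes out
```

The TypeError is the point: it forces an explicit decode() or encode()
at the place where text and bytes meet.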

>>> This is because applications will then regularly use byte strings for
>>> file names on Unix, and regular strings on Windows, and then expect
>>> the program to work the same without further modifications.
>>
>> It seems that bytes arguments actually *do* work on Windows -- somehow
>> they get decoded. (Unless Terry's report was from 2.x.)
>
> To a limited degree - see my other message. Don't try to listdir a
> directory with characters outside CP_ACP (it will give you invalid
> file names).

Understood.

>> Actually something like that may not be a bad idea. Ian Bicking's
>> webob supports similar double APIs for getting the request parameters
>> out of a request object; I believe request.GET['x'] is a text object
>> and request.GET_str['x'] is the corresponding uninterpreted bytes
>> sequence. I would prefer to have os.environb over os.environ[b"PATH"]
>> though.
>
> And would you keep them synchronized?

Yes, the bytes version would be the canonical one and the str
version would wrap around that -- though updating the str version
would also update the bytes version. Some keys would be missing from
the str version (or perhaps they would raise exceptions or default to
some other error handler, like ignore or replace).
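
As a rough sketch of that arrangement (not an actual os module
implementation): StrEnvironView is a hypothetical class, the UTF-8
default is an assumption, and undecodable entries are simply treated as
missing, which is only one of the options mentioned above.

```python
from collections.abc import MutableMapping


class StrEnvironView(MutableMapping):
    """Hypothetical str view over a canonical bytes mapping."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw              # the bytes mapping is the source of truth
        self._encoding = encoding    # assumed encoding, for illustration only

    def __getitem__(self, key):
        value = self._raw[key.encode(self._encoding)]
        try:
            return value.decode(self._encoding)
        except UnicodeDecodeError:
            # Undecodable values look like missing keys in the str view.
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        # Writing through the str view updates the bytes mapping as well.
        self._raw[key.encode(self._encoding)] = value.encode(self._encoding)

    def __delitem__(self, key):
        del self._raw[key.encode(self._encoding)]

    def __iter__(self):
        for raw_key, raw_value in self._raw.items():
            try:
                key = raw_key.decode(self._encoding)
                raw_value.decode(self._encoding)
            except UnicodeDecodeError:
                continue             # skip entries that cannot be decoded
            yield key

    def __len__(self):
        return sum(1 for _ in self)


raw_env = {b"PATH": b"/usr/bin", b"BAD": b"\xff\xfe"}
env = StrEnvironView(raw_env)
env["HOME"] = "/home/guest"          # also lands in raw_env as bytes
```

Swapping the KeyError for errors="replace" or errors="ignore" in the
decode() call would give the other error-handler variants.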

>> I assume at some point we can stop and have sufficiently low-level
>> interfaces that everyone can agree are in bytes only. Bytes aren't
>> going away. How does Java deal with this? Its File class doesn't seem
>> to deal in bytes at all. What would its listFiles() method do with
>> undecodable filenames?
>
> Apparently (JDK 1.5.0_16, on Linux), it decodes undecodable bytes/byte
> sequences as U+FFFD (REPLACEMENT CHARACTER). Opening such a file will
> fail with FileNotFoundException.
>
> IOW, Java hasn't solved the problem in the last 10 years. Marcin
> Kowalczyk did a more thorough analysis about a year ago in
>
> http://mail.python.org/pipermail/python-3000/2007-September/010450.html

I can't say I like the Java solution. I would like to be able to write
a robust backup tool in Python, even if the code needed to make it
work everywhere isn't going to win any prizes (due to the need to use
bytes on Unix, str on Windows).
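
A minimal sketch of what that platform split might look like, assuming
the os.fsencode() and os.fsdecode() helpers that later Python 3 releases
provide (they did not exist when this was written). native_path() and
walk_files() are hypothetical names, and the temporary tree stands in
for a real backup source.

```python
import os
import sys
import tempfile


def native_path(path):
    """Return path in the type assumed to round-trip on this platform."""
    if sys.platform == "win32":
        return os.fsdecode(path)   # str on Windows
    return os.fsencode(path)       # bytes on Unix, so no name is ever decoded


def walk_files(root):
    """Yield every file below root, in the platform's native path type."""
    for dirpath, _dirnames, filenames in os.walk(native_path(root)):
        for name in filenames:
            yield os.path.join(dirpath, name)


# Usage: list the files of a throwaway tree.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "data.bin"), "wb").close()
    found = list(walk_files(tmp))
```

os.walk() returns names in the same type as its argument, so the rest of
the tool only has to avoid mixing the two types.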

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)