[Python-Dev] PEP 383 and GUI libraries

Fri May 1 07:44:36 CEST 2009

Folks:

My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
binary names from the filesystem and store them so that I can regenerate
the same byte string later, but it also requires that I *know* whether
what I got was a valid string in the expected encoding (which might be
utf-8) or whether it was not and I need to fall back to storing the
bytes.  So far, it looks like PEP 383 doesn't provide both of these
requirements, so I am going to have to continue working-around the
Python API even after PEP 383.  In fact, it might actually increase the
amount of working-around that I have to do.

If I understand correctly, .decode(encoding, 'strict') will not be
changed by PEP 383.  A new error handler is added, so .decode('utf-8',
'python-escape') performs the utf-8b decoding.  Am I right so far?
Therefore if I have a string of bytes, I can attempt to decode it with
'strict', and if that fails I can set the flag showing that it was not a
valid byte string in the expected encoding, and then I can invoke
.decode('utf-8', 'python-escape') on it.  So far, so good.

(Note that I never want to do .decode(expected_encoding,
'python-escape') -- if it wasn't a valid bytestring in the
expected_encoding, then I want to decode it with utf-8b, regardless of
what the expected encoding was.)

Anyway, I can use it like this:

class FName:
    def __init__(self, name, failed_decode=False):
        self.name = name
        self.failed_decode = failed_decode

def fs_to_unicode(bytes):
    try:
        return FName(bytes.decode(sys.getfilesystemencoding(), 'strict'))
    except UnicodeDecodeError:
        return FName(fn.decode('utf-8', 'python-escape'), failed_decode=True)

And what about unicode-oriented APIs such as os.listdir()?  Uh-oh, the
PEP says that on systems with locale 'utf-8', it will automatically be
changed to 'utf-8b'.  This means I can't reliably find out whether the
entries in the directory *were* named with valid encodings in utf-8?
That's not acceptable for my use case.  I would have to refrain from
using the unicode-oriented os.listdir() on POSIX, and instead do
something like this:

if platform.system() in ('Windows', 'Darwin'):
    def listdir(d):
        return [FName(n) for n in os.listdir(d)]
elif platform.system() in ('Linux', 'SunOs'):
    def listdir(d):
        bytesd = d.encode(sys.getfilesystemencoding())
        return [fs_to_unicode(n) for n in os.listdir(bytesd)]
else:
    raise NotImplementedError("Please classify platform.system() == %s \
as either unicode-safe or unicode-unsafe." % platform.system())

In fact, if 'utf-8' gets automatically converted to 'utf-8b' when
*decoding* as well as encoding, then I would have to change my
fs_to_unicode() function to check for that and make sure to use strict
utf-8 in the first attempt:

def fs_to_unicode(bytes):
    fse = sys.getfilesystemencoding()
    if fse == 'utf-8b':
        fse = 'utf-8'
    try:
        return FName(bytes.decode(fse, 'strict'))
    except UnicodeDecodeError:
        return FName(fn.decode('utf-8', 'python-escape'),
                     failed_decode=True)

Would it be possible for Python unicode objects to have a flag
indicating whether the 'python-escape' error handler was present?  That
would serve the same purpose as my "failed_decode" flag above, and would
basically allow me to use the Python APIs directory and make all this
work-around code disappear.

Failing that, I can't see any way to use the os.listdir() in its
unicode-oriented mode to satisfy Tahoe's requirements.

If you take the above code and then add the fact that you want to use
the failed_decode flag when *encoding* the d argument to os.listdir(),
then you get this code: [2].

Oh, I just realized that I *could* use the PEP 383 os.listdir(), like
this:

def listdir(d):
    fse = sys.getfilesystemencoding()
    if fse == 'utf-8b':
        fse = 'utf-8'
    ns = []
    for fn in os.listdir(d):
        bytes = fn.encode(fse, 'python-escape')
        try:
            ns.append(FName(bytes.decode(fse, 'strict')))
        except UnicodeDecodeError:
            ns.append(FName(fn.decode('utf-8', 'python-escape'),
                      failed_decode=True))
    return ns

(And I guess I could define listdir() like this only on the
non-unicode-safe platforms, as above.)

However, that strikes me as even more horrible than the previous
"listdir()" work-around, in part because it means decoding, re-encoding,
and re-decoding every name, so I think I would stick with the previous
version.

Oh, one more note: for Tahoe's purposes you can, in all of the code
above, replace ".decode('utf-8', 'python-replace')" with
".decode('windows-1252')" and it works just as well.  While UTF-8b seems
like a really cool hack, and it would produce more legible results if
utf-8-encoded strings were partially corrupted, I guess I should just
use 'windows-1252' which is already implemented in Python 2 (as well as
in all other software in the world).

I guess this means that PEP 383, which I have approved of and liked so
far in this discussion, would actually not help Tahoe at all and would
in fact harm Tahoe -- I would have to remember to detect and work-around
the automatic 'utf-8b' filesystem encoding when porting Tahoe to Python
3.

If anyone else has a concrete, real use case which would be helped by
PEP 383, I would like to hear about it.  Perhaps Tahoe can learn
something from it.

Oh, if this PEP could be extended to add a flag to each unicode object
indicating whether it was created with the python-escape handler or not,
then it would be useful to me.

Regards,

Zooko

[1] http://mail.python.org/pipermail/python-dev/2009-April/089020.html
[2] http://allmydata.org/trac/tahoe/attachment/ticket/534/fsencode.3.py