[Python-Dev] PEP 383 update: utf8b is now the error handler

Tue May 5 17:18:29 CEST 2009

On Tue, May 5, 2009 at 8:57 AM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>
> 2.  The specification should state, and the discussion emphasize, that
>    strings which were produced by surrogate replacement *must not* be
>    used in data interchange with systems that do not specifically
>    accept such strings, and that this is the responsibility of the
>    application.[2]

That sounds like a useful statement to make.  How would an application
make sure that it was producing only valid Unicode?  How about adding
an option to os.listdir() named "errors", with default value 'utf8b'
(or 'surrogate-replace', or whatever the name is)?  Then applications
which need to produce only valid Unicode strings could pass
errors='strict', errors='ignore', or errors='replace'.  (If anyone really
wants behavior like Python 3.0, then we could perhaps also add a new
one just for os.listdir() named errors='skipfilename'.)
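
For the "valid Unicode only" question, here is a rough sketch of the kind of check an application could do before handing a filename to another system. It assumes Python 3, where the handler discussed in this thread is registered under the name 'surrogateescape'; the helper name is_interchange_safe is made up for illustration and is not part of any proposal.

```python
def is_interchange_safe(name):
    """Return True if `name` contains no lone surrogates.

    Strings produced by surrogate replacement carry lone surrogate
    code points, which a strict UTF-8 encoder refuses to encode.
    (Hypothetical helper, for illustration only.)
    """
    try:
        name.encode("utf-8")  # default error handling is 'strict'
    except UnicodeEncodeError:
        return False
    return True


# A Latin-1 encoded name is not valid UTF-8, so decoding it with
# 'surrogateescape' smuggles the bad byte through as a lone surrogate.
smuggled = b"caf\xe9".decode("utf-8", "surrogateescape")
clean = b"caf\xc3\xa9".decode("utf-8", "surrogateescape")
```

With these values, is_interchange_safe(clean) is true and is_interchange_safe(smuggled) is false, so the application can decide per filename whether to reject, replace, or label it.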

My most recent plan for Tahoe, as of the letter that I sent last
night, is to emulate the behavior of Nautilus and GNU ls by using the
'replace' error handler and (emulating Nautilus) to append " (invalid
encoding)" to the end of the string.  (screenshot:
http://zooko.com/Nautilus_vs_invalid_encoding.png )

So if I could ask os.listdir() to return filenames with U+FFFD in place
of each undecodable byte sequence, then I could subsequently do
something like:

for f in os.listdir(d, errors='replace'):  # 'errors' is the option proposed above
    if u"\ufffd" in f:
        f += u" (invalid encoding)"

(On top of that I would have to check for collisions, but that's out of scope.)
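
Without the proposed "errors" option, roughly the same effect can be had by listing the directory as bytes and decoding each name by hand. The following is only a sketch under some assumptions: Python 3, filenames nominally in UTF-8, and two made-up helper names (display_name and list_display_names). It does not do the collision check.

```python
import os

MARKER = " (invalid encoding)"


def display_name(raw, encoding="utf-8"):
    """Decode one raw filename, labelling it if any bytes were undecodable."""
    name = raw.decode(encoding, errors="replace")
    # 'replace' substitutes U+FFFD for every undecodable byte sequence.
    if "\ufffd" in name:
        name += MARKER
    return name


def list_display_names(directory, encoding="utf-8"):
    """List `directory` as bytes and return human-readable names."""
    # Passing a bytes path makes os.listdir() return bytes filenames.
    raw_names = os.listdir(os.fsencode(directory))
    return [display_name(raw, encoding) for raw in raw_names]
```

One caveat: a filename that legitimately contains U+FFFD would get the label too, since the check cannot tell a replaced byte from a genuine replacement character.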

Regards,

Zooko