[Python-Dev] PEP 383 (again)

Tue Apr 28 08:59:19 CEST 2009

> PEP-383 attempts to represent non-UTF-8 byte sequences in Unicode
> strings in a reversible way.

That isn't really true; it is not, inherently, about UTF-8.
Instead, it tries to represent byte sequences that are invalid in
the file system encoding, whatever that encoding is, in Unicode
strings in a reversible way.
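
As a minimal sketch of that reversibility, assuming the error handler
name "surrogateescape" under which the PEP's scheme is available in
Python 3: the same bytes are decoded with two different codecs, and both
round-trip. The sample file name is made up for illustration.

```python
# Bytes that are invalid in both UTF-8 and ASCII (0xE9 is 'é' in Latin-1).
raw = b"caf\xe9.txt"

escaped = {}
for encoding in ("utf-8", "ascii"):
    # Each undecodable byte 0xNN becomes the lone surrogate U+DC00 + 0xNN.
    name = raw.decode(encoding, errors="surrogateescape")
    # Encoding with the same handler restores the original bytes.
    assert name.encode(encoding, errors="surrogateescape") == raw
    escaped[encoding] = name

# ascii() keeps the output printable even though the strings hold
# lone surrogates.
print(ascii(escaped))
```

The escaping is a property of the error handler, not of any particular
codec, which is why the scheme is not tied to UTF-8.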

> Quietly escaping a bad UTF-8 encoding with private Unicode characters is
> unlikely to be the right thing

And indeed, the PEP stopped using PUA characters; it now escapes
undecodable bytes as lone surrogates instead.
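
A small sketch of what that means in practice, again assuming the
"surrogateescape" handler name from Python 3 and a made-up file name:
the escape character falls in the low surrogate range rather than the
private use area, and a strict encoder refuses the string instead of
quietly passing it along.

```python
name = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")
marker = name[3]  # the character standing in for byte 0xE9

# The escape lands in U+DC80..U+DCFF, not in the BMP private use area.
in_low_surrogates = 0xDC80 <= ord(marker) <= 0xDCFF
in_bmp_pua = 0xE000 <= ord(marker) <= 0xF8FF

# A strict UTF-8 encoder rejects lone surrogates, so the escaped
# string cannot leak into well-formed UTF-8 output unnoticed.
try:
    name.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```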

> Therefore, when Python encounters path names on a file system
> that are not consistent with the (assumed) encoding for that file
> system, Python should raise an error. 

This is what happens currently, and users are quite unhappy about it.

> If you really don't care what the string looks like and you just want an
> encoding that round-trips without loss, you can probably just set your
> encoding to one of the 8 bit encodings, like ISO 8859-15.   Decoding
> arbitrary byte sequences to unicode strings as ISO 8859-15 is no less
> correct than decoding them as the proposed "utf-8b".  In fact, the most
> likely source of non-UTF-8 sequences is ISO 8859 encodings.

Yes, users can do that (to a degree), but they are still unhappy about
it. The approach actually fails for command line arguments.
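
To illustrate only the "to a degree" part (this sketch does not try to
reproduce the command line case): an 8-bit codec round-trips every byte,
but a name that was really UTF-8 comes out as the wrong text. The file
name is a made-up example.

```python
# Every byte value decodes under ISO 8859-15, so the round trip is lossless.
all_bytes = bytes(range(256))
lossless = all_bytes.decode("iso8859-15").encode("iso8859-15") == all_bytes

# The decoded text is wrong whenever the bytes were really UTF-8:
# the two bytes of 'é' show up as two unrelated characters.
utf8_name = "caf\u00e9.txt".encode("utf-8")   # b"caf\xc3\xa9.txt"
shown_as = utf8_name.decode("iso8859-15")

print(lossless, ascii(shown_as))
```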

> As for what the byte-oriented interfaces should do, they are simply
> platform dependent.  On UNIX, they should do the obvious thing.  On
> Windows, they can either hook up to the low-level byte-oriented system
> calls that the systems supply, or Windows could fake it and have the
> byte-oriented interfaces use UTF-8 encodings always and reject non-UTF-8
> sequences as illegal (there are already many illegal byte sequences
> anyway).

As is, these interfaces are incomplete - they don't support command
line arguments or environment variables. If you want to complete them,
you should write a PEP.
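
For comparison, a sketch of how the string interfaces can still yield the
original bytes under the PEP's approach. It assumes os.fsdecode() and
os.fsencode(), which only appeared in later Python 3 releases and are not
something this message describes; argv_as_bytes is a hypothetical helper
name, and the round trip shown is only expected to hold on POSIX systems.

```python
import os
import sys


def argv_as_bytes(argv):
    """Recover the byte form of str command line arguments.

    Hypothetical helper: relies on os.fsencode() undoing the escaping
    that was applied when the arguments were decoded.
    """
    return [os.fsencode(arg) for arg in argv]


if __name__ == "__main__":
    # Arguments that were not valid in the locale encoding come back
    # as the bytes the shell originally passed.
    print(argv_as_bytes(sys.argv[1:]))
```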

Regards,
Martin