[Python-Dev] PEP 383 (again)
"Martin v. Löwis"
martin at v.loewis.de
Tue Apr 28 08:59:19 CEST 2009
> PEP-383 attempts to represent non-UTF-8 byte sequences in Unicode
> strings in a reversible way.
That isn't really true; it is not, inherently, about UTF-8.
Instead, it tries to represent non-filesystem-encoding byte sequences
in Unicode strings in a reversible way.
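
To make the reversibility concrete, here is a minimal sketch. It assumes the error handler name "surrogateescape", under which the PEP's scheme later shipped; the helper `roundtrip` and the sample file name are hypothetical. The same bytes survive a round trip under two different assumed encodings, which is the sense in which the scheme is not tied to UTF-8.

```python
def roundtrip(raw, encoding):
    """Decode raw bytes with surrogate escaping, then encode them back."""
    text = raw.decode(encoding, errors="surrogateescape")
    return text, text.encode(encoding, errors="surrogateescape")

# A Latin-1 style file name: 0xE9 is valid in neither UTF-8 (here) nor ASCII.
raw = b"caf\xe9.txt"

results = {}
for enc in ("utf-8", "ascii"):
    text, back = roundtrip(raw, enc)
    results[enc] = (text, back)
    # The undecodable byte becomes a lone surrogate; the bytes are recovered.
    assert back == raw
```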
> Quietly escaping a bad UTF-8 encoding with private Unicode characters is
> unlikely to be the right thing
And indeed, the PEP stopped using PUA characters.
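
As a rough check of that point, the sketch below looks at what an escaped byte decodes to. It assumes the "surrogateescape" handler as it exists in current Python; the sample byte is arbitrary. The resulting code point is a lone low surrogate (general category Cs), not a private-use character (category Co).

```python
import unicodedata

# Decode one byte that is invalid on its own in UTF-8.
escaped = b"\xe9".decode("utf-8", errors="surrogateescape")

code_point = ord(escaped)
category = unicodedata.category(escaped)

# For comparison: U+E000 is the first code point of the BMP private use area.
pua_category = unicodedata.category("\ue000")
```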
> Therefore, when Python encounters path names on a file system
> that are not consistent with the (assumed) encoding for that file
> system, Python should raise an error.
This is what happens currently, and users are quite unhappy about it.
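
For illustration only, this is what strict decoding looks like in general. It is a sketch of the "raise an error" policy, not a reproduction of any particular API's behaviour at the time, and the file name is made up.

```python
raw = b"caf\xe9.txt"  # not valid UTF-8

try:
    raw.decode("utf-8")  # strict is the default error handler
    failed = False
except UnicodeDecodeError as exc:
    # The exception reports where decoding stopped.
    failed = True
    bad_offset = exc.start
```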
> If you really don't care what the string looks like and you just want an
> encoding that round-trips without loss, you can probably just set your
> encoding to one of the 8 bit encodings, like ISO 8859-15. Decoding
> arbitrary byte sequences to unicode strings as ISO 8859-15 is no less
> correct than decoding them as the proposed "utf-8b". In fact, the most
> likely source of non-UTF-8 sequences is ISO 8859 encodings.
Yes, users can do that (to a degree), but they are still unhappy about
it. The approach actually fails for command line arguments.
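
A small sketch of the trade-off with an 8-bit encoding, using a made-up file name. Every byte value decodes under ISO 8859-15, so the round trip is lossless, but a name that was really UTF-8 comes out as mojibake, which is one reason users remain unhappy.

```python
# Every one of the 256 byte values round-trips through ISO 8859-15.
all_bytes = bytes(range(256))
lossless = all_bytes.decode("iso8859-15").encode("iso8859-15") == all_bytes

# A name that was actually written as UTF-8 is shown incorrectly.
utf8_name = "na\u00efve.txt".encode("utf-8")
shown = utf8_name.decode("iso8859-15")
```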
> As for what the byte-oriented interfaces should do, they are simply
> platform dependent. On UNIX, they should do the obvious thing. On
> Windows, they can either hook up to the low-level byte-oriented system
> calls that the systems supply, or Windows could fake it and have the
> byte-oriented interfaces use UTF-8 encodings always and reject non-UTF-8
> sequences as illegal (there are already many illegal byte sequences
> anyway).
As is, these interfaces are incomplete - they don't support command
line arguments, or environment variables. If you want to complete them,
you should write a PEP.
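
To show the gap, here is a minimal sketch in current Python. `os.fsencode` is a later addition and is used only for convenience; the temporary file name is arbitrary. Directory listings can be requested as bytes, while `sys.argv` is exposed only as strings.

```python
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    as_text = os.listdir(tmp)                 # str in, str out
    as_bytes = os.listdir(os.fsencode(tmp))   # bytes in, bytes out

# Command line arguments have no bytes counterpart in the sys module.
argv_types = {type(arg) for arg in sys.argv}
```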
Regards,
Martin