[Python-3000] Unicode and OS strings

Fri Sep 14 08:02:56 CEST 2007

"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

 >> This means that a way of handling such points is very useful, and
 >> as long as there's enough PUA space, the approach I suggested can
 >> handle all of these various issues.

 > PUA already has a representation in UTF-8, so this is more incompatible
 > with UTF-8 than needed,

Hm?  It's not incompatible at all, and we're not interested in a
representation in UTF-8, but rather in UTF-16 (i.e., the Python internal
encoding).  And it *is* needed, because these characters by assumption
are not present in Unicode at all.  (More precisely, they may be
present, but the tables we happen to have don't have mappings for
them.)

 > and hijacks characters

No, it doesn't.  As I responded to Greg Ewing, there is an issue about
things like pickling which use Python internal representations, but
not for anything which normally communicates with Python through
codecs.

 > which might be used (for example I'm using some PUA ranges for
 > encoding my script, they are being transported between processes,
 > and I would be upset if some language had mangled them to something
 > else).

Your escaping proposal *guarantees* mangling because it turns
characters into tuples of code units; it does not preserve character
set information.  It only works for you because you only have one
private script you care about, so you know what those code units mean.

If we don't have character set information, then of course that's the
best you can do, and my proposal will do something equivalent.  But if
we *do* have character set information, then my proposal is far more
powerful.  It allows us to process PUA characters as characters (i.e.,
put them in strings, slice and dice, merge and meld) with some hope of
recovering the character's semantics after many transformations of the
containing string.
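
To make that concrete, here is a minimal sketch of the kind of table I
have in mind.  Everything in it is an assumption for illustration: the
class name, its methods, and the charset label "x-vendor-sjis" are
hypothetical, and a real codec would need a policy for sharing and
persisting the table.

```python
# Hypothetical sketch: none of these names come from an existing API.
PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP Private Use Area


class PUARegistry:
    """Assign BMP private-use code points to (charset, raw bytes) pairs."""

    def __init__(self):
        self._to_pua = {}    # (charset, raw bytes) -> code point
        self._from_pua = {}  # code point -> (charset, raw bytes)
        self._next = PUA_START

    def intern(self, charset, raw):
        """Return a one-character string standing for an unmapped character."""
        key = (charset, bytes(raw))
        if key not in self._to_pua:
            if self._next > PUA_END:
                raise OverflowError("private use area exhausted")
            self._to_pua[key] = self._next
            self._from_pua[self._next] = key
            self._next += 1
        return chr(self._to_pua[key])

    def lookup(self, ch):
        """Recover the charset and original bytes behind a PUA character."""
        return self._from_pua[ord(ch)]


registry = PUARegistry()
# "x-vendor-sjis" is a made-up charset label for this example.
gaiji = registry.intern("x-vendor-sjis", b"\xf0\x40")

text = "ab" + gaiji + "cd"
sliced = text[1:4][::-1]  # slice and reverse: the character stays one unit
```

After slicing and reversing, registry.lookup(sliced[1]) still returns
the charset label and the original bytes, which is the "hope of
recovering the character's semantics" I mean.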

In any case, it would not be hard to create an API allowing a Python
program to "reserve" a block in a PUA.  You still have the issue of
collision among multiple applications wanting the same block, of
course.  You may be able to guarantee that will never happen in your
application, but there are examples of OSes that assigned characters
in the PUA (Mac OS and Microsoft Windows both did so at one time or
another, although they may no longer do so; I haven't checked).
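
A reservation API along those lines might look roughly like the sketch
below.  It is purely hypothetical (PUAAllocator, reserve, and
PUACollision are names I made up), and it only detects collisions
inside one process; it does nothing about other applications or an OS
claiming the same block.

```python
# Hypothetical API sketch; neither the classes nor the method exist in Python.
BMP_PUA = range(0xE000, 0xF900)  # U+E000..U+F8FF


class PUACollision(ValueError):
    """Raised when a requested block overlaps one already reserved."""


class PUAAllocator:
    def __init__(self):
        self._blocks = {}  # owner name -> range of reserved code points

    def reserve(self, owner, first, size):
        """Reserve `size` code points starting at `first` for `owner`."""
        if size <= 0 or first not in BMP_PUA or (first + size - 1) not in BMP_PUA:
            raise ValueError("block must lie inside U+E000..U+F8FF")
        for other, taken in self._blocks.items():
            # Two half-open ranges overlap iff each starts before the other ends.
            if first < taken.stop and taken.start < first + size:
                raise PUACollision(f"{owner} collides with {other}")
        block = range(first, first + size)
        self._blocks[owner] = block
        return block


allocator = PUAAllocator()
mine = allocator.reserve("my-script", 0xE000, 0x100)

try:
    allocator.reserve("other-app", 0xE080, 0x10)  # overlaps "my-script"
    collided = False
except PUACollision:
    collided = True
```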

 > While U+0000 is also representable in UTF-8, it cannot occur in
 > filenames, program arguments, environment variables etc., in many
 > contexts it was free.

In your experience, and mine, but is it in POSIX?  If not, I'd rather
not add the restriction, no matter how harmless it seems in practice.
(Of course practicality beats purity, but your proposal has many other
defects, too.)

I'm also very bothered by the fact that the interpretation of U+0000
differs between contexts in your proposal.  As I'm sure you know,
mixing codecs that differ in semantics (specifically, in their
treatment of particular code units) gets very hairy.  Once you get a
string into Python, you normally no longer know where it came from,
but now whether something came from a program argument, the
environment, or a stdio stream changes the semantics of U+0000.
For me personally, that's a very good reason to object to your
proposal.

 > Of course my escaping scheme can preserve \0 too, by escaping it to
 > U+0000 U+0000, but here it's incompatible with the real UTF-8.

No.  It's *never* compatible with UTF-8 because it assigns a different
meaning to U+0000 from ASCII NUL.
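
For reference, this is the baseline behaviour of the standard codecs
for the code points under discussion.  It is only a check of stock
UTF-8 and ASCII in Python, not an implementation of either proposal.

```python
# In standard UTF-8, U+0000 is the single byte 0x00, identical to ASCII NUL.
nul_utf8 = "\u0000".encode("utf-8")
nul_ascii = "\u0000".encode("ascii")

# A PUA code point such as U+E000 has an ordinary three-byte UTF-8 form.
pua_utf8 = "\ue000".encode("utf-8")

# Decoding the NUL byte gives back U+0000 and nothing else.
nul_roundtrip = b"\x00".decode("utf-8")
```

Any scheme that gives U+0000 a role other than NUL therefore departs
from what these codecs produce.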

Your scheme also suffers from the practical problem that strings
containing escapes are no longer arrays of characters.  One effect of
my scheme is to extend the "string is array" model to any application
that doesn't need to treat more non-BMP characters than there is space
available in the PUA.  Once implemented, it could easily be adapted to
handle characters in Planes 1-16, thus avoiding any use of surrogates
in the vast majority of cases.
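
A small sketch of the point about surrogates, under some assumptions:
it counts UTF-16 code units by encoding explicitly rather than by
relying on the interpreter's internal representation, and the
one-entry remapping table is invented for the example.

```python
def utf16_units(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


astral = "\U00010400"  # a Plane 1 character; needs a surrogate pair in UTF-16
pua = "\ue000"         # a BMP private-use character; one code unit

units_astral = utf16_units(astral)
units_pua = utf16_units(pua)

# Hypothetical table standing in each non-BMP character with a PUA slot.
to_pua = {ord(astral): pua}

original = "a" + astral + "b"
remapped = original.translate(to_pua)

# With the remapping, one character is again exactly one code unit.
units_original = utf16_units(original)
units_remapped = utf16_units(remapped)
```

Here the original string takes four code units for three characters,
while the remapped one takes three, so indexing by code unit and
indexing by character agree again.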