[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Tue Sep 18 22:36:41 CEST 2007


>>>>> "Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

 >> > This is wrong: UTF-8 is specified for PUA. PUA is not special from
 >> > the point of view of UTF-8.
 >
 >> It is from the point of view of the Unicode standard, specifically v5.
 >> Please see section 16.5, especially about the "corporate use subarea".
 >
 > It is not. 16.5 doesn't say anything about UTF-8, and UTF-8 is already
 > specified for PUA.

There's no UTF-8 in Python's internal string encoding.  What are you
talking about?

 >> Sure, and what I propose is entirely compatible with the specification
 >> of UTF-8 as a UTF,
 >
 > It is not. In UTF-8 '\ue650' is b'\xEE\x99\x90', in your proposal it
 > might be encoded as a single byte.

Of course not; the point of the proposal is to ensure that all text
can be round-tripped through Python's internal representation.
Anything that comes in as a character through a codec using my
exception handler will be the same character when output with that
handler.  Again, what are you talking about?
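To make the round-trip concrete, here is a minimal sketch of what such a codec error handler could look like in Python. Everything here is illustrative, not the proposal's actual design: the handler name "pua-roundtrip" and the choice of PUA_BASE = 0xE600 are assumptions for the example.

```python
import codecs

# Illustrative base for mapping undecodable bytes into the Private
# Use Area; the real proposal might choose a different subarea.
PUA_BASE = 0xE600

def pua_roundtrip(exc):
    if isinstance(exc, UnicodeDecodeError):
        # On input: map the undecodable byte to a PUA code point.
        byte = exc.object[exc.start]
        return (chr(PUA_BASE + byte), exc.start + 1)
    if isinstance(exc, UnicodeEncodeError):
        # On output: map the PUA code point back to the original byte.
        cp = ord(exc.object[exc.start])
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            return (bytes([cp - PUA_BASE]), exc.start + 1)
    raise exc

codecs.register_error("pua-roundtrip", pua_roundtrip)

# The undecodable byte 0xFF survives a decode/encode round trip:
s = b"abc\xff".decode("ascii", "pua-roundtrip")   # 0xFF -> U+E6FF
assert s.encode("ascii", "pua-roundtrip") == b"abc\xff"
```

The point of the sketch is exactly the claim above: a byte that comes in through a codec using this handler is preserved as a (PUA) character internally, and comes back out as the same byte when written with the same handler.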

 >> While I'm uncomfortable advocating the position that my proposal is
 >> entirely compatible with C10,
 >
 > It is not. Elements of PUA are characters.

Yes.  Where did I say anything else?

 > It's not the same, but interpreting as characters in PUA is obviously
 > interpreting as characters.

No.  Internally mapping to characters in PUA is mapping.  Unicode does
not try to restrict internal processing, only behavior at process
boundaries.  Interpretation as characters happens only on output.

I do not yet know how to prevent that leakage at process boundaries
(it may even be practically impossible, given the important cases
where the internal representation is exchanged between processes).  If
it can't be prevented while maintaining efficiency, that is a major
flaw (though not necessarily a fatal one, since I'm proposing an
exception handler, not a required feature of Unicode codecs).

 > I meant Python3 where sys.argv is a list of Unicode strings. It should
 > work out of the box.

I really don't think so.  Exposing internal representations as you are
doing here is your problem; it is not something that Python should
attempt to guarantee will work.

More troublesome from your point of view, Guido has stated that the
internal representation used by Python strings is a sequence of
Unicode code units, not characters.  I don't think that's reached the
status of "pronouncement" yet, but you will probably need a PEP to get
the guarantees you want.

 > Why length 6? "\ue650" encoded in UTF-8 has length 3.

MS UTF-8, I suppose.  You see, you simply cannot depend on any
particular Python string being translated to a particular Unicode
representation unless you choose the codec explicitly.  Since you have
to specify that codec to be reliable anyway, I don't see much loss
here except to lazy programmers willing to live dangerously.  But
that's not true of anybody in this thread!  The whole point is to
preserve even broken input for later forensic analysis.
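The point that the byte length depends entirely on the codec chosen is easy to check in Python 3 syntax:

```python
# The same PUA character occupies a different number of bytes in
# each Unicode encoding form.
s = "\ue650"
assert s.encode("utf-8") == b"\xee\x99\x90"   # 3 bytes
assert len(s.encode("utf-16-le")) == 2        # one 16-bit code unit
assert len(s.encode("utf-32-le")) == 4        # one 32-bit code unit
```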

 > For an old discussion about using PUA to represent bytes undecodable
 > as UTF-8, see http://www.mail-archive.com/unicode@unicode.org/ and
 > subthreads with "roundtripping" in the subject.

Which (after a half hour of looking) are mostly irrelevant, because
Mr. Kristan's proposal (I assume that's what you're talking about) as
far as I can see involved standardizing such representations within
Unicode.  We're not talking about that here; we're talking about
representations internal to Python, for the convenience of Python
users.


