[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Sat Sep 15 05:44:05 CEST 2007


Greg Ewing writes:

 > Stephen J. Turnbull wrote:
 > > You chose the context of round-tripping *across
 > > encodings*, not me.  Please stick with your context.
 > 
 > Maybe we have different ideas of what the problem is.  I thought
 > the problem is to take arbitrary byte sequences coming in as
 > command-line args and represent them as unicode strings in such a
 > way that they can be losslessly converted back into the same byte
 > strings.

That's a straw man if taken literally.  Just use the ISO-8859-1 codec,
and you're done.
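
For the literal requirement, a couple of lines suffice, because
ISO-8859-1 maps every byte value 0x00-0xFF to a code point; the bytes
always survive, even if the resulting text is mojibake.  A Py3k-style
sketch, purely for illustration:

    raw = b'\x93\xfa\x96\x7b'                 # arbitrary bytes (happen to be Shift JIS)
    text = raw.decode('iso-8859-1')           # never fails
    assert text.encode('iso-8859-1') == raw   # lossless round trip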

If you add the condition that the encoding is known with certainty and
the source string is well-formed for that encoding, then you need to
decode to meaningful Unicode.  For that problem, James Knight's
solution is good if it makes sense to assume that the sequence of
bytes is encoded in UTF-8.  However, I don't think that is a
reasonable assumption for a language that is heavily used in Europe
and Japan, and for processing email.  These are contexts where UTF-8
is making steady progress, but legacy encodings are still quite
important.
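
A quick illustration of why the UTF-8 assumption falls down for
legacy-encoded input (again a Py3k-style sketch, not from the original
discussion):

    sjis = '日本語'.encode('shift_jis')   # b'\x93\xfa\x96{\x8c\xea'
    sjis.decode('shift_jis')              # fine: gives back '日本語'
    sjis.decode('utf-8')                  # raises UnicodeDecodeError: invalid start byte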

However, the general problem is to decode a sequence of bytes into a
Unicode string and be able to recover the original sequence if you
decide you got it wrong, even after you've sliced and concatenated the
string with other strings, with no guarantee that all the source
encodings were the same.
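
(For concreteness: the 'surrogateescape' error handler, which Python 3
eventually grew for exactly this kind of thing, maps each undecodable
byte to a lone surrogate and so gives you the round trip, but it
records nothing about which encoding the cruft came from, and that is
precisely the missing piece.  A sketch, not my proposal:

    raw = b'caf\xe9 \x93\xfa'                      # undecodable legacy bytes
    s = raw.decode('utf-8', 'surrogateescape')     # bad bytes become lone surrogates
    t = 'prefix: ' + s[:3] + s[3:]                 # slicing and concatenation are safe
    assert t.encode('utf-8', 'surrogateescape') == b'prefix: ' + raw
)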

 > I was just pointing out that if you do this in a way that involves
 > some sort of dynamically generated mapping, then it won't work if
 > the round trip spans more than one Python session -- and that there
 > are any number of ways that the data could get from one session to
 > another, many of them not involving anything that one would
 > recognise as a unicode encoding in the conventional sense.

But it also won't work if you just pass around strings that are
invertible to byte sequences, *because recipients don't know which
byte sequence to invert them to*.  Is that cruft corrupt EUC-JP or
corrupt Shift JIS or corrupt UTF-8?  Or maybe it is a perfectly valid
character, one that even exists in Unicode, but is missing from the
mapping table for the source encoding (this happens in Japanese all
the time)?  You're likely to
make different guesses about what was intended by a specific sequence
of byte cruft for different original encodings.

What I'm suggesting is to provide a way for processes to record and
communicate that information without needing a "source encoding" slot
on strings, one that can handle strings containing unrecognized
(including corrupt) characters from multiple source encodings.  True,
it will be up to the applications to communicate that information, but
it already is, anyway.
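
To make that concrete, here is a minimal sketch of the kind of
bookkeeping I have in mind (the names and details are purely
illustrative, not a worked-out design): undecodable bytes are replaced
by Private Use code points, and a side table, which applications can
pass around alongside the string, remembers the original byte and the
encoding it was presumed to come from:

    PUA_START = 0xE000

    def decode_with_escapes(raw, encoding, table):
        """Decode raw bytes, escaping each undecodable byte into the PUA.

        Every bad byte gets a fresh PUA code point; `table` maps that
        code point back to (original byte, presumed source encoding).
        """
        out, i = [], 0
        while i < len(raw):
            try:
                out.append(raw[i:].decode(encoding))    # rest is clean
                break
            except UnicodeDecodeError as e:
                out.append(raw[i:i + e.start].decode(encoding))
                escape = chr(PUA_START + len(table))
                table[escape] = (raw[i + e.start:i + e.start + 1], encoding)
                out.append(escape)
                i += e.start + 1                        # skip the bad byte
        return ''.join(out)

    def encode_with_escapes(text, encoding, table):
        """Re-encode, restoring the original bytes recorded in `table`."""
        return b''.join(table[ch][0] if ch in table else ch.encode(encoding)
                        for ch in text)

    table = {}
    s1 = decode_with_escapes(b'\x93\xfa broken \xff', 'utf-8', table)
    s2 = decode_with_escapes(b'caf\xe9', 'utf-8', table)
    mixed = s1 + ' / ' + s2     # pieces from different sources mix freely
    assert (encode_with_escapes(mixed, 'utf-8', table)
            == b'\x93\xfa broken \xff / caf\xe9')

A real implementation would have to avoid colliding with genuine
Private Use characters, share slots for repeated bytes, and so on; the
point is only that the recovery information travels outside the string
itself.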

Furthermore, the same algorithms can be used to "fold" any text that
contains only BMP characters plus no more than 6400 distinct non-BMP
characters into the BMP, which would be a nice feature for people
wanting to avoid the UTF-16 surrogates for some reason.
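
The folding side is equally simple to sketch; 6400 is just the size of
the BMP Private Use Area, U+E000..U+F8FF (again, toy code, not a
design):

    def fold_to_bmp(text, table):
        """Replace each distinct non-BMP character by a PUA code point."""
        out = []
        for ch in text:
            if ord(ch) > 0xFFFF:
                if ch not in table:
                    # more than 6400 distinct non-BMP chars would overflow
                    # the PUA; a real version would also guard against PUA
                    # characters already present in the input
                    table[ch] = chr(0xE000 + len(table))
                out.append(table[ch])
            else:
                out.append(ch)
        return ''.join(out)

    def unfold_from_bmp(text, table):
        reverse = {pua: ch for ch, pua in table.items()}
        return ''.join(reverse.get(ch, ch) for ch in text)

    table = {}
    folded = fold_to_bmp('BMP text plus \U0001D11E and \U00020000', table)
    assert max(map(ord, folded)) <= 0xFFFF
    assert unfold_from_bmp(folded, table) == \
        'BMP text plus \U0001D11E and \U00020000'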

As Martin points out, it may not be possible to implement this without
changing the codecs one by one (I have some hope that it can
nevertheless be done, but haven't looked at the codec framework
closely yet).  I think it would be unfortunate if, in trying to solve
a small subset of these problems (as James and Marcin are doing), we
were to overlook the possibility of a good solution to a whole bunch
of related problems.

