[I18n-sig] Unicode strings: an alternative

Just van Rossum just@letterror.com
Thu, 4 May 2000 08:42:00 +0100


(Thanks for all the comments. I'll condense my replies into one post.)

[JvR]
> - wide strings are stored as if they were narrow strings, simply using two
> bytes for each Unicode character.

[Tom Emerson wrote]
>I disagree with you here... store them as UTF-8.

Erm, UTF-8 in a wide string? That makes no sense: a wide string holds
fixed-width characters, while UTF-8 is a variable-width byte encoding.

[Skip Montanaro]
>Presumably, with Just's proposal len() would
>simply return ob_size/width.

Right. And if you allow values for width other than 1 and 2, it opens the
way for UCS-4. Wouldn't that be nice? It's hardly more effort, and "only"
width==1 needs to be special-cased for speed.

>If you used a variable width encoding, Just's plan wouldn't work.

Correct, but neither would the current unicode object. Variable width
encodings are too messy to treat as strings at all: they are only useful as
byte arrays.
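
To make the len() arithmetic in this exchange concrete, here is a minimal
sketch in present-day Python. It is not code from the proposal: the class
name FixedWidthString and its fields are hypothetical, and len(data) merely
stands in for ob_size. It models a string as a raw buffer plus a width, so
the length is the buffer size divided by the width, and then shows why the
same division cannot work for a variable-width encoding such as UTF-8.

```python
class FixedWidthString:
    """Hypothetical model: one string type with a per-object character width."""

    def __init__(self, data, width=1):
        if width not in (1, 2, 4):
            raise ValueError("width must be 1, 2 or 4")
        if len(data) % width:
            raise ValueError("buffer size must be a multiple of width")
        self.data = data    # raw bytes; len(data) stands in for ob_size
        self.width = width  # bytes per character

    def __len__(self):
        # The rule discussed above: ob_size / width.
        return len(self.data) // self.width

    def __getitem__(self, index):
        # Integer indexing only; returns the character's code point.
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("string index out of range")
        start = index * self.width
        return int.from_bytes(self.data[start:start + self.width], "big")


text = "na\u00efve"  # five characters, one outside ASCII

narrow = FixedWidthString(text.encode("latin-1"), width=1)
wide = FixedWidthString(text.encode("utf-16-be"), width=2)
ucs4 = FixedWidthString(text.encode("utf-32-be"), width=4)

# Every fixed-width form reports the same length and the same characters.
assert len(narrow) == len(wide) == len(ucs4) == 5
assert narrow[2] == wide[2] == ucs4[2] == 0xEF

# UTF-8 is variable width: the buffer is six bytes for five characters,
# so no single width makes ob_size / width come out right.
utf8 = text.encode("utf-8")
assert len(utf8) == 6
```

Only the width==1 case would need a fast path; the other widths share the
same indexing arithmetic.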

[GvR]
>This seems to have some nice properties, but I think it would cause
>problems for existing C code that tries to *interpret* the bytes of a
>string: it could very well do the wrong thing for wide strings (since
>old C code doesn't check for the "wide" flag).  I'm not sure how much
>C code there is that merely passes strings along...  Most C code using
>strings makes use of the strings (e.g. open() falls in this category
>in my eyes).

There are probably many cases that fall into this category. But then again,
these cases, especially those that could potentially deal with encodings
other than ASCII, are not helped much by a default encoding, as /F showed.

My idea arose after yesterday's discussions. Some quotes, plus comments:

[GvR]
>However the problem is that print *always* first converts the object
>using str(), and str() enforces that the result is an 8-bit string.
>I'm afraid that loosening this will break too much code.  (This all
>really happens at the C level.)

Guido goes on to explain that this means UTF-8 is the only sensible default
in this case. Good reasoning, but I think it's backwards:
- str(unicodestring) should just return unicodestring
- it is important that stdout receives the original unicode object.
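
As a rough illustration of these two points, here is a sketch in present-day
Python, where str() already returns a text string unchanged. EncodingStream
is a hypothetical stand-in for stdout, not an existing class. If the stream
receives the original object, the stream itself can decide how to encode it,
and no default encoding is needed at the str() step.

```python
import io


class EncodingStream:
    """Hypothetical stdout stand-in that chooses its own output encoding."""

    def __init__(self, raw, encoding):
        self.raw = raw            # underlying byte sink
        self.encoding = encoding  # encoding appropriate for this device

    def write(self, text):
        # Encoding happens here, at the edge, not inside str().
        self.raw.write(text.encode(self.encoding))


original = "caf\u00e9"

buf = io.BytesIO()
out = EncodingStream(buf, "latin-1")

# str() passes the string through unchanged, so the stream sees the
# original characters and applies its own encoding.
out.write(str(original))

assert str(original) == original
assert buf.getvalue() == b"caf\xe9"
```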

[MAL]
>BTW, __str__() has to return strings too. Perhaps we
>need __unicode__() and a corresponding slot function too ?!

This also seems backwards. If it's really too hard to change Python so that
__str__ can return unicode objects, my solution may help.

[Ka-Ping Yee]
>Here is an addendum that might actually make that proposal
>feasible enough (compatibility-wise) to fly in the short term:
>
>    print x
>
>does, conceptually:
>
>    try:
>        sys.stdout.printout(x)
>    except AttributeError:
>        sys.stdout.write(str(x))
>        sys.stdout.write("\n")

The fact that stuff like this is even being *proposed* (not that it's not
smart or anything...) means there's a terrible bottleneck somewhere that
needs fixing. My proposal seems to do that nicely.

Of course, there's no such thing as a free lunch, and I'm sure there are
other corners that'll need fixing, but it appears having to write

    if (!PyString_Check(doc) && !PyUnicode_Check(doc))
        ...

in all places that may accept unicode strings is no fun either.

Yes, some code will break if you throw a wide string at it, but I think
that code is easier to repair with my proposal than with the current
implementation.

It's a big advantage to have only one string type; it makes many problems
we've been discussing easier to talk about.

Just