[I18n-sig] Unicode strings: an alternative

Tom Emerson tree@basistech.com
Wed, 3 May 2000 16:19:05 -0400 (EDT)


Just van Rossum writes:
 > The main concept is not to provide a new string type but to extend the
 > existing string object like so:

This is the most logical thing to do.

 > - wide strings are stored as if they were narrow strings, simply using two
 > bytes for each Unicode character.

I disagree with you here... store them as UTF-8.

 > - there's a flag that specifies whether the string is narrow or wide.

Yup.

 > - the ob_size field is the _physical_ length of the data; if the string is
 > wide, len(s) will return ob_size/2, all other string operations will have
 > to do similar things.

Is it possible to add a logical length field too? I presume it is too
expensive to recalculate the logical (character) length of a string
each time len(s) is called? Doing this is only slightly more time
consuming than a normal strlen: really just O(n) + c, where 'c' is the
constant time needed for table lookup (to get the number of bytes in
the UTF-8 sequence given the start character) and the pointer
manipulation (to add that length to your span pointer).

 > - there can possibly be an encoding attribute which may specify the used
 > encoding, if known.

So is this used to handle the case where you have a legacy encoding
(ShiftJIS, say) used in your existing strings, so you flag that 8-bit
("narrow" in a way) string as ShiftJIS?

If wide strings are always Unicode, why do you need the encoding?


 > Admittedly, this is tricky and involves quite a bit of effort to implement,
 > since all string methods need to have narrow/wide switch. To make it worse,
 > it hardly offers anything the current solution doesn't. However, it offers
 > one IMHO _big_ advantage: C code that just passes strings along does not
 > need to change: wide strings can be seen as narrow strings without any
 > loss. This allows for __str__() & str() and friends to work with unicode
 > strings without any change.

If you store wide strings as UCS2 then people using the C interface
lose: strlen() stops working, or will return incorrect
results. Indeed, any of the str*() routines in the C runtime will
break. This is the advantage of using UTF-8 here --- you can still use
strcpy and the like on the C side and have things work.

 > Any thoughts?

I'm doing essentially what you suggest in my Unicode enablement of MySQL.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"