[I18n-sig] Unicode strings: an alternative

Guido van Rossum guido@python.org
Fri, 05 May 2000 10:54:16 -0400


> (Boy, is it quiet here all of a sudden ;-)

Maybe because (according to one report on NPR here) 80% of the world's
email systems are victimized by the ILOVEYOU virus?  You & I are not
affected because it's Windows-specific (a Visual Basic script; I got a
copy mailed to me so I could have a good look :-).  Note that there
are already mutations, one of which pretends to be a joke.

> Sorry for the duplication of stuff, but I'd like to reiterate my points, to
> separate them from my implementation proposal, as that's just what it is:
> an implementation detail.
> 
> These things are important to me:
> - get rid of the Unicode-ness of wide strings, in order to
> - make narrow and wide strings as similar as possible
> - implicit conversion between narrow and wide strings should
>   happen purely on the basis of the character codes; no
>   assumption at all should be made about the encoding, i.e.
>   what the character code _means_.
> - downcasting from wide to narrow may raise OverflowError if
>   there are characters in the wide string that are > 255
> - str(s) should always return s if s is a string, whether narrow
>   or wide
> - file objects need to be responsible for handling wide strings
> - the above two points should make it possible for
> - if no encoding is known, Unicode is the default, whether
>   narrow or wide
> 
> The above points seem to have the following consequences:
> - the 'u' in \uXXXX notation no longer makes much sense,
>   since it is not necessary for the character to be a Unicode
>   code point: it's just a 2-byte int. \wXXXX might be an option.
> - the u"" notation is no longer necessary: if a string literal
>   contains a character > 255 the string should automatically
>   become a wide string.
> - narrow strings should also have an encode() method.
> - the builtin unicode() function might be redundant if:
>   - it is possible to specify a source encoding. I'm not sure if
>     this is best done through an extra argument for encode()
>     or that it should be a new method, e.g. transcode().
>   - s.encode() or s.transcode() are allowed to output a wide
>     string, as in aNarrowString.encode("UCS-2") and
>     s.transcode("Mac-Roman", "UCS-2").
> 
> My proposal to extend the "old" string type to be able to contain wide
> strings is of course largely unrelated to all this. Yet it may provide some
> additional C compatibility (especially now that silent conversion to utf-8
> is out) as well as a workaround for the
> str()-having-to-return-a-narrow-string bottleneck.

I'm not so sure that this is enough.  You seem to propose wide strings
as vehicles for 16-bit values (and maybe later 32-bit values) apart
from their encoding.  We already have a data type for that (the array
module).  The Unicode type does a lot more than storing 16-bit values:
it knows lots of encodings to and from Unicode, and it knows things
like which characters are upper or lower or title case and how to map
between them, which characters are word characters, and so on.  All
this is highly Unicode-specific and is part of what people ask for
when they request Unicode support.  (Example: Unicode has
405 characters classified as numeric, according to the isnumeric()
method.)
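
To make the contrast concrete, here is a minimal sketch, assuming a
modern Python 3 interpreter (where str is the Unicode type) rather than
the interpreter under discussion in this thread.  The names codes and
text are illustrative only.  An array of 16-bit values stores the same
numbers but has no notion of case or numeric class; the Unicode string
does.

```python
from array import array
import unicodedata

# Three 16-bit values held as plain unsigned integers.
# They happen to be 'A', 'e with acute', and ARABIC-INDIC DIGIT THREE,
# but the array itself knows nothing about that.
codes = array("H", [0x0041, 0x00E9, 0x0663])

# The same values interpreted as Unicode code points.
text = "".join(chr(c) for c in codes)

# Character knowledge lives on the Unicode type, not on the container.
upper = text.upper()                         # case mapping
numeric_flags = [ch.isnumeric() for ch in text]
digit_value = unicodedata.numeric(text[2])   # numeric value of the digit

# The array offers storage only: no case or classification methods.
array_has_upper = hasattr(codes, "upper")
```

The array would do just as well as a vehicle for raw 16-bit values;
everything that makes the result useful as text comes from the Unicode
character database.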

And by the way, don't worry about the comparison.  I'm not changing
the default comparison (==, cmp()) for Unicode strings to be anything
other than per 16-bit-quantity.  However a Unicode object might in
addition have a method to do normalization or whatever, as long as it's
language independent and strictly defined by the Unicode standard.
Language-specific operations belong in separate modules.
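
A small sketch of the distinction between default comparison and
normalization, again assuming modern Python 3.  Two caveats: in current
Python the comparison is per code point rather than per 16-bit
quantity, and normalization is exposed as the function
unicodedata.normalize() rather than as a method on the string object.
The variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"      # e with acute as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Default comparison looks only at the stored values, so these differ.
plain_equal = composed == decomposed

# Normalization is language independent and defined by the Unicode
# standard; after NFC the two spellings compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
```

Locale-aware collation, by contrast, depends on the language and so
would live in a separate module.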

--Guido van Rossum (home page: http://www.python.org/~guido/)