[I18n-sig] Unicode strings: an alternative

Just van Rossum just@letterror.com
Thu, 4 May 2000 12:27:45 +0100


I wrote:
>It's a big advantage to have only one string type; it makes many problems
>we've been discussing easier to talk about.

I think I should've been more explicit about what I meant here. I'll try to
phrase it as an addendum to my proposal -- which suddenly is no longer just
a narrow/wide string unification but narrow/wide/ultrawide, to really be
ready for the future...

As someone else suggested in the discussion, I think it's good if we
separate the encoding from the data type, meaning that wide strings are no
longer tied to Unicode. This allows for double-byte encodings other than
UCS-2 as well as for safe passing-through of binary goop, but that's not
the main point. The main point is that this will make the behavior of
(wide) strings more understandable and consistent.

The extended string type is simply a sequence of code points, allowing for
0-0xFF for narrow strings, 0-0xFFFF for wide strings, and 0-0xFFFFFFFF for
ultrawide strings. Upcasting is always safe; downcasting may raise
OverflowError. Depending on the encoding used, this comes as close as
possible to the sequence-of-characters model.
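
To make the upcast/downcast rule concrete, here is a rough sketch in
Python. It is only an illustration under my own assumptions: the class
CodePointString, its cast() method and the MAX_CODE_POINT table are
hypothetical names, not part of any existing or proposed API.

```python
# Hypothetical sketch: none of these names come from Python itself.
MAX_CODE_POINT = {"narrow": 0xFF, "wide": 0xFFFF, "ultrawide": 0xFFFFFFFF}


class CodePointString:
    """A plain sequence of code points with a fixed storage width."""

    def __init__(self, code_points, width="narrow"):
        limit = MAX_CODE_POINT[width]
        points = tuple(code_points)
        for cp in points:
            # Every code point must fit in the chosen width.
            if not 0 <= cp <= limit:
                raise OverflowError(
                    "code point 0x%X does not fit in a %s string" % (cp, width)
                )
        self.code_points = points
        self.width = width

    def cast(self, width):
        """Return a copy at another width; only narrowing can fail."""
        return CodePointString(self.code_points, width)


s = CodePointString([0x41, 0xE9], "narrow")
w = s.cast("wide")                  # upcast: always succeeds
euro = CodePointString([0x20AC], "wide")
try:
    euro.cast("narrow")             # downcast: 0x20AC is larger than 0xFF
except OverflowError as exc:
    print(exc)
```

Note that no encoding appears anywhere in the sketch: the type only knows
about integer code points and the range each width allows.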

The default character set should of course be Unicode -- and it should be
obvious that this implies Latin-1 for narrow strings.
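
The Latin-1 point can be checked directly, since Latin-1 assigns each
byte value 0-0xFF to the Unicode code point with the same number. A small
sketch, written for a present-day Python 3 interpreter (an assumption on
my part, as is the helper name narrow_code_points):

```python
def narrow_code_points(data):
    """Interpret each byte of a narrow string as a Latin-1 character
    and return the resulting Unicode code points."""
    return [ord(ch) for ch in data.decode("latin-1")]


# Every possible byte value maps to the code point with the same number.
all_bytes = bytes(range(256))
identity = narrow_code_points(all_bytes) == list(range(256))

# U+00E9 (e with acute accent) is byte 0xE9 in Latin-1.
e_acute = "\u00e9".encode("latin-1")
```

So a narrow string read as Latin-1 is already a valid sequence of Unicode
code points, and upcasting it needs no translation table.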

(Additionally: an encoding attribute suddenly makes a whole lot of sense
again.)

Ok, y'all can shoot me now ;-)

Just