[Python-Dev] Internationalization Toolkit

M.-A. Lemburg mal@lemburg.com
Thu, 11 Nov 1999 15:26:59 +0100


Tim Peters wrote:
> 
> [/F]
> > last time I checked, there were no characters (even in the
> > ISO standard) outside the 16-bit range.  has that changed?
> 
> [MAL]
> > No, but people are already thinking about it and there is
> > a defined range in the >16-bit area for private encodings
> > (F0000..FFFFD and 100000..10FFFD).
> 
> Over the decades I've developed a rule of thumb that has never wound up
> stuck in my ass <wink>:  If I engineer code that I expect to be in use for N
> years, I make damn sure that every internal limit is at least 10x larger
> than the largest I can conceive of a user making reasonable use of at the
> end of those N years.  The invariable result is that the N years pass, and
> fewer than half of the users have bumped into the limit <0.5 wink>.
> 
> At the risk of offending everyone, I'll suggest that, qualitatively
> speaking, Unicode is as Eurocentric as ASCII is Anglocentric.  We've just
> replaced "256 characters?!  We'll *never* run out of those!" with 64K.  But
> when Asian languages consume them 7K at a pop, 64K isn't even in my 10x
> comfort range for some individual languages.  In just a few months, Unicode
> 3 will already have used up > 56K of the 64K slots.
> 
> As I understand it, UTF-16 "only" adds 1M new code points.  That's in my 10x
> zone, for about a decade.

If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
signal failure of this assertion at Unicode object construction time
via an exception. That way we stay within the standard, can use
reasonably fast code for Unicode manipulation, and can add those extra
1M characters at a later stage.
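
To make the idea concrete, here is a rough sketch of such a
construction-time check. The names check_ucs2 and SurrogateError are
made up for illustration and are not part of any proposed API; the only
fact assumed is that UTF-16 data differs from UCS-2 exactly when it
contains code units in the surrogate range D800..DFFF.

```python
# Hypothetical sketch: check_ucs2 and SurrogateError are illustrative
# names only, not an existing or proposed interface.

SURROGATE_FIRST = 0xD800  # first high surrogate code unit
SURROGATE_LAST = 0xDFFF   # last low surrogate code unit


class SurrogateError(ValueError):
    """Raised when the input needs real UTF-16 (surrogate pairs)."""


def check_ucs2(code_units):
    """Return the 16-bit code units as a tuple if they are plain UCS-2.

    The check runs once, at construction time. Any surrogate code unit
    means the "UTF-16 is UCS-2" assumption fails, so an exception is
    raised instead of building the object.
    """
    units = tuple(code_units)
    for index, unit in enumerate(units):
        if not 0 <= unit <= 0xFFFF:
            raise ValueError(
                "not a 16-bit code unit at index %d: %#x" % (index, unit)
            )
        if SURROGATE_FIRST <= unit <= SURROGATE_LAST:
            raise SurrogateError(
                "surrogate %#06x at index %d: characters beyond the "
                "16-bit range are not supported yet" % (unit, index)
            )
    return units
```

With this, check_ucs2([0x48, 0x20AC]) passes the data through unchanged,
while check_ucs2([0xD800, 0xDC00]) raises SurrogateError. Once the check
has passed, all later code can index and slice on fixed-width 16-bit
units without looking for pairs.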

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/