[I18n-sig] Re: How does Python Unicode treat surrogates?

J M Sykes mike.sykes@acm.org
Mon, 25 Jun 2001 18:38:09 +0100


Mark Davis said:
>
> In most people's experience, it is best to leave the low level interfaces
> with indices in terms of code units, then supply some utility routines that
> tell you information about code points. ...

Anyone on the list interested in the treatment of UCS aka Unicode in
programming languages might like to know that a meeting of ISO/IEC JTC 1/SC
32/WG 3 recently approved a paper that specifies how SQL implementations
should do it.

The proposal can be found at:

ftp://sqlstandards.org/SC32/WG3/Meetings/PER_2001_04_Perth_AUS/per054r1.pdf

The current CD of the next SQL standard (ISO/IEC 9075), as amended by this
proposal (and many others) can be found at:

ftp://sqlstandards.org/SC32/WG3/Progression_Documents/CD/cd1r1-foundation-2001-06.pdf

Briefly, the SQL functions CHARACTER_LENGTH, POSITION (the SQL string
indexing function), and SUBSTRING will all accept a parameter specifying the
units to be used. The alternatives are OCTETS, CODE_UNITS and CHARACTERS
(which to SQL means code points); the default is CHARACTERS.
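
To make the three units concrete for readers of this list, here is a rough
Python sketch of how one string yields three different lengths. It is not
taken from the SQL proposal: length_in is a hypothetical helper, it assumes a
Python whose strings are indexed by code point (as in Python 3), and the
choice of UTF-16 as the default encoding form is only an assumption for
illustration.

```python
# Hypothetical helper, not part of SQL or of the proposal: it only
# illustrates how the three units differ for the same string.

# Width in octets of one code unit, for the encoding forms assumed here.
CODE_UNIT_WIDTH = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def length_in(text, units="CHARACTERS", encoding="utf-16-le"):
    """Return the length of text in OCTETS, CODE_UNITS or CHARACTERS."""
    if units == "CHARACTERS":
        # Assumes str is a sequence of code points (true in Python 3).
        return len(text)
    encoded = text.encode(encoding)
    if units == "OCTETS":
        return len(encoded)
    if units == "CODE_UNITS":
        return len(encoded) // CODE_UNIT_WIDTH[encoding]
    raise ValueError("unknown units: %r" % (units,))


# U+10400 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
sample = "a\U00010400z"
```

With UTF-16 assumed, sample is 3 characters, 4 code units and 8 octets; with
encoding="utf-8" the same string is 6 code units and 6 octets. Only the
CHARACTERS count is independent of the encoding form.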

This proposal was agreed by major SQL implementors.

Which doesn't mean that it's right, nor that it can't be changed. But that's
how it is at the moment.

Mike.

***********************************************************

J M Sykes              Email: Mike.Sykes@acm.org
97 Oakdale Drive
Heald Green
CHEADLE
Cheshire   SK8 3SN
UK                        Tel: (44) 161 437 5413

***********************************************************