ICU wrapper for Python?

Tue Mar 5 18:08:44 EST 2002

Fredrik Juhlin <laz at strakt.com> wrote in message news:<mailman.1015344612.32189.python-list at python.org>...
> On Tue, Mar 05, 2002 at 04:30:41PM +0100, Martin von Loewis wrote:
> > Fredrik Juhlin <laz at strakt.com> writes:
> > 
> > > However, I'm relying on the fact that since Python uses UCS-2 and ICU uses
> > > UTF-16 for their respective internal format, any Python unicode string can
> > > be used as an ICU unicode string. So for the collation I don't need to do
> > > any conversion between the two. To expose the codecs, one would have to
> > > convert the resulting strings from UTF-16 to UCS-2.
> > 
> > I'm a bit slow here: Why do you think Python uses UCS-2 and not
> > (simultaneously) UTF-16? What kind of conversion would you perform?
> Maybe I'm the one that's slow, or possibly horribly confused.
> Actually, I'm pretty damn sure that the docs I read at home said that
> Python used UCS-2 rather than UTF-16. But looking at the online docs
> they're saying UTF-16. So what I thought would be a problem apparently
> won't be. Which is good news :)
> 

Here, FWIW, is my understanding of the long form of what I think
Martin's point is:

(1) CS == character set. UCS-2 is a 16-bit character set. It covers
the original Unicode 1.0 character set. It is "deprecated and dead" --
so says the ICU web site.

(2) TF == transformation format. UTF-16 is a method of representing a
larger character set (the Unicode 21-bit set) in 16-bit units. It uses
"surrogate pairs" to represent characters whose ordinal is greater
than 65535 (0xFFFF): a high surrogate in the range 0xD800-0xDBFF
followed by a low surrogate in the range 0xDC00-0xDFFF. Both ranges
are reserved in UCS-2, so they never collide with ordinary characters.

(3) Conversion from UCS-2 to UTF-16: Unnecessary; UCS-2 data is
already in UTF-16 format.

(4) Conversion from UTF-16 to UCS-2: if there are surrogates in the
data, then strictly speaking a UCS-2 result is not possible; if there
are no surrogates, then the data is already UCS-2. The only function a
"converter" could usefully perform would be to check for surrogates.

(5) Pythons compiled for "narrow" Unicode (16 bits) support UTF-16;
for example, they don't take exception to the presence of surrogates,
they count a valid surrogate pair as one character when dealing in
characters as opposed to code units, etc.
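
To make points (2) and (4) concrete, here is a rough sketch of the
surrogate arithmetic and of the "check for surrogates" that a
UTF-16-to-UCS-2 "converter" would amount to. It assumes a current
Python 3 interpreter (3.4 or later, for the "surrogatepass" error
handler), not the 2.2-era narrow build discussed above. The function
names are my own, purely for illustration, and are not part of ICU or
of any Python module.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20 bits are left after this
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units of the UTF-16 encoding of text."""
    # "surrogatepass" lets lone surrogates through instead of raising.
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_ucs2(text):
    """True if no code unit is a surrogate, i.e. the data is also UCS-2."""
    return not any(0xD800 <= unit <= 0xDFFF
                   for unit in utf16_code_units(text))
```

For instance, to_surrogate_pair(0x1D11E) gives (0xD834, 0xDD1E), which
is the same pair that utf16_code_units("\U0001D11E") reports. is_ucs2
is true for "abc" and false for any string containing a character
above 0xFFFF.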