ICU wrapper for Python?

Tue Mar 5 19:17:43 EST 2002

On Tue, Mar 05, 2002 at 03:08:44PM -0800, John Machin wrote:
> Here FWIW is my understanding of the long form of what I think
> Martin's point is:
> 
> (1) CS == character set. UCS-2 is a 16-bit character set. It covers
> the original Unicode 1.0 character set. It is "deprecated and dead" --
> so says the ICU web site.
>
> (2) TF == transformation format. UTF-16 is a method of representing a
> larger character set (the Unicode 21-bit set) in 16-bit units. It uses
> "surrogate pairs" (two 16-bit numbers out of ranges that are reserved
> in UCS-2) to represent characters whose ordinal is greater than 65535.
Any confusion on my part is just that. I've dealt a little with Unicode
before and have read some documentation about it, but I'm well aware that
I am not very well versed in it. I'm learning about it as I go along and
as the knowledge becomes necessary. Thank you for your clarification :)
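
To make point (2) concrete, here is a minimal sketch of the surrogate-pair arithmetic as I understand it from the Unicode documentation. The function names are my own, made up for illustration; they are not part of ICU or of the wrapper.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in the range 0x10000..0x10FFFF")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+1D11E (the musical G clef) should come out as the pair 0xD834, 0xDD1E, and from_surrogate_pair() should turn that pair back into 0x1D11E.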

> (3) Conversion from UCS-2 to UTF-16: Unnecessary; UCS-2 data is
> already in UTF-16 format.
Indeed, that was pretty much what I said, meant, and acted on.

> (4) Conversion from UTF-16 to UCS-2: if there are surrogates in the
> data, strictly a UCS-2 result is not possible; if there are no
> surrogates, then the data is already UCS-2. The only function a
> "converter" could usefully perform would be to check for surrogates.
Right, a real conversion isn't possible once surrogates are present. I don't know what I was thinking :)
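
The surrogate check described in point (4) could look roughly like this. It is only a sketch: it assumes the data has already been unpacked into a sequence of 16-bit integers, and the function name is hypothetical.

```python
def contains_surrogates(code_units):
    """Return True if any 16-bit code unit falls in the surrogate range.

    Assumes code_units is an iterable of integers in 0..0xFFFF. If this
    returns False, the data is valid UCS-2 as well as UTF-16.
    """
    return any(0xD800 <= unit <= 0xDFFF for unit in code_units)
```

So [0x0041, 0x00E9] would pass as plain UCS-2, while [0x0041, 0xD834, 0xDD1E] would be flagged.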

> (5) Pythons compiled for "narrow" Unicode (16 bits) support UTF-16
> e.g. they don't take exception to the presence of surrogates, count a
> valid surrogate pair as one character when dealing in characters as
> opposed to code-units, etc.
Ah, OK. Good to know.
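
As a note to self, the difference between counting code units and counting characters in point (5) can be sketched like this. This is not how any Python build does it internally; the function name is made up, and treating an unpaired surrogate as one character is my own assumption.

```python
def count_characters(code_units):
    """Count characters in a sequence of UTF-16 code units.

    A high surrogate immediately followed by a low surrogate counts as
    one character. Any other unit, including an unpaired surrogate,
    counts as one character on its own (an assumption of this sketch).
    """
    units = list(code_units)
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1   # skip both halves of a pair
        count += 1
    return count
```

With the units [0x0041, 0xD834, 0xDD1E] there are three code units but only two characters.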

Anyway, the code I've produced thus far is available at
http://www.strakt.com/~laz/picu/picu-20020306.tar.gz

Thanks a lot for the help and input so far :)

//FJ