[XML-SIG] Re: Issues with Unicode type

Mike Brown mike@skew.org
Mon, 23 Sep 2002 15:38:52 -0600 (MDT)


Tom Emerson wrote:
> internally Python is representing characters outside the BMP as a
> surrogate pair in UTF-16, the length of a Unicode string using these
> characters is 2 --- two UTF-16 characters.

To be pedantic, characters are on a different level of abstraction than
surrogate pairs, which are pairs of 16-bit code values.

code value != character

rather,

code value sequence (1 or more) may be equivalent to a character

In UTF-16, many characters can be represented with a single code value, but
some require two code values, both selected from a range of values (0xD800
through 0xDFFF) that are not individually assigned to characters.
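
As a rough illustration of that two-code-value case, here is a minimal Python
sketch of the arithmetic that maps a code point above U+FFFF to its pair of
16-bit code values. The name to_surrogate_pair is made up for this example; it
is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into two UTF-16 code values.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits land in 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)    # low 10 bits land in 0xDC00..0xDFFF
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))
```

Neither half means anything on its own; only the pair, taken together, stands
for the character.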

Programming languages still take a shortcut by defining a 'character' data
type as whatever fixed-size code value is adequate 99% of the time, which
often means you're stuck with no differentiation between the idea of a
character and a single 16-bit code value that represents it internally.

Consequently you find that len(someString) gives you not the number of
characters but the number of code values in the string. And 99% of the time,
that's fine ... until your string contains one of the characters outside the
BMP (roughly 1.1 million minus 65536 code points).

So I think the problem here is not that Python says len(u"\uD800\uDC00") is 2
(unless somewhere it says that Python supports Unicode 3.2) but that someone
assumed len() returns a count of Unicode characters...
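
If a count of characters is what you actually want, you have to walk the code
values yourself and treat each surrogate pair as one. A minimal sketch,
assuming the string is available as a list of 16-bit code values; both
function names are invented for this example, and the code is written for a
current Python 3 interpreter rather than a 2002-era narrow build:

```python
import struct


def utf16_code_values(text):
    """Return the UTF-16 code values of a well-formed string as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def count_characters(code_values):
    """Count code points, treating a high+low surrogate pair as one."""
    count = 0
    i = 0
    n = len(code_values)
    while i < n:
        value = code_values[i]
        is_high = 0xD800 <= value <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= code_values[i + 1] <= 0xDFFF:
            i += 2      # a complete surrogate pair: one character
        else:
            i += 1      # a BMP code value (or an unpaired surrogate)
        count += 1
    return count


values = utf16_code_values("a\U00010000b")
print(len(values), count_characters(values))
```

Here len(values) is what a 16-bit len() reports, and count_characters(values)
is what the person asking probably expected.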

> If you compile your Python installation to use "wide" Unicode
> characters (i.e., UTF-32), then I expect the behavior to be
> 
> >>> c = u"\U00010000"
> >>> len(c)
> 1

Agreed.
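
For anyone wanting to check which kind of build they have, sys.maxunicode
reports the largest code point a single string element can hold: 0xFFFF on a
narrow build, 0x10FFFF on a wide one. A small sketch (is_wide_build is a name
invented here; on Python 3.3 and later the answer is always the wide one):

```python
import sys


def is_wide_build():
    """True if one string element can hold any code point up to U+10FFFF."""
    return sys.maxunicode > 0xFFFF


c = "\U00010000"
if is_wide_build():
    print("wide build, len(c) =", len(c))
else:
    print("narrow build, len(c) =", len(c))
```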

> >>> len(c)
> u'\U00010000'

I think you mean c, not len(c).

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/