[XML-SIG] Re: Issues with Unicode type

Jeremy Kloth jeremy.kloth@fourthought.com
Mon, 23 Sep 2002 16:10:52 -0600


----- Original Message -----
From: "Daniel Veillard" <veillard@redhat.com>
To: "Uche Ogbuji" <uche.ogbuji@fourthought.com>
Cc: "Eric van der Vlist" <vdv@dyomedea.com>; <xml-sig@python.org>
Sent: Monday, September 23, 2002 3:59 PM
Subject: Re: [XML-SIG] Re: Issues with Unicode type


> On Mon, Sep 23, 2002 at 03:58:11PM -0600, Uche Ogbuji wrote:
> > > > Can you confirm that this is what RedHat does by default as
> > > > mentioned Uche and do you know the motivations (and eventually
> > > > downsides) for this decision?
> > >
> > >   By default Red Hat compiles python with unicode support in UTF-16.
> > > I'm not in charge of this, I assume it's the default compilation
> > > option.
> >
> > Not from what we found.  Jeremy was the one who encountered this, not
> > me, but I'm pretty sure he said he found that starting with RH 7.3,
> > Red Hat started building Python 2.x with UTF-32 and wchar_t support.
>
>   Hum, here on 2 recent versions :-)
>
> paphio:~ -> python2.2
> Python 2.2 (#1, Apr 12 2002, 15:29:57)
> [GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-109)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> c = u"\u10800"
> >>> len(c)
> 2
> >>>
>
> gnome:~ -> python
> Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
> [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> c = u"\u10800"
> >>> len(c)
> 2
> >>>
>
>   looks like UTF16 to me !

However, that string is really two characters, 0x1080 and 0x0030.  \u
(lowercase) takes exactly 4 hex digits; \U (uppercase) takes exactly 8.  So
to create the character 0x10800, the literal should be u'\U00010800'.
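
As a rough illustration of the difference between the two escapes, here is a
small sketch.  It assumes Python 3 syntax (where plain string literals are
Unicode and the u prefix is optional); the variable names are only
illustrative.

```python
# "\u" consumes exactly 4 hex digits, so the trailing "0" is a separate character.
two_chars = "\u10800"

# "\U" consumes exactly 8 hex digits and names a single code point.
one_char = "\U00010800"

print([hex(ord(ch)) for ch in two_chars])  # ['0x1080', '0x30']
print(hex(ord(one_char)))                  # 0x10800
```

On a narrow Python 2 build, len(u'\U00010800') would still be 2, because the
character is stored as a surrogate pair, so len() alone cannot distinguish
the two kinds of build.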

To truly see if Python has wide unicode support:

import sys
print sys.maxunicode

If the result is greater than 65535 (a wide build reports 1114111), then it
was compiled with "--enable-unicode=ucs4", which the RPM spec file for
python 2.2.1 does use.
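
The same check can be wrapped in a small helper.  This is only a sketch in
Python 3 syntax; unicode_build is a hypothetical name, not something the
standard library provides.

```python
import sys


def unicode_build(maxunicode=None):
    """Classify an interpreter build from its sys.maxunicode value.

    Hypothetical helper for illustration; pass a value explicitly to
    classify a number reported by another interpreter.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # A narrow build reports 0xFFFF (65535); a wide build reports 0x10FFFF (1114111).
    return "wide (UCS-4)" if maxunicode > 0xFFFF else "narrow (UCS-2)"


print(sys.maxunicode, unicode_build())
```

Note that Python 3.3 and later always report 1114111, so the distinction
only matters for older interpreters such as the 2.2 builds discussed here.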

--
Jeremy Kloth