[XML-SIG] Re: Issues with Unicode type

Sjoerd Mullender sjoerd@acm.org
Tue, 24 Sep 2002 10:19:56 +0200


Nobody seems to have bothered looking at the two characters produced
by u'\u10800'.  I'd say: try it:

+ python
Python 2.3a0 (#78, Sep 20 2002, 11:19:50) 
[GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-112)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> c = u"\u10800"
>>> len(c)
2
>>> c
u'\u10800'
>>> c[0]
u'\u1080'
>>> c[1]
u'0'
>>> 

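In other words, the \u escape takes exactly the next 4 hex digits and uses
those to create a Unicode character; whatever is left over is just
appended as ordinary characters.  So len(c) is 2 here no matter how
Python was built, and this test says nothing about UTF-16 versus UCS-4.
If you use the \U escape you need to provide 8 hex digits: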

>>> c = u'\U00010800'
>>> len(c)
2
>>> c[0]
u'\ud802'
>>> c[1] 
u'\udc00'
>>> 

And here we see the surrogates appear.  The string is still 2 characters
long, but this time because this build stores U+10800 as a UTF-16
surrogate pair.
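
For what it's worth, the two surrogate values can be worked out by hand.
Below is a minimal sketch of the standard UTF-16 arithmetic, written for a
current Python 3 interpreter (an assumption; the sessions in this thread
are Python 2.2/2.3).  The helper name to_surrogates is made up for
illustration and is not part of any library.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogates")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogates(0x10800)
print(hex(high), hex(low))              # 0xd802 0xdc00
```

That matches the u'\ud802' and u'\udc00' shown in the session above.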

On Mon, Sep 23 2002 Daniel Veillard wrote:

> On Mon, Sep 23, 2002 at 03:58:11PM -0600, Uche Ogbuji wrote:
> > > > Can you confirm that this is what Red Hat does by default, as
> > > > mentioned by Uche, and do you know the motivations (and possible
> > > > downsides) for this decision?
> > > 
> > >   By default Red Hat compiles python with unicode support in UTF-16.
> > > I'm not in charge of this, I assume it's the default compilation option.
> > 
> > Not from what we found.  Jeremy was the one who encountered this, not me, but 
> > I'm pretty sure he said he found that starting with RH 7.3, Red Hat started 
> > building Python 2.x with UTF-32 and wchar_t support.
> 
>   Hum, here on 2 recent versions :-)
> 
> paphio:~ -> python2.2
> Python 2.2 (#1, Apr 12 2002, 15:29:57) 
> [GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-109)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> c = u"\u10800"
> >>> len(c)    
> 2
> >>> 
> 
> gnome:~ -> python
> Python 2.2.1 (#1, Aug 30 2002, 12:15:30) 
> [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> c = u"\u10800"
> >>> len(c)    
> 2
> >>> 
> 
>   looks like UTF16 to me !
> 
> > > IMHO it's a wrong assumption to think that UTF16 is a good cut, because
> > > you end up with a variable-length encoding anyway, and UCS32 would
> > > seriously bloat the app I'm afraid.
> > 
> > Just as a side observation: Guido called this FUD.  I'm not so sure.
> 
>   It's just my opinion, and on the whole I and others in the Gnome and KDE
> projects all went with UTF8 without any prior coordination; it was just
> natural to us (okay, this also keeps strings 0 terminated, which is crucial).
> 
> Daniel
> 
> -- 
> Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
> veillard@redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
> http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

-- Sjoerd Mullender <sjoerd@acm.org>