[Python-3000] Handling of wide Unicode characters

Josiah Carlson jcarlson at uci.edu
Sat Jun 2 02:44:19 CEST 2007


"Alexandre Vassalotti" <alexandre at peadrop.com> wrote:
> Thanks for the explanation. Anyway, it is certainly much simpler to
> deal with surrogate pairs than with variable-width characters.

I don't know, I really liked my tree overlay that could handle
variable-width characters in any internal encoding (UTF-7, UTF-8, UTF-16).
Of course, it takes an extra O(n/log n) space and O(log n) time to access
an arbitrary character in the worst case, but such is the nature of
time/space tradeoffs.

 - Josiah
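
To make the space/time tradeoff above concrete, here is a minimal sketch.
It is not the tree overlay described in the message (no code for that was
posted); it swaps in a simpler flat table of sampled byte offsets over UTF-8
data, which reaches the same O(n/log n) extra space and O(log n) lookup.
The class name Utf8Index and its attributes are made up for illustration,
and the input is assumed to be valid UTF-8.

```python
import math


class Utf8Index:
    """Random access to code points in UTF-8 bytes via sampled offsets.

    Hypothetical illustration: a flat checkpoint table, not a tree.
    Assumes `data` is valid UTF-8.
    """

    def __init__(self, data: bytes):
        self.data = data
        # Every byte that is not a continuation byte (0b10xxxxxx)
        # starts a new code point.
        self.length = sum(1 for b in data if (b & 0xC0) != 0x80)
        # Keep one checkpoint every `step` code points, step ~ log2(n),
        # so the table holds about n / log n offsets.
        self.step = max(1, int(math.log2(self.length))) if self.length > 1 else 1
        self.offsets = []
        count = 0
        for pos, b in enumerate(data):
            if (b & 0xC0) != 0x80:
                if count % self.step == 0:
                    self.offsets.append(pos)
                count += 1

    def __len__(self):
        return self.length

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Jump to the nearest checkpoint at or before `index` ...
        pos = self.offsets[index // self.step]
        # ... then walk forward over at most step - 1 code points.
        for _ in range(index % self.step):
            pos += 1
            while (self.data[pos] & 0xC0) == 0x80:
                pos += 1
        # Find the end of the code point that starts at `pos`.
        end = pos + 1
        while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
            end += 1
        return self.data[pos:end].decode("utf-8")


index = Utf8Index("a\U0002F8CDz".encode("utf-8"))
print(len(index), index[1] == "\U0002F8CD")
```

A balanced tree over the same checkpoints would additionally allow cheap
insertion and deletion, which a flat table like this one does not.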

> On 6/1/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Alexandre Vassalotti" <alexandre at peadrop.com> wrote:
> > > Hi,
> > >
> > > I was doing some testing on the new _string_io module, since I was
> > > slightly skeptical of my handling of wide Unicode characters (32
> > > bits long, instead of the usual 16 bits in UTF-16). So, I ran this
> > > little test:
> > >
> > >    >>> s = _string_io.StringIO()
> > >    >>> s.write(u'晉')
> > >    >>> s.tell()
> > >    2
> > >
> > > Like I expected, wide Unicode characters count for two. However, I was
> > > surprised that Python treats them as two characters as well:
> > >
> > >    >>> len(u'晉')
> > >    2
> > >    >>> u'晉'
> > >    u'\ud87e\udccd'
> > >
> > > Is it a bug, or only an implementation choice?
> >
> > If your Python is compiled as a UTF-16 (narrow) build, then any
> > character outside the Basic Multilingual Plane will be seen as two
> > characters by Python.  If you are using a UCS-4 build (it's the same
> > as UTF-32), then you should be seeing the single wide character as a
> > single wide character.  The only exception to this rule is if you
> > enter the wide character as a surrogate pair, in which case Python
> > doesn't normalize it into the single wide character.  To get a real
> > wide character, you would need to use a proper escape, or decode from
> > an encoded string.
> >
> >
> >  - Josiah
> >
> >
> 
> 
> -- 
> Alexandre Vassalotti
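
The surrogate pair in the session above, \ud87e\udccd, corresponds to the
code point U+2F8CD. The sketch below works through the UTF-16 arithmetic
behind that pair and shows the two ways mentioned in the reply to obtain a
single wide character: a \U escape and decoding from an encoded string. It
is written for Python 3.3 or later, where the narrow/wide build distinction
no longer exists; the helper names to_surrogates and from_surrogates are
made up for illustration.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def from_surrogates(high, low):
    """Combine a high/low surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# A proper escape yields one wide character ...
escaped = "\U0002F8CD"
# ... and so does decoding from an encoded byte string.
decoded = b"\xf0\xaf\xa3\x8d".decode("utf-8")

high, low = to_surrogates(ord(escaped))
print(hex(high), hex(low))
print(escaped == decoded, len(escaped))
# Number of 16-bit code units the character occupies in UTF-16.
print(len(escaped.encode("utf-16-le")) // 2)
```

On a narrow build of that era, len() reported the number of 16-bit code
units, which is why the session above shows 2 for a single character.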


