Unicode codepoints

Wed Jun 22 18:15:19 EDT 2011

2011/6/22 Saul Spatz <saul.spatz at gmail.com>:
> Thanks.  I agree with you about the generator.  Using your first suggestion, code points above U+FFFF get separated into two "surrogate pair" characters from UTF-16.  So instead of U+10FFFF I get U+DBFF and U+DFFF.
Hi,
If you really need wide-unicode functionality on a narrow-unicode
Python build, and only need string indices in which a surrogate pair
counts as one item, you can build a list of single characters and
surrogate pairs, e.g.:

>>> surrog_txt=u"a𐌰 𐌱 𐌲 𐌳"
>>> surrog_txt
u'a\U00010330 \U00010331 \U00010332 \U00010333'
>>> print surrog_txt
a𐌰 𐌱 𐌲 𐌳
>>> list(surrog_txt)
[u'a', u'\ud800', u'\udf30', u' ', u'\ud800', u'\udf31', u' ',
u'\ud800', u'\udf32', u' ', u'\ud800', u'\udf33']
>>> import re
>>> re.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])|.", surrog_txt)
[u'a', u'\U00010330', u' ', u'\U00010331', u' ', u'\U00010332', u' ',
u'\U00010333']
>>>
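
The findall() call above could be wrapped in a small helper, roughly as
sketched here. The names split_chars, _PAIR_OR_UNIT and sample are made
up for illustration, and the sample is spelled with explicit surrogate
escapes so that it does not depend on how a given build stores astral
characters:

```python
import re

# One high surrogate followed by one low surrogate, or else any single item.
# (?s) lets "." match newlines too, so nothing in the text is skipped.
_PAIR_OR_UNIT = re.compile(u"(?s)[\ud800-\udbff][\udc00-\udfff]|.")

def split_chars(text):
    """Return a list in which each surrogate pair is kept together as one item."""
    return _PAIR_OR_UNIT.findall(text)

# Hypothetical sample: "a", one surrogate pair, a space, another surrogate pair.
sample = u"a\ud800\udf30 \ud800\udf31"
items = split_chars(sample)

print(len(items))         # number of items, pairs counted once
middle = items[1:3]       # slicing now works per item, not per code unit
```

Joining the items with u"".join(items) gives back the original string, so
the list can be used for index arithmetic and converted back afterwards.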

This way, indexing, slicing and len() work on the auxiliary list as
they would on a normal string; however, it probably won't be very
efficient for longer texts.
Note that surrogates are not the only mismatch between code points,
characters and glyphs (the visual representation of characters):
there are also combining diacritical marks, which can appear in
various combinations with precomposed accented characters, multiple
normalisation forms, etc.
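
As a small illustration of the combining-marks point, here is a sketch
using the standard unicodedata module; the variable names are only for
this example. The same visible character "é" can be stored as one code
point or as two, and normalisation converts between the forms:

```python
import unicodedata

precomposed = u"\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = u"e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

# Same visible character, but different lengths and unequal as strings.
lengths = (len(precomposed), len(decomposed))
same_raw = (precomposed == decomposed)

# NFC composes where possible, NFD decomposes; after normalising both
# operands to the same form the comparison succeeds.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
same_nfc = (nfc == precomposed)
same_nfd = (nfd == decomposed)
```

So even on a wide build, len() counts code points and not what a reader
would perceive as characters.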

regards,
   vbr