[pypy-dev] [cpyext] partial fake PEP393 implementation to provide access to single unicode characters in strings

Sat Apr 14 18:44:53 CEST 2012

Hi,

PEP393 (the new Unicode type in Py3.3) defines a rather useful C interface
to the characters of a Unicode string. I think it would be cool if
cpyext provided that, so that access to single characters no longer requires
copying the unicode buffer into C space.

I attached an untested (and likely non-working) patch that adds the most
important parts of it. The implementation does not handle non-BMP
characters, which (if I'm not mistaken) are encoded as surrogate pairs in
PyPy. Apart from that, the functions behave like their CPython
counterparts, so the implementation shouldn't get in the way
of a future real PEP393 implementation.
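
To make the non-BMP limitation concrete, here is a rough standalone C sketch.
It is not taken from the patch: fake_read_char() and combine_surrogates() are
hypothetical names, and the sketch assumes the string is stored as UTF-16 code
units, which is my guess about PyPy rather than something I have checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for plain index access on a UTF-16 buffer:
 * returns the code unit at `index` without looking at its neighbours,
 * so a non-BMP character is seen as two separate surrogate values. */
static uint32_t fake_read_char(const uint16_t *units, size_t length,
                               size_t index)
{
    assert(index < length);
    return units[index];
}

/* What a surrogate-aware reader would compute from a high/low pair
 * (assumes `high` is in 0xD800..0xDBFF and `low` in 0xDC00..0xDFFF). */
static uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u
         + (((uint32_t)high - 0xD800u) << 10)
         + ((uint32_t)low - 0xDC00u);
}
```

For the buffer { 0x0061, 0xD83D, 0xDE00 } ("a" followed by U+1F600), reading
index 1 this way gives the high surrogate 0xD83D, whereas a real PEP393 string
would report the single code point 0x1F600 at that index. BMP-only strings
give the same result either way.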

What do you think?

I have no idea whether the way the index access is done in PyUnicode_READ_CHAR()
is at all efficient; it would be good if it was. Specifically, the
intention is to avoid creating a 1-character unicode string copy before
taking its ord(). Is that copy avoided automatically, or is there a way to make
sure it is?

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fake_pep393.patch
Type: text/x-patch
Size: 3138 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/pypy-dev/attachments/20120414/d23e4802/attachment.bin>