[pypy-dev] PyPy 2 unicode class
Steven D'Aprano
steve at pearwood.info
Thu Jan 23 13:21:55 CET 2014
On Wed, Jan 22, 2014 at 08:01:31AM +0100, Johan Råde wrote:
> At the Leysin Sprint Armin outlined a new design of the PyPy 2 unicode
> class. He gave two versions of the design:
>
> A: unicode with a UTF-8 implementation and a UTF-32 interface.
>
> B: unicode with a UTF-8 implementation, a UTF-16 interface on Windows
> and a UTF-32 interface on UNIX-like systems.
With a UTF-8 implementation, won't that mean that string indexing
operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
byte(s) to look at to get the character at index 42 without having to
walk the string from the start?
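To make the concern concrete, here is a minimal sketch of what indexing into a bare UTF-8 buffer involves when no auxiliary index is kept. The function name `utf8_char_at` is made up for illustration; it is not PyPy's code, and a real implementation could cache offsets to avoid the full scan.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the code point at `index` by scanning UTF-8 bytes from the start.

    Hypothetical helper: assumes `data` is valid UTF-8 and `index` >= 0.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    pos = 0    # current byte offset
    seen = 0   # number of code points passed so far
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells us how many bytes this code point occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % pos)
        if seen == index:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        seen += 1
    raise IndexError("string index out of range")


text = "a\u00e9\u20ac\U0001d11ez"   # 1-, 2-, 3-, 4- and 1-byte code points
encoded = text.encode("utf-8")
print(utf8_char_at(encoded, 3) == text[3])
```

The loop has to visit every code point before the requested one, so the cost grows with the index rather than staying constant.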
Have you considered the Flexible String Representation from CPython 3.3?
http://www.python.org/dev/peps/pep-0393/
Basically, if the largest code point in the string is U+00FF or below,
it is implemented using one byte per character (essentially Latin-1); if
the largest code point is U+FFFF or below, it is implemented using two
bytes per character (essentially UCS-2); otherwise, it is implemented
using four bytes per character (UCS-4 or UTF-32). There's more to the
FSR; read the PEP for further detail.
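As a rough illustration of the width selection described above, the following sketch picks the per-character storage size from the largest code point. The name `fsr_bytes_per_char` is hypothetical, and this only mirrors the three cases listed here; the PEP describes further detail (for example, a separate ASCII-only case) that is left out.

```python
def fsr_bytes_per_char(s: str) -> int:
    """Return 1, 2 or 4: the per-character width implied by the largest code point.

    Hypothetical helper that simplifies the rule in PEP 393.
    """
    highest = max(map(ord, s), default=0)   # an empty string counts as narrow
    if highest <= 0xFF:
        return 1    # essentially Latin-1
    if highest <= 0xFFFF:
        return 2    # essentially UCS-2
    return 4        # UCS-4 / UTF-32


for sample in ("abc", "caf\u00e9", "\u20ac", "\U0001d11e"):
    print(repr(sample), fsr_bytes_per_char(sample))
```

Because every character in a given string has the same width, the character at index i sits at byte offset i * width, which is what keeps indexing O(1).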
--
Steven