[pypy-dev] PyPy 2 unicode class

Thu Jan 23 13:21:55 CET 2014

On Wed, Jan 22, 2014 at 08:01:31AM +0100, Johan Råde wrote:

> At the Leysin Sprint Armin outlined a new design of the PyPy 2 unicode 
> class. He gave two versions of the design:
> 
>  A: unicode with a UTF-8 implementation and a UTF-32 interface.
> 
>  B: unicode with a UTF-8 implementation, a UTF-16 interface on Windows 
> and a UTF-32 interface on UNIX-like systems.

With a UTF-8 implementation, won't that mean that string indexing 
operations are O(N) rather than O(1)? UTF-8 is a variable-width 
encoding (one to four bytes per code point), so how do you know which 
byte(s) to look at to get the character at index 42 without having to 
walk the string from the start?
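
To make the concern concrete, here is a minimal sketch of naive indexing 
into a UTF-8 buffer. It is not PyPy's code, and it says nothing about how 
the proposed design would actually index; the name utf8_index is made up 
for this example. It only shows that, without some auxiliary index, 
finding code point N means scanning the bytes before it.

```python
def utf8_index(data: bytes, index: int) -> str:
    """Return the code point at `index` by scanning UTF-8 bytes from the start.

    Hypothetical helper for illustration only; assumes `data` is valid UTF-8.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    seen = -1      # index of the most recent code point whose lead byte we passed
    start = None   # byte offset where the requested code point begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a code point.
        if byte & 0xC0 != 0x80:
            if start is not None:
                # The next lead byte marks the end of the requested code point.
                return data[start:pos].decode("utf-8")
            seen += 1
            if seen == index:
                start = pos
    if start is not None:
        # The requested code point is the last one in the buffer.
        return data[start:].decode("utf-8")
    raise IndexError("string index out of range")
```

The loop touches every byte up to the requested code point, so the cost 
grows with the index rather than staying constant.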

Have you considered the Flexible String Representation from CPython 3.3?

http://www.python.org/dev/peps/pep-0393/

Basically, if the largest code point in the string is U+00FF or below, 
it is implemented using one byte per character (essentially Latin-1); if 
the largest code point is U+FFFF or below, it is implemented using two 
bytes per character (essentially UCS-2); otherwise, it is implemented 
using four bytes per character (UCS-4 or UTF-32). There's more to the 
FSR than that; read the PEP for further detail.
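
The width selection described above can be sketched in a few lines. This 
is only an illustration of the rule as I've stated it, not CPython's 
implementation (which is in C and has additional cases covered in the 
PEP); the name fsr_width and the choice of one byte for the empty string 
are assumptions of the sketch.

```python
def fsr_width(text: str) -> int:
    """Bytes per character that the simplified FSR rule would pick for `text`.

    Hypothetical helper for illustration only.
    """
    if not text:
        return 1  # assumption: treat the empty string as the narrowest form
    largest = max(map(ord, text))  # largest code point in the string
    if largest <= 0xFF:
        return 1  # Latin-1 range
    if largest <= 0xFFFF:
        return 2  # fits in UCS-2
    return 4      # needs UCS-4 / UTF-32
```

Because every character in a given string has the same width, the 
character at index i sits at byte offset i * width, which is what keeps 
indexing O(1).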

-- 
Steven