[Python-Dev] PEP 393 Summer of Code Project

Wed Aug 24 15:06:23 CEST 2011

On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy <tjreedy at udel.edu> wrote:
> In utf16.py, attached to http://bugs.python.org/issue12729
> I propose for consideration a prototype of a different solution to the
> 'mostly BMP chars, few non-BMP chars' case. Rather than expand every
> character from 2 bytes to 4, attach an array cpdex of the character (i.e.
> code point, not code unit) indexes of the non-BMP characters. Then for
> unit) indexes. Then for indexing and slicing, the correction is simple,
> simpler than I first expected:
>  code_unit_index = char_index + bisect.bisect_left(cpdex, char_index)
> where code_unit_index is the adjusted index into the full underlying
> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
> most of the space penalty and the consequent time penalty of moving more
> bytes around and increasing cache misses.
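
For readers following along, the quoted index correction can be sketched
roughly as below. This is a minimal illustration and not the code in
utf16.py: the helper names build_cpdex, code_unit_index and char_at are
hypothetical, the code units are kept in a plain Python list, and the input
is assumed to contain no lone surrogates.

```python
import bisect

def build_cpdex(text):
    """Encode text as UTF-16 code units and record where non-BMP chars sit.

    Returns (units, cpdex): units is a list of 16-bit code units, cpdex is
    the sorted list of character indexes of the non-BMP characters.
    """
    units = []
    cpdex = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp > 0xFFFF:
            # Non-BMP: remember its character index, store a surrogate pair.
            cpdex.append(i)
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
        else:
            units.append(cp)
    return units, cpdex

def code_unit_index(cpdex, char_index):
    # Each non-BMP character before char_index takes one extra code unit.
    return char_index + bisect.bisect_left(cpdex, char_index)

def char_at(units, cpdex, char_index):
    u = code_unit_index(cpdex, char_index)
    unit = units[u]
    if 0xD800 <= unit < 0xDC00:
        # High surrogate: combine it with the low surrogate that follows.
        low = units[u + 1]
        return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
    return chr(unit)
```

With text = "a\U00010000b", cpdex is [1], so character index 2 maps to code
unit index 3, skipping the second half of the surrogate pair.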

Interesting idea, but putting on my C programmer hat, I say -1.

Non-uniform cell size = not a C array = standard C array manipulation
idioms don't work = pain (no matter how simple the index correction
happens to be).

The nice thing about PEP 393 is that it gives us the smallest storage
array that is both an ordinary C array and has sufficiently large
individual elements to handle every character in the string.
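
As a rough Python illustration of that point (a sketch only, not the C
implementation): pick the element width from the largest code point, then
store every character at that width, so that character i always lives at
byte offset i * width. The function names here are hypothetical, and the
1/2/4 byte thresholds are the ones PEP 393 describes.

```python
def storage_width(text):
    """Bytes per character needed to hold every code point in text."""
    max_cp = max(map(ord, text), default=0)
    if max_cp < 0x100:
        return 1   # Latin-1 range
    if max_cp < 0x10000:
        return 2   # BMP
    return 4       # anything beyond the BMP

def pack_uniform(text):
    # Every character is stored at the same width, like a plain C array.
    width = storage_width(text)
    data = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, data

def char_at_uniform(width, data, i):
    # Indexing is plain offset arithmetic; no correction table is needed.
    return chr(int.from_bytes(data[i * width:(i + 1) * width], "little"))
```

A single non-BMP character widens the whole array to 4 bytes per element,
which is the space cost the quoted proposal is trying to avoid.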

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia