[Python-Dev] PEP 393 Summer of Code Project

Terry Reedy tjreedy at udel.edu
Wed Aug 24 02:46:00 CEST 2011


On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote:
> On 23.08.2011 11:46, Xavier Morel wrote:
>> Mostly-ASCII text is pretty common for western-European languages (French, for
>> instance, is probably 90 to 95% ASCII). It's also a risk in English, when
>> the writer "correctly" spells foreign words (résumé and the like).
>
> I know - I still question whether it is "extremely common" (so much as
> to justify a special case). That is: on what application, with what
> dataset, would you gain what speedup, at the expense of how many extra
> lines of code, and potential slow-downs for other datasets?
[snip]
> In the PEP 393 approach, if the string has a two-byte representation,
> each character needs to be widened to two bytes, and likewise for four
> bytes. So three separate copies of the unrolled loop would be needed,
> one for each target size.

I fully support the declared purpose of the PEP, which I understand to 
be to have a full, correct Unicode implementation in all new Python 
releases without paying unnecessary space (and consequent time) 
penalties. I think the erroneous length, iteration, indexing, and 
slicing for strings with non-BMP chars in narrow builds need to be 
fixed in future versions. I think we should at least consider 
alternatives to the PEP 393 solution of doubling or quadrupling the 
space whenever even one char needs it.

In utf16.py, attached to http://bugs.python.org/issue12729, I propose 
for consideration a prototype of a different solution to the 'mostly 
BMP chars, few non-BMP chars' case. Rather than expand every character 
from 2 bytes to 4, attach an array cpdex of the character (i.e. code 
point, not code unit) indexes of the non-BMP chars. Then for indexing 
and slicing, the correction is simple, simpler than I first expected:
   code_unit_index = char_index + bisect.bisect_left(cpdex, char_index)
where code_unit_index is the adjusted index into the full underlying 
double-byte array. This adds a time penalty of O(log2(len(cpdex))), but 
avoids most of the space penalty and the consequent time penalty of 
moving more bytes around and increasing cache misses.
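
For concreteness, here is a minimal sketch of that correction (my own 
illustration here, not the utf16.py code itself), assuming cpdex holds 
the sorted character indexes of the non-BMP chars:

    import bisect

    def unit_index(cpdex, char_index):
        # Each non-BMP char before char_index occupies one extra code
        # unit; bisect_left counts exactly those chars.
        return char_index + bisect.bisect_left(cpdex, char_index)

    # 'a' + U+10000 + 'b' is 3 chars stored in 4 double-byte units;
    # the lone non-BMP char is at character index 1, so cpdex == [1].
    cpdex = [1]
    assert unit_index(cpdex, 0) == 0  # 'a'
    assert unit_index(cpdex, 1) == 1  # start of the surrogate pair
    assert unit_index(cpdex, 2) == 3  # 'b' follows both surrogate units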

I believe the same idea would work for UTF-8 and the mostly-ASCII case. 
The main difference is that non-ASCII chars occupy varying numbers of 
extra bytes, rather than the single extra double-byte of non-BMP chars 
in UCS-2 builds. So the offset correction would not simply be the 
bisect_left return, but would require another lookup:
   byte_index = char_index + offsets[bisect.bisect_left(cpdex, char_index)]
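
A minimal sketch of that version, again my own illustration, assuming 
offsets[k] holds the cumulative extra bytes contributed by the first k 
non-ASCII chars (so offsets[0] == 0):

    import bisect

    def byte_index(cpdex, offsets, char_index):
        # The extra lookup turns 'how many non-ASCII chars precede
        # char_index' into 'how many extra bytes they contribute'.
        return char_index + offsets[bisect.bisect_left(cpdex, char_index)]

    # 'aéb€c' in UTF-8: 'é' takes 2 bytes (1 extra), '€' takes 3 (2 extra).
    cpdex = [1, 3]       # character indexes of the non-ASCII chars
    offsets = [0, 1, 3]  # cumulative extra bytes: none, after 'é', after '€'
    assert byte_index(cpdex, offsets, 0) == 0  # 'a'
    assert byte_index(cpdex, offsets, 2) == 3  # 'b', shifted past 'é'
    assert byte_index(cpdex, offsets, 4) == 7  # 'c', shifted past 'é' and '€'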

If possible, I would make the with-index-array versions separate 
subtypes, as in utf16.py. I believe either index-array implementation 
might also benefit from a subtype for strings that are a single 
multi-unit char: a single non-ASCII or non-BMP char does not need an 
auxiliary [0] array and a senseless lookup therein, but does need its 
reported length fixed at 1 instead of the number of base-array units.
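
To illustrate (a hypothetical sketch, not code from utf16.py), such a 
subtype needs no index array at all; only its reported length and item 
access change:

    class SingleNonBMP:
        # Hypothetical subtype for a string that is one non-BMP char
        # stored as a surrogate pair of two code units.
        def __init__(self, units):
            self.units = units  # the two code units

        def __len__(self):
            return 1            # one character, not len(self.units)

        def __getitem__(self, i):
            if i not in (0, -1):
                raise IndexError('string index out of range')
            return self.units   # the one char is the whole unit sequence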

-- 
Terry Jan Reedy



