[Python-Dev] PEP 393 Summer of Code Project

Wed Aug 24 09:56:56 CEST 2011

On 8/23/2011 5:46 PM, Terry Reedy wrote:
> On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote:
>> Am 23.08.2011 11:46, schrieb Xavier Morel:
>>> Mostly ascii is pretty common for western-european languages 
>>> (French, for
>>> instance, is probably 90 to 95% ascii). It's also a risk in english, 
>>> when
>>> the writer "correctly" spells foreign words (résumé and the like).
>>
>> I know - I still question whether it is "extremely common" (so much as
>> to justify a special case). I.e. on what application with what dataset
>> would you gain what speedup, at the expense of what amount of extra
>> lines, and potential slow-down for other datasets?
> [snip]
>> In the PEP 393 approach, if the string has a two-byte representation,
>> each character needs to widened to two bytes, and likewise for four
>> bytes. So three separate copies of the unrolled loop would be needed,
>> one for each target size.
>
> I fully support the declared purpose of the PEP, which I understand to 
> be to have a full,correct Unicode implementation on all new Python 
> releases without paying unnecessary space (and consequent time) 
> penalties. I think the erroneous length, iteration, indexing, and 
> slicing for strings with non-BMP chars in narrow builds needs to be 
> fixed for future versions. I think we should at least consider 
> alternatives to the PEP393 solution of double or quadrupling space if 
> needed for at least one char.
>
> In utf16.py, attached to http://bugs.python.org/issue12729
> I propose for consideration a prototype of different solution to the 
> 'mostly BMP chars, few non-BMP chars' case. Rather than expand every 
> character from 2 bytes to 4, attach an array cpdex of character (ie 
> code point, not code unit) indexes. Then for indexing and slicing, the 
> correction is simple, simpler than I first expected:
>   code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
> where code-unit-index is the adjusted index into the full underlying 
> double-byte array. This adds a time penalty of log2(len(cpdex)), but 
> avoids most of the space penalty and the consequent time penalty of 
> moving more bytes around and increasing cache misses.
>
> I believe the same idea would work for utf8 and the mostly-ascii case. 
> The main difference is that non-ascii chars have various byte sizes 
> rather than the 1 extra double-byte of non-BMP chars in UCS2 builds. 
> So the offset correction would not simply be the bisect-left return 
> but would require another lookup
>   byte-index = char-index + offsets[bisect-left(cpdex, char-index)]
>
> If possible, I would have the with-index-array versions be separate 
> subtypes, as in utf16.py. I believe either index-array implementation 
> might benefit from a subtype for single multi-unit chars, as a single 
> non-ASCII or non-BMP char does not need an auxiliary [0] array and a 
> senseless lookup therein but does need its length fixed at 1 instead 
> of the number of base array units.
>
So am I correctly reading between the lines when, after reading this 
thread so far, and the complete issue discussion so far, that I see a 
PEP 393 revision or replacement that has the following characteristics:

1) Narrow builds are dropped.  The conceptual idea of PEP 393 eliminates 
the need for narrow builds, as the internal string data structures 
adjust to the actuality of the data.  If you want a narrow build, just 
don't use code points > 65535.

2) There are more, or different, internal kinds of strings, which affect 
the processing patterns.  Here is an enumeration of the ones I can think 
of, as complete as possible, with recognition that benchmarking and 
clever algorithms may eliminate the need for some of them.

a) all ASCII
b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This 
kind may not be able to support a "mostly" variation, and may be no more 
efficient than case b).  But it might also be popular in parts of Europe 
:)  And appropriate benchmarks may discover whether or not it has worth.
c) mostly ASCII (utf8) with clever indexing/caching to be efficient
d) UTF-8 with clever indexing/caching to be efficient
e) 16-bit codepoints
f) UTF-16 with clever indexing/caching to be efficient
g) 32-bit codepoints
h) UTF-32

When instantiating a str, a new parameter or subtype would restrict the 
implementation to using only a), b), d), f), and h) when fully 
conformant Unicode behavior is desired.  No lone surrogates, no out of 
range code points, no illegal codepoints. A default str would prefer a), 
b), c), e), and g) for efficiency and flexibility.

When manipulations outside of Unicode are necessary [Windows seems to 
use e) for example, suffering from the same sorts of backward 
compatibility problems as Python, in some ways], the default str type 
would permit them, using e) and g) kinds of representations.  Although 
the surrogate escape codec only uses prefix surrogates (or is it only 
suffix ones?) which would never match up, note that a conversion from 
16-bit codepoints to other formats may produce matches between the 
results of the surrogate escape codec, and other unchecked data 
introduced by the user/program.

A method should be provided to validate and promote a string from 
default, unchecked str type to the subtype or variation that enforces 
Unicode, if it qualifies; if it doesn't qualify, an exception would be 
raised by the method.  (This could generally be done in place if the 
value is bound to only a single variable, but would generate a copy and 
rebind the variable to the promoted copy if it is multiply referenced?)

Another parameter or subtype of the conformant str would add grapheme 
support, which has a different set of rules for the clever 
indexing/caching, but could be applied to any of a)†, c)†, d), f), or h).

† It is unnecessary to apply clever indexing/caching to a) and c) kinds 
of string internals, because there is a one-to-one mapping between 
bytes, codepoints, and graphemes in these ranges.  So plain array 
indexing can be used in the implementation of these kinds.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20110824/9c09ff6b/attachment.html>