FSR and unicode compliance - was Re: RE Module Performance

Sun Jul 28 14:03:48 EDT 2013

On Sun, Jul 28, 2013 at 6:36 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> I posted about a week ago, in response to Chris A., a method by which lookup
> for UTF-16 can be made O(log2 k), or perhaps more accurately,
> O(1+log2(k+1)), where k is the number of non-BMP chars in the string.
>

Which is an optimization choice that favours strings containing very
few non-BMP characters. To justify the extra complexity of out-of-band
storage, you would need to be working almost exclusively with BMP
characters. That would drastically improve jmf's microbenchmarks, which
do exactly that, but it would penalize strings that are almost exclusively
higher-codepoint characters. Whether it is worth doing, then, would have
to be settled by a major survey of string usage: are there enough
strings that are mostly BMP with a few SMP characters? Bear in mind
that pure-BMP strings are already handled better by PEP 393, so this is
only of value when those mixed strings actually occur.
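
For concreteness, here is a rough sketch of the kind of out-of-band
index being discussed. It is not Terry's actual code, only one way such
a scheme might look: the class name IndexedUTF16 and its attributes are
made up for illustration, and it assumes the input contains no lone
surrogates. Alongside the UTF-16 code units it keeps a sorted list of
the positions of the non-BMP characters, and a lookup bisects that list
to find how many extra code units precede the requested index.

```python
from bisect import bisect_left


class IndexedUTF16:
    """Hypothetical UTF-16 string with a side index of non-BMP positions."""

    def __init__(self, text):
        self.units = []   # UTF-16 code units, surrogate pairs included
        self.astral = []  # code point indices of non-BMP chars, ascending
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp > 0xFFFF:
                # Non-BMP: record its position and store a surrogate pair.
                self.astral.append(i)
                cp -= 0x10000
                self.units.append(0xD800 | (cp >> 10))
                self.units.append(0xDC00 | (cp & 0x3FF))
            else:
                self.units.append(cp)

    def __len__(self):
        # Every non-BMP character accounts for one extra code unit.
        return len(self.units) - len(self.astral)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # Binary search over the k non-BMP positions: O(log k).
        # Each non-BMP character before i shifts the offset by one unit.
        offset = i + bisect_left(self.astral, i)
        unit = self.units[offset]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: combine with the following low surrogate.
            low = self.units[offset + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)


text = "a\U0001F600b\U00010000c"
s = IndexedUTF16(text)
decoded = [s[i] for i in range(len(s))]
```

In this sketch a pure-BMP string carries an empty index, while a string
made up entirely of non-BMP characters carries one index entry per
character on top of its two code units each, which is the trade-off
described above.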

ChrisA