Timing of string membership (was Re: hex dump w/ or w/out utf-8 chars)

Mon Jul 15 00:31:13 EDT 2013

On Mon, Jul 15, 2013 at 2:18 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 7/14/2013 10:56 AM, Chris Angelico wrote:
> As issue about finding stings in strings was opened last September and, as
> reported on this list, fixes were applied about last March. As I remember,
> some but not all of the optimizations were applied to 3.3. Perhaps some were
> applied too late for 3.3.1 (3.3.2 is 3.3.1 with some emergency patches to
> correct regressions).

D'oh. I knew there was something raised and solved regarding that, but
I forgot to go check a 3.4 alpha to see if it exhibited the same.
Whoops. My bad. Sorry!

> Python 3.4.0a2:
>>>> import timeit
>
>>>> timeit.repeat("a = 'hundred'; 'x' in a")
> [0.17396483610667152, 0.16277956641670813, 0.1627937074749941]
>>>> timeit.repeat("a = 'hundreo'; 'x' in a")
> [0.18441108179403187, 0.16277311071618783, 0.16270517215355085]
>
> The difference is gone, again, as previously reported.

Yep, that looks exactly like I would have hoped it would.

>> 0.1765129367 ASCII in ASCII, as set
>
> Much of this time is overhead; 'pass' would not run too much faster.
>
>> 0.1817367850 SMP in BMP
>> 0.1884555160 SMP in ASCII
>> 0.2132371572 BMP in ASCII
>
> For these, 3.3 does no searching because it knows from the internal char
> kind that the answer is No without looking.

Yeah, I mainly included those results so I could say to jmf "Look, FSR
allows some string membership operations to be, I kid you not, as fast
as set operations!".

>> 0.3137454621 ASCII in ASCII
>> 0.4472624314 BMP in BMP
>> 0.6672795006 SMP in SMP
>> 0.7493052888 ASCII in BMP
>> 0.9261783271 ASCII in SMP
>> 0.9865787412 BMP in SMP
>
>> Otherwise, an actual search must be done. Searching
>> for characters in strings of the same width gets slower as the strings
>> get larger in memory (unsurprising). What I'm seeing of the top-end
>> results, though, is that the search for a narrower string in a wider
>> one is quite significantly slower.
>
> 50% longer is not bad, even

Hard to give an estimate; my first tests were the ASCII in ASCII and
ASCII in BMP, which then looked more like 2:1 time. However, rescaling
the needle to BMP makes it more like the 50% you're quoting, so yes,
it's not as much as I thought.

In any case, the most important thing to note is: 3.4 has already
fixed this, ergo jmf should shut up about it. And here I thought I
could credit him with a second actually-useful report...

ChrisA