Timing of string membership (was Re: hex dump w/ or w/out utf-8 chars)

Terry Reedy tjreedy at udel.edu
Mon Jul 15 00:18:46 EDT 2013


On 7/14/2013 10:56 AM, Chris Angelico wrote:
> On Sun, Jul 14, 2013 at 11:44 PM,  <wxjmfauth at gmail.com> wrote:

>>>>> timeit.repeat("a = 'hundred'; 'x' in a")
>> [0.11785943134991479, 0.09850454944486256, 0.09761604599423179]
>>>>> timeit.repeat("a = 'hundreœ'; 'x' in a")
>> [0.23955250303158593, 0.2195812612416752, 0.22133896997401692]
>>>>> sys.version
>> '3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'

An issue about finding strings in strings was opened last September and, 
as reported on this list, fixes were applied around last March. As I 
remember, some but not all of the optimizations were applied to 3.3. 
Perhaps some were applied too late for 3.3.1 (3.3.2 is 3.3.1 with some 
emergency patches to correct regressions).

Python 3.4.0a2:
 >>> import timeit
 >>> timeit.repeat("a = 'hundred'; 'x' in a")
[0.17396483610667152, 0.16277956641670813, 0.1627937074749941]
 >>> timeit.repeat("a = 'hundreo'; 'x' in a")
[0.18441108179403187, 0.16277311071618783, 0.16270517215355085]

The difference is gone, again, as previously reported.

> jmf has raised an interesting point. Some string membership operations
> do seem oddly slow.

He raised it a year ago and action was taken.

>
> # Get ourselves a longish ASCII string with no duplicates - escape
> apostrophe and backslash for code later on
>>>> asciichars=''.join(chr(i) for i in range(32,128)).replace("\\",r"\\").replace("'",r"\'")
>>>> haystack=[
> 	("ASCII",asciichars+"\u0001"),
> 	("BMP",asciichars+"\u1234"),
> 	("SMP",asciichars+"\U00012345"),
> ]
>>>> needle=[
> 	("ASCII","\u0002"),
> 	("BMP","\u1235"),
> 	("SMP","\U00012346"),
> ]
>>>> useset=[
> 	("",""),
> 	(", as set","; a=set(a)"),
> ]
>>>> for time,desc in sorted((min(timeit.repeat("'%s' in a"%n,("a='%s'"%h)+s)),"%s in %s%s"%(nd,hd,sd)) for nd,n in needle for hd,h in haystack for sd,s in useset):
> 	print("%.10f %s"%(time,desc))
>
> 0.1765129367 ASCII in ASCII, as set
> 0.1767096097 BMP in SMP, as set
> 0.1778647845 ASCII in BMP, as set
> 0.1785266004 BMP in BMP, as set
> 0.1789093307 SMP in SMP, as set
> 0.1790431465 SMP in BMP, as set
> 0.1796504863 BMP in ASCII, as set
> 0.1803854959 SMP in ASCII, as set
> 0.1810674262 ASCII in SMP, as set

Much of this time is overhead; 'pass' would not run too much faster.
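How much is easy to check by timing an empty statement (a rough sketch; 
absolute numbers depend on the machine and build):

    import timeit
    # Pure timeit loop overhead: an empty statement, default 1,000,000 runs.
    print(min(timeit.repeat("pass")))
    # The ", as set" cases above time only one hash lookup per iteration,
    # since in Chris's code the set is built in the setup string.
    print(min(timeit.repeat("'x' in a", "a = set('hundred')")))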

> 0.1817367850 SMP in BMP
> 0.1884555160 SMP in ASCII
> 0.2132371572 BMP in ASCII

For these, 3.3 does no searching because it knows from the internal char 
kind that the answer is No without looking.
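The idea is that a 3.3+ string records the width ('kind') of its widest 
character (PEP 393), so a needle character too wide for the haystack 
cannot possibly be in it. A Python-level sketch of that logic (not 
CPython's actual C code, which reads the stored kind from the string 
header rather than rescanning):

    def contains_char(haystack, ch):
        # max() here stands in for comparing the stored kinds; CPython
        # keeps that information in the string object and never rescans.
        if haystack and ord(ch) > max(map(ord, haystack)):
            return False          # answered without any searching
        return ch in haystack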

> 0.3137454621 ASCII in ASCII
> 0.4472624314 BMP in BMP
> 0.6672795006 SMP in SMP
> 0.7493052888 ASCII in BMP
> 0.9261783271 ASCII in SMP
> 0.9865787412 BMP in SMP

...

> Set membership is faster than string membership, though marginally on
> something this short. If the needle is wider than the haystack, it
> obviously can't be present, so a false return comes back at the speed
> of a set check.

Jim ignores these cases where 3.3+ uses the information about the max 
codepoint to do the operation much faster than in 3.2.
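The effect is easy to see directly (a sketch; exact figures vary by 
build and machine):

    import timeit
    # BMP needle, all-ASCII haystack: False is returned without scanning.
    print(min(timeit.repeat("'\\u1234' in a", "a = 'hundred'")))
    # Same-width needle: a real character-by-character search is done.
    print(min(timeit.repeat("'x' in a", "a = 'hundred'")))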

> Otherwise, an actual search must be done. Searching
> for characters in strings of the same width gets slower as the strings
> get larger in memory (unsurprising). What I'm seeing of the top-end
> results, though, is that the search for a narrower string in a wider
> one is quite significantly slower.

Taking about 50% longer is not bad.

> I don't know of an actual proven use-case for this, but it seems
> likely to happen (eg you take user input and want to know if there are
> any HTML-sensitive characters in it, so you check ('<' in string or
> '&' in string), for instance).

In my editing of code, I nearly always search for words or long names.

> The question is, is it worth
> constructing an "expanded string" at the haystack's width prior to
> doing the search?

I would not make any assumptions about what Python does or does not do 
without checking the code. All I know is that Python uses a modified 
version of one of the pre-process and skip-forward algorithms 
(Boyer-Moore? Knuth-Morris-Pratt? I forget). These are designed to work 
efficiently with needles longer than 1 char, and indeed may work better 
with longer needles. Searching for a single char in n chars is O(n). 
Searching for a needle of length m is potentially O(m*n), and the point 
of the fancy algorithms is to make all searches as close to O(n) as 
possible.
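For illustration only, here is a minimal Boyer-Moore-Horspool-style 
search in pure Python (this is not CPython's actual fastsearch code, 
just a sketch of how a skip table lets longer needles move through the 
haystack in fewer comparisons):

    def horspool_find(haystack, needle):
        # Return the index of the first occurrence of needle, or -1.
        m, n = len(needle), len(haystack)
        if m == 0:
            return 0
        # Shift allowed when the haystack char aligned with the needle's
        # last position is c; chars not in the needle permit a full shift.
        skip = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
        i = m - 1                      # haystack index under the needle's end
        while i < n:
            j, k = m - 1, i
            while j >= 0 and haystack[k] == needle[j]:
                j -= 1
                k -= 1
            if j < 0:
                return k + 1           # full match found
            i += skip.get(haystack[i], m)
        return -1

    # With a 1-char needle the skip table is empty and every shift is 1,
    # so this degrades to the plain O(n) scan mentioned above.
    print(horspool_find("hundreo", "x"))                   # -1
    print(horspool_find("the quick brown fox", "brown"))   # 10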

-- 
Terry Jan Reedy




