Timing of string membership (was Re: hex dump w/ or w/out utf-8 chars)

Sun Jul 14 10:56:29 EDT 2013

On Sun, Jul 14, 2013 at 11:44 PM,  <wxjmfauth at gmail.com> wrote:
> On Sunday, 14 July 2013 at 12:44:12 UTC+2, Steven D'Aprano wrote:
>> On Sun, 14 Jul 2013 01:20:33 -0700, wxjmfauth wrote:
>>
>> > For a very simple reason, the latin-1 block: considered and accepted
>> > today as being a Unicode design mistake.
>>
>> Latin-1 (also known as ISO-8859-1) was based on DEC's "Multinational
>> Character Set", which goes back to 1983. ISO-8859-1 was first published
>> in 1985, and was in use on Commodore computers the same year.
>>
>> The concept of Unicode wasn't even started until 1987, and the first
>> draft wasn't published until the end of 1990. Unicode wasn't considered
>> ready for production use until 1991, six years after Latin-1 was already
>> in use in people's computers.
>>
>> --
>> Steven
>
> ------
>
> "Unicode" (in fact iso-14xxx) was not created in one
> night (Deus ex machina).
>
> What counts today is this:
>
> >>> timeit.repeat("a = 'hundred'; 'x' in a")
> [0.11785943134991479, 0.09850454944486256, 0.09761604599423179]
> >>> timeit.repeat("a = 'hundreœ'; 'x' in a")
> [0.23955250303158593, 0.2195812612416752, 0.22133896997401692]
> >>>
> >>>
> >>> sys.getsizeof('d')
> 26
> >>> sys.getsizeof('œ')
> 40
> >>> sys.version
> '3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'

jmf has raised an interesting point. Some string membership operations
do seem oddly slow.

>>> import timeit
# Get ourselves a longish ASCII string with no duplicates - escape
# apostrophe and backslash for code later on
>>> asciichars=''.join(chr(i) for i in range(32,128)).replace("\\",r"\\").replace("'",r"\'")
>>> haystack=[
	("ASCII",asciichars+"\u0001"),
	("BMP",asciichars+"\u1234"),
	("SMP",asciichars+"\U00012345"),
]
>>> needle=[
	("ASCII","\u0002"),
	("BMP","\u1235"),
	("SMP","\U00012346"),
]
>>> useset=[
	("",""),
	(", as set","; a=set(a)"),
]
>>> for time,desc in sorted((min(timeit.repeat("'%s' in a"%n,("a='%s'"%h)+s)),"%s in %s%s"%(nd,hd,sd)) for nd,n in needle for hd,h in haystack for sd,s in useset):
	print("%.10f %s"%(time,desc))

0.1765129367 ASCII in ASCII, as set
0.1767096097 BMP in SMP, as set
0.1778647845 ASCII in BMP, as set
0.1785266004 BMP in BMP, as set
0.1789093307 SMP in SMP, as set
0.1790431465 SMP in BMP, as set
0.1796504863 BMP in ASCII, as set
0.1803854959 SMP in ASCII, as set
0.1810674262 ASCII in SMP, as set
0.1817367850 SMP in BMP
0.1884555160 SMP in ASCII
0.2132371572 BMP in ASCII
0.3137454621 ASCII in ASCII
0.4472624314 BMP in BMP
0.6672795006 SMP in SMP
0.7493052888 ASCII in BMP
0.9261783271 ASCII in SMP
0.9865787412 BMP in SMP

(In separate testing I ascertained that it makes little difference
whether the character is absent from the string or is the last
character in it. Presumably the figures would be lower if the
character were at the start of the string, but this is not germane to
this discussion.)
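
For anyone who finds the one-liner above hard to read, here is a rough
unrolled equivalent. It is only a sketch: the names run_benchmark,
HAYSTACKS, NEEDLES and CONTAINERS are made up for this illustration, and
the number/repeat arguments are passed explicitly so that a quick run is
possible (timeit's own default is 1000000 iterations per measurement, which
is what the figures above used).

```python
import timeit

# Printable ASCII, with backslash and apostrophe escaped so the text can be
# embedded inside a single-quoted literal in the timed code.
ascii_chars = "".join(chr(i) for i in range(32, 128))
ascii_chars = ascii_chars.replace("\\", r"\\").replace("'", r"\'")

# (label, text) pairs; the last character decides the storage width.
HAYSTACKS = [
    ("ASCII", ascii_chars + "\u0001"),
    ("BMP", ascii_chars + "\u1234"),
    ("SMP", ascii_chars + "\U00012345"),
]
NEEDLES = [
    ("ASCII", "\u0002"),
    ("BMP", "\u1235"),
    ("SMP", "\U00012346"),
]
# (label suffix, extra setup code): search the string itself, or a set of it.
CONTAINERS = [
    ("", ""),
    (", as set", "; a=set(a)"),
]


def run_benchmark(number=1000000, repeat=3):
    """Return sorted (best_time, description) pairs for every combination."""
    results = []
    for needle_name, needle in NEEDLES:
        for hay_name, hay in HAYSTACKS:
            for suffix, extra_setup in CONTAINERS:
                stmt = "'%s' in a" % needle
                setup = "a='%s'%s" % (hay, extra_setup)
                best = min(timeit.repeat(stmt, setup,
                                         number=number, repeat=repeat))
                desc = "%s in %s%s" % (needle_name, hay_name, suffix)
                results.append((best, desc))
    return sorted(results)


# Quick run with fewer iterations than timeit's default, so the absolute
# numbers are not comparable with the table above; only the ordering is.
for best, desc in run_benchmark(number=100000):
    print("%.10f %s" % (best, desc))
```

Taking the minimum of the repeats follows the original one-liner; the
ordering you get will depend on your Python version and build.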

Set membership is faster than string membership, though only marginally
on something this short. If the needle is wider than the haystack (in
bytes per character of the internal representation), it obviously can't
be present, so a false return comes back at about the speed of a set
check. Otherwise, an actual search must be done. Searching for
characters in strings of the same width gets slower as the strings get
larger in memory (unsurprising). What I'm seeing in the slowest
results, though, is that the search for a narrower string in a wider
one is quite significantly slower still.
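
To make "width" concrete: since 3.3 (PEP 393), CPython stores a string
using 1, 2 or 4 bytes per character, depending on the highest code point
in it. The sketch below only models that rule in pure Python; the helper
names char_width and cannot_contain are invented for illustration and are
not anything the interpreter exposes.

```python
def char_width(s):
    """Bytes per character CPython 3.3+ would need to store s (PEP 393)."""
    top = max(map(ord, s), default=0)  # highest code point, 0 if s is empty
    if top < 0x100:
        return 1   # ASCII / Latin-1 range
    if top < 0x10000:
        return 2   # rest of the Basic Multilingual Plane
    return 4       # supplementary planes


def cannot_contain(needle, haystack):
    """True when the needle is wider than the haystack, so no search is needed."""
    return char_width(needle) > char_width(haystack)


print(char_width("hundred"), char_width("\u1234"), char_width("\U00012345"))
print(cannot_contain("\u1235", "hundred"))
```

This is the shortcut that would explain the fast "SMP in BMP", "SMP in
ASCII" and "BMP in ASCII" rows: a wider needle contains at least one
character that the narrower haystack cannot hold.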

I don't know of an actual proven use-case for this, but it seems
likely to happen (eg you take user input and want to know if there are
any HTML-sensitive characters in it, so you check ('<' in string or
'&' in string)). The question is, is it worth constructing an
"expanded string" at the haystack's width prior to doing the search?
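
If you want to see whether that use-case is affected on your own build,
something like the following would do. It is a minimal sketch, not a
rigorous benchmark: has_html_sensitive, time_check and the sample inputs
are hypothetical, and the per-call function overhead is included in the
timings.

```python
import timeit


def has_html_sensitive(text):
    """Return True if text contains '<' or '&' (the check described above)."""
    return "<" in text or "&" in text


def time_check(text, number=100000):
    """Best-of-three time for `number` calls of has_html_sensitive(text)."""
    timer = timeit.Timer(lambda: has_html_sensitive(text))
    return min(timer.repeat(repeat=3, number=number))


# Hypothetical inputs: the same ASCII body, plus one trailing character
# chosen to force 1-, 2- or 4-byte storage in CPython 3.3+.
body = "plain user input without markup " * 8
samples = [
    ("ASCII", body + "x"),
    ("BMP", body + "\u1234"),
    ("SMP", body + "\U00012345"),
]

for label, text in samples:
    print("%-5s %.6f" % (label, time_check(text)))
```

None of the samples contains '<' or '&', so both searches scan the whole
string each time, which is the worst case for this check.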

ChrisA