Comparing strings from the back?

Johannes Bauer dfnsonfsduifb at gmx.de
Thu Sep 6 09:37:23 EDT 2012


On 05.09.2012 18:24, Steven D'Aprano wrote:
> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
> [...]
>>> You are making unjustified assumptions about the distribution of
>>> letters in the words. This might be a list of long chemical compounds
>>> where the words typically differ only in their suffix. It might be a
>>> list of people with titles:
>>
>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>> get my calculation. I'm comparing *random* character strings or
>> bitstrings.
> 
> Excuse me, you are not. You are comparing English words which are highly 
> non-random.

Not in my original post. If you read it again, you will clearly see that
I was talking about purely random strings. And since you like to
nitpick, I'll clarify further: I'm talking about bitstrings in which
every bit of every character has the same probability of occurrence, 50%.

You then replied by mentioning probability distributions of characters
in different languages/encodings, to which I tried to give an example
as it might (and does) occur in the real world, like sorting a
dictionary.

But the original point is still valid: comparing randomized bitstrings
is definitely not O(N) on average; you got that wrong.

>> You, on the other hand, are making vague assumptions which you do not
>> care to formalize, and yet you claim that "the number of comparisons is
>> equally likely to be 1, 2, 3, ..., N. The average then is". Without any
>> explanation for this. At all.
> 
> I will accept that my explanation was not good enough, but I strongly 
> disagree that I gave no explanation at all.

What possible explanation could there be for the claim that comparing
purely random bitstrings is equally likely to stop after 1, 2, 3, ...,
N character comparisons?

>>> Herr Professor Frederick Schmidt
>>> Herr Professor Frederick Wagner
>>> ...
>>
>> Is your assumption that we're comparing words that have the common prefix
>> "Herr Professor Frederick "? 
> 
> No, I am pointing out that *your* assumption that most string comparisons 
> will halt close to the beginning of the string is an invalid assumption. 
> Your assumption only holds for some non-random strings.

Definitely not. It *especially* holds true for random strings. For
random strings over an alphabet of size n, the probability that the
first character needs to be compared is n^0, i.e. 1. The probability
that the second character needs to be compared is n^(-1), and for the
x-th character it is n^(-(x-1)). This is due to the lazy abort in the
comparison, and it means the average number of comparisons is bounded
by a geometric series: it is at most n/(n-1), i.e. O(1) in the string
length N, no matter how long the strings get.
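
To make that concrete, here is a minimal sketch (my own illustration,
not from the earlier posts) that evaluates the partial geometric series
for the binary alphabet used by the program below:

# Expected number of character comparisons for two random strings of
# length N over an alphabet of size n: position x is reached with
# probability n**-(x - 1), so the expectation is the partial geometric
# series, bounded above by n / (n - 1) (= 2 for the binary alphabet).
def expected_comparisons(N, n):
	return sum(n ** -(x - 1) for x in range(1, N + 1))

for N in (6, 8, 10, 12):
	print("length %2d: %.5f" % (N, expected_comparisons(N, 2)))

These values converge to the bound 2 and match the measured averages
further down exactly.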

>> it's counting the number of character comparisons.
> 
> Correct. But only out of an extremely limited subset of all possible 
> string comparisons, namely those very few that happen when sorting lists 
> of English words using a very specific algorithm, namely timsort.

Yes, but this was to look at a real-world example (in which way more
comparisons are needed than in the random case). You were talking
about randomized bitstrings first (and so was I); then you brought up
character sets and languages, which are entirely irrelevant to the
original problem.

> Whatever figure you have calculated by taking this non-random selection, 
> it is irrelevant to the question of the average performance of string 
> equality given random strings.

Correct. I gave the estimate for random strings in my very first post.

> For the sake of simple calculations, let's pretend that there are only 
> 1000 five-letter strings possible. We want to know how many character-
> comparisons are done on average when testing two random five-letter 
> strings for equality. To calculate this average, you must compare every 
> string to every other string, *including* itself.
> 
> This gives 1000*1000 = one million equality tests to be performed. For 
> each equality test, we count the number of character comparisons 
> performed, sum all those counts, and divide by 1e6. That is the average 
> number of char comparisons for random strings of five letters.

Easy enough. Since you didn't look at my original equation, here are
some numbers (the program that produced them is attached below):

64 combinations for length 6
Char comparisons: 8064
Comparison cnt  : 4096
Avg comparison  : 1.97

128 combinations for length 7
Char comparisons: 32512
Comparison cnt  : 16384
Avg comparison  : 1.98

256 combinations for length 8
Char comparisons: 130560
Comparison cnt  : 65536
Avg comparison  : 1.99

512 combinations for length 9
Char comparisons: 523264
Comparison cnt  : 262144
Avg comparison  : 2.00

1024 combinations for length 10
Char comparisons: 2095104
Comparison cnt  : 1048576
Avg comparison  : 2.00

2048 combinations for length 11
Char comparisons: 8384512
Comparison cnt  : 4194304
Avg comparison  : 2.00

4096 combinations for length 12
Char comparisons: 33546240
Comparison cnt  : 16777216
Avg comparison  : 2.00
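
These measurements match the geometric series exactly: for the binary
alphabet the closed form is E(N) = 2 - 2^(1 - N), e.g. 2 - 1/32 =
1.96875 for length 6, converging to 2 as the length grows. A one-line
check (my addition, not part of the attached program):

for N in range(6, 13):
	print("length %2d: exact average %.5f" % (N, 2 - 2 ** (1 - N)))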

Hopefully now you'll see that your assumed O(N) average is dead wrong.

> But that's not what you do. First you eliminate 900 out of the 1000 
> possible strings by only sampling English words -- strings like "xxoij" 
> cannot possibly be selected. So you are left with the highly non-random 
> subset of 10000 equality tests.

This is *exactly* what my original calculation did: enumerate all
possibilities and divide by the total number.

Regards,
Johannes

#!/usr/bin/python3
import itertools

# Binary alphabet: every character of a random string is "0" or "1"
# with equal probability.
alphabet = list("01")

def cmpstr(s1, s2):
	# Count the character comparisons an equality test performs, with
	# the lazy abort at the first mismatching position.
	c = 0
	for i in range(len(s1)):
		c += 1
		if s1[i] != s2[i]:
			break
	return c


for strlength in range(1, 16):
	# Enumerate every possible bitstring of this length.
	combinations = list(itertools.product(alphabet, repeat=strlength))

	assert len(combinations) == len(alphabet) ** strlength
	print("%d combinations for length %d" % (len(combinations), strlength))

	# Compare every string against every other string, including
	# itself, and average the comparison counts.
	compcnt = 0
	comparisons = 0
	for c1 in combinations:
		for c2 in combinations:
			comparisons += cmpstr(c1, c2)
			compcnt += 1
	print("Char comparisons: %d" % (comparisons))
	print("Comparison cnt  : %d" % (compcnt))
	print("Avg comparison  : %.2f" % (comparisons / compcnt))

-- 
>> Where exactly did you predict the quake again?
> At least not publicly!
Ah, the latest and to this day most ingenious trick of our great
cosmologists: the secret prediction.
 - Karl Kaos about Rüdiger Thomas in dsa <hidbv3$om2$1 at speranza.aioe.org>


