Comparing strings from the back?

Johannes Bauer dfnsonfsduifb at gmx.de
Wed Sep 5 05:43:08 EDT 2012


On 05.09.2012 11:24, Johannes Bauer wrote:

> Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
> assumes O(1) comparisons! If the comparison operation itself were in
> O(n), the total sorting complexity would be O(n^2 log(n)), which is
> definitely false. Most comparisons will abort *very* early (after the
> first character).

I've just written a program to demonstrate this (below it's attached). I
used my aspell dictionary, which contains about 100k words. To have
complexity down I went for only words wich occur most often (in this
case 9 characters or ~13k words). Here's the results (note it's
randomized, so they'll always differ a little, but are pretty stable)

Max Cmp: 100
Sum Cmp: 32111
Avg Cmp: 2.393842254361115
Num words  : 13414
Min wordlen: 9
Max wordlen: 9
Avg wordlen: 9.0

Going up in wordlength (only showing part now)

Avg Cmp: 2.2374929735806632
Num words  : 10674
Avg wordlen: 10.0

Avg Cmp: 2.261727078891258
Num words  : 7504
Avg wordlen: 11.0

Avg Cmp: 2.2335647202939977
Num words  : 4898
Avg wordlen: 12.0

Avg Cmp: 2.1743341404358354
Num words  : 2891
Avg wordlen: 13.0

Avg Cmp: 2.156782549420586
Num words  : 1467
Avg wordlen: 14.0

Avg Cmp: 2.112449799196787
Num words  : 747
Avg wordlen: 15.0

So, interestingly, in this real-world case(tm), the complexity does not
scale with O(n). It actually goes down (which is probably due to the
limit amount of words of higher length).

In summary, let's please not debate "feelings" or such. Either
measurements (with clear indiciation on what data they're based on and
how the test was conducted) or derivations (with an explanation of the
assumtions). Feelings can be very misleading in cases like this.

Best regards,
Johannes



#!/usr/bin/python3
import random

class Word():
	def __init__(self, s):
		self._s = s
		self._c = 0

	def __lt__(self, other):
		if len(self) < len(other):
			return True
		elif len(self) > len(other):
			return False

		for i in range(len(self)):
			self._c += 1
			if self._s[i] < other._s[i]:
				return True
		return False

	def __repr__(self):
		return "%s<%d>" % (self._s, self._c)

	def __len__(self):
		return len(self._s)

	def getcmp(self):
		return self._c

wordlist = [ ]
for line in open("dict", "r"):
	word = Word(line[:-1])
	if len(word) == 9:
		wordlist.append(word)

random.shuffle(wordlist)
wordlist.sort()

comparisons = [ word.getcmp() for word in wordlist ]
print("Max Cmp:", max(comparisons))
print("Sum Cmp:", sum(comparisons))
print("Avg Cmp:", sum(comparisons) / len(comparisons))

wordlengths = [ len(word) for word in wordlist ]
print("Min wordlen:", min(wordlengths))
print("Max wordlen:", max(wordlengths))
print("Avg wordlen:", sum(wordlengths) / len(wordlengths))


-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1 at speranza.aioe.org>



More information about the Python-list mailing list