newb: comapring two strings

John Machin sjmachin at lexicon.net
Fri May 19 17:14:10 EDT 2006


Paul Rubin wrote:
> bearophileH... at lycos.com writes:
>>> I have suggested C because if the words are all of the same length then
>>> you have 30000^2 = 90 000 000 000 pairs to test.

>> Sorry, you have (n*(n-1))/2  pairs to test (~ 45 000 000 000).

> Still terrible.  Use a better algorithm!

To put all this in perspective, here's a very similar real-world
problem:

You have a customer database with 300,000 records. Some of them are
duplicated, because there are differences in recording the customers'
names, like a one-keystroke typo. The task is to locate pairs of rows
which are likely to be duplicates. 45 billion comaprisons[1] may be
ecxessive.

And here's a clue for a better algorithm: Knuth's TAOCP vol 3 [1973
edition] chapter 6 (searching) third page "as if the words were spelled
backwards". If you find yourself reading about soundex, you've
definitely gone too far :-)

[1]: "comapring" in the subject: is this a game of 'Spot the deliberate
mistake"? Is the OP counting a transposition of adjacent characters as
"varying by one character"?

Cheers




More information about the Python-list mailing list