What is different with Python ?

Sun Jun 12 08:22:27 EDT 2005

On Sun, 12 Jun 2005, Mike Meyer wrote:

> For instance, one problem was "You have two files that have lists of 1
> billion names in them. Print out a list of the names that only occur
> in one of the files."
>
> That's a one-line shell script: "comm -12 <(sort file_one) <(sort file_two)"

Incidentally, how long does sorting two billion lines of text take?

The complementary question, of course, is "how long does it take to come 
up with an algorithm for solving this problem that doesn't involve sorting 
the files?"!

the best thing i can come up with off the top of my head is making a pass 
over one file to build a Bloom filter [1] describing its contents, then 
going over the second file, checking if each name is in the filter, and if 
it is, putting it in a hashtable, then making a second pass over the first 
file, checking if each name is in the hashtable. this would work without 
the filter, but would require storing a billion names in the hashtable; 
the idea is that using the filter allows you to cut this down to a 
tractable level. that said, i'm not sure if it would work in practice - if 
you have a billion names, even if you have a filter a gigabyte in size, 
you still have a 2% false positive rate [2], which is 20 million names.

tom

[1] http://en.wikipedia.org/wiki/Bloom_filter
[2] http://www.cc.gatech.edu/fac/Pete.Manolios/bloom-filters/calculator.html

-- 
Think logical, act incremental