Duplicates in lists

Dana Booth dana at oz.net
Thu Mar 9 02:51:10 EST 2000


Michael Husmann <michael.husmann at teleatlas.com> wrote:

MH: is there someone who has an efficient function that finds
MH: all duplicates in a list?
MH: I used a hash which works quite reasonably but maybe there
MH: is a better way.
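The hash approach Michael mentions is hard to beat for this; a minimal sketch in (modern) Python, counting occurrences with a dict-backed `Counter`:

```python
from collections import Counter

def find_duplicates(items):
    """Return the items that appear more than once, via a hash-based count."""
    counts = Counter(items)              # one pass to count each item
    return [item for item, n in counts.items() if n > 1]

print(find_duplicates([1, 2, 2, 3, 3, 3, 4]))  # -> [2, 3]
```

This is O(n) on average, at the cost of holding the counts in memory.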

I once made a Perl script that did that; it's quite fast, since I needed to
de-dupe very large text files (over 80,000 lines, and still growing today).
The concept could be exported easily if you wanted to.

Getting that speed sacrificed the integrity of the list in a small way,
though, in that the list is re-ordered. Here's how it worked...

The entire file is read one line at a time into an array. Then the built-in
sort function sorts the array. Another variable is assigned a joined version
of the list (that variable will be very large). Now, read through the list
from the beginning, one element at a time. Find the length of the current
element, and use that as a byte count to advance through the large string
variable to the first byte past the end of the current element. Then look
ahead in the large string variable for a substring match.
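The same sort-first idea can be sketched in Python rather than Perl. After sorting, any duplicates sit next to each other, so a simple adjacent-element comparison (used here in place of the joined-string substring look-ahead described above, which it is equivalent to) is enough:

```python
def dedupe_lines(lines):
    """De-dupe by sorting first so duplicates land next to each other.
    Note: as in the Perl script, the output is re-ordered (sorted)."""
    out = []
    prev = None
    for line in sorted(lines):
        if line != prev:      # adjacent compare stands in for the look-ahead
            out.append(line)
        prev = line
    return out

print(dedupe_lines(["banana", "apple", "banana", "cherry", "apple"]))
# -> ['apple', 'banana', 'cherry']
```

The cost is dominated by the O(n log n) sort, which is why it stays fast even on large files.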

Anyway, the whole script is only about 20 lines. On a department file server,
an OpenBSD i386 machine with 256 MB of RAM, it takes about 2 seconds to
de-dupe the ~80,000-line text file. On another (Linux) machine with only
32 MB of RAM, it takes about 45 seconds, so memory matters a lot...

Hope this helps!

-- 
-----
Dana Booth <dana at oz.net>
Tacoma, Wa., USA

key at pgpkeys.mit.edu:11371


