[Q:] hash table performance!

liwen_cao at my-deja.com
Wed Jun 14 05:56:12 EDT 2000


Greetings,

I'm doing a project on large-volume information processing. One of the
tasks is to find the duplicated files under a directory. I believe
Python is a good, powerful tool for that (yes it is, I've implemented
it in about 40 lines of code). However, performance IS a problem! Since
I'm using a dictionary as the hash table (hash_table = {} ...), I
suspect the bottleneck is in the hash table: how can a generic hash
table fit every case?

Is there a way to customize the size or the hash algorithm of the hash
table in Python? Or can anyone describe how the Python hash table
works?
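
(For illustration, a minimal sketch of the one hook I know of: the dict
hashes keys by calling the key object's own __hash__, so a wrapper key
class can control what gets hashed. The class name ShortKey is invented
here, not part of any real API.)

    class ShortKey:
        """Hypothetical wrapper key: hashes only the first 8 hex characters
        of an MD5 digest string; equality still compares the full digest."""
        def __init__(self, digest):
            self.digest = digest

        def __hash__(self):
            # The dict calls this to choose the slot for the key.
            return hash(self.digest[:8])

        def __eq__(self, other):
            # The dict calls this to resolve collisions within a slot.
            return isinstance(other, ShortKey) and self.digest == other.digest

    hash_table = {}
    hash_table[ShortKey("d41d8cd98f00b204e9800998ecf8427e")] = "empty.txt"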

Thanks in advance; any hints will be helpful.

P.S. Background of the duplicate-file check:
My way of doing it is simple: walk the directories and files, compute
an MD5 digest for every file, use the digest as the hash key, and
insert the file name into the hash table. When two files have the same
key, compare their contents byte by byte.
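
(For reference, a rough sketch of that approach; hashlib is the modern
stand-in for the old md5 module, and names like md5_of and
find_duplicates are invented here, not the actual 40-line script.)

    import os
    import hashlib
    from collections import defaultdict

    def md5_of(path, chunk_size=65536):
        # Read the file in chunks so large files need not fit in memory.
        h = hashlib.md5()
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(root):
        # Map each digest to the list of files that produced it.
        by_digest = defaultdict(list)
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_digest[md5_of(path)].append(path)
                except OSError:
                    pass  # unreadable file, skip it
        # Digests shared by more than one file are duplicate candidates;
        # a byte-by-byte compare (e.g. filecmp.cmp(a, b, shallow=False))
        # would confirm them, as described above.
        return dict((d, paths) for d, paths in by_digest.items()
                    if len(paths) > 1)

Printing find_duplicates('.') then lists each digest together with the
files that share it.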




