It all boils down to how much space your keys take. When you look for dupes, you only need to hold the keys in memory, not the data (it'll be a lot faster this way). I'd say create a bsddb database with btree ordering to hold all your keys. It should take about 20 minutes to fill. Then scan it in sorted key order, and duplicates will appear next to each other.
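
Here's a rough sketch of the sorted-scan idea. Note that it does not use bsddb: that module isn't in the Python 3 standard library, so I've swapped in sqlite3 as a stand-in, since its indexes are also B-trees. The function name `find_duplicate_keys` and the table layout are made up for illustration, and I'm assuming the keys are strings.

```python
import sqlite3


def find_duplicate_keys(keys, db_path=":memory:"):
    """Return each key that occurs more than once, in sorted order.

    Stand-in for the bsddb btree approach: only the keys are stored,
    and a B-tree index keeps them in sorted order for the scan.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical one-column table: keys only, never the data.
        conn.execute("CREATE TABLE keys (k TEXT NOT NULL)")
        conn.executemany("INSERT INTO keys (k) VALUES (?)", ((k,) for k in keys))
        # Build the B-tree index after the bulk load.
        conn.execute("CREATE INDEX idx_k ON keys (k)")
        conn.commit()

        duplicates = []
        previous = None
        # In sorted order, equal keys are adjacent, so comparing each key
        # with the one before it is enough to spot duplicates.
        for (key,) in conn.execute("SELECT k FROM keys ORDER BY k"):
            if key == previous and (not duplicates or duplicates[-1] != key):
                duplicates.append(key)
            previous = key
        return duplicates
    finally:
        conn.close()


if __name__ == "__main__":
    print(find_duplicate_keys(["b", "a", "c", "a", "b", "a"]))
```

Pass a real file path as `db_path` if the keys won't fit in memory. The scan itself only ever keeps the previous key around.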