Populating a dictionary, fast

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Mon Nov 12 07:52:24 EST 2007


On Sun, 11 Nov 2007 08:25:15 -0800, Michael Bacarella wrote:

> Firstly, thank you for all of your help so far, I really appreciate it.
> 
>> > So, you think Python's dict implementation degrades towards O(N)
>> > performance when it's fed millions of 64-bit pseudo-random longs?
>> 
>> No.
> 
> Yes.
> 
> I tried your code (with one change, time on feedback lines) and got the
> same terrible performance against my data set.

Hmmm... you're getting even worse performance than I am. I read the dict 
in pretty quickly, considering the limited memory on my system, and only 
run into trouble when I try deleting the dict.

But still... no, Python dicts shouldn't degrade that badly. Whatever is 
going on, it almost certainly isn't a problem with the implementation of 
dicts.

Here's a thought... try reading the data into one big list of key-value 
pairs instead of a dict. If you still get terrible performance, I think 
that pretty much eliminates the dict as the culprit.
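
Something along these lines (a rough sketch only; the file name 
'data.txt' and the whitespace-separated key/value format are 
placeholders, so adjust the parsing to whatever your data actually 
looks like):

# Read the pairs into a plain list instead of a dict.
pairs = []
for line in open('data.txt'):
    key, value = line.split()
    pairs.append((long(key), value))
print len(pairs), 'pairs read'

If building that list is just as slow as building the dict, then the 
dict isn't the problem.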



> To prove that my machine is sane, I ran the same against your generated
> sample file and got _excellent_ performance.  Start to finish in under a
> minute.

Not a fair comparison, because it is doing something completely 
different. But still valuable to see that your install isn't *completely* 
broken.


>> Notice that the dict is completely read into memory in just two and a
>> half minutes. The script then tries to delete the dict, and 32 minutes
>> later is still struggling. That's the point I got sick of waiting and
>> interrupted the script.
>>
>> Conclusion: it's a memory issue, or maybe a garbage collection issue,
>> not a problem with dicts.
> 
> As you can see, not the case at all against my data set.

Yes, I see now that your problems are worse than my problems.
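
For what it's worth, this is roughly how I'd time the two phases 
separately, so we can see whether the time goes into building the dict 
or into tearing it down (again a sketch; the file name and the parsing 
are placeholders for your real data):

import time

t = time.time()
d = {}
for line in open('data.txt'):
    key, value = line.split()
    d[long(key)] = value
print 'built dict of', len(d), 'items in', time.time() - t, 'seconds'

t = time.time()
del d
print 'deleted dict in', time.time() - t, 'seconds'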


>> (1) Presumably you don't want to run your app with the garbage
>> collector turned off. You might still want to play around with the gc
>> module to see what you can learn.
> 
> As you can see, your version did no better. :(

Somewhat better, but still struggling.
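
(To be explicit about what I mean by playing with the gc module: the 
obvious experiment is to switch the cyclic collector off just while the 
dict is being populated, then switch it back on. A sketch, with the same 
placeholder input as above:

import gc, time

gc.disable()    # don't run the cyclic collector while building the dict
t = time.time()
d = {}
for line in open('data.txt'):
    key, value = line.split()
    d[long(key)] = value
gc.enable()     # switch it back on once the dict is built
print len(d), 'items in', time.time() - t, 'seconds'

That should tell you how much of the cost is the collector repeatedly 
scanning the growing structures, as opposed to paging or something 
else.)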


>> (2) More memory will help avoid paging. If you can't get more memory,
>> try more virtual memory. It will still be slow, but at least the
>> operating system doesn't have to try moving blocks around as much.
> 
> The machine has 8GB, and is not doing anything else when I run this.

Yes, fair enough.


>> (3) Are you sure you need all eight-million-plus items in the cache
>> all at once?
> 
> Yes.

I remain skeptical, but what do I know? I don't even know what you're 
doing with the data once you have it. :-)


>> (4) There are lots of algorithms out there for dealing with data too
>> big to fit into main memory. Do some research.
> 
> It DOES fit into main memory and a dictionary is exactly the right way
> to do this.

I agree now.


I think that you and I are experiencing anomalous results. I'm not even 
certain that it is specifically a Python problem -- I'd be a lot happier 
if somebody got the same terrible performance on a different OS.

What do we have in common? We're using different versions of Python (you 
2.3, me 2.5) and different versions of Linux.

I wonder if we happen to have the same hardware... what's your CPU?



$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping        : 1
cpu MHz         : 1000.000
cache size      : 512 KB
...
processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping        : 1
cpu MHz         : 1000.000
cache size      : 512 KB
...




-- 
Steven.


