Large Dictionaries

Claudio Grondi claudio.grondi at freenet.de
Thu May 18 05:55:11 EDT 2006


Chris Foote wrote:
> Claudio Grondi wrote:
>> Chris Foote wrote:
>>> Klaas wrote:
>>>
>>>>> 22.2s  20m25s[3]
>>>>
>>>> 20m to insert 1m keys?  You are doing something wrong.
>>>
>>> I've put together some simplified test code, but the bsddb
>>> module gives 11m for 1M keys:
>>>
>> I have run your code for the bsddb on my P4 2.8 GHz and have got:
>> Number generator test for 1000000 number ranges
>>         with a maximum of 3 wildcard digits.
>> Wed May 17 16:34:06 2006 dictionary population started
>> Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
>> Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
>> Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, 
>> duration 104.3s
>>
>> It is surprising that the dictionary population takes the same time 
>> here, but BerkeleyDB inserts the records 6 times faster on my computer 
>> than on yours. I am running Python 2.4.2 on Windows XP SP2, and you?
> 
> Fedora Core 5 with ext3 filesystem.  The difference will be due to
> the way that Windows buffers writes for the filesystem you're using
> (it sounds like you're using a FAT-based file system).
Ok, according to the Windows Task Manager, the Python process 
reads/writes around 7 GByte(!) of data to the file system during the run 
of the BerkeleyDB test and the hard drive is continuously busy, while the 
file I found in the Temp directory always stays below 20 MByte. The hard 
drive access is probably the main reason for losing time - so here is a 
question for the BerkeleyDB experts:

Can BerkeleyDB be tuned via the Python bsddb3 interface to use only RAM, 
or, since BerkeleyDB is designed to scale to large amounts of data, does 
it make little sense to force it into RAM?

Chris, would a RAM disk maybe be the right way to go here, to save the 
time lost accessing the file stored in the file system on the hard drive?

The RAM requirements, according to the Windows XP Task Manager, are below 
100 MByte. I am using the NTFS file system (yes, I know that FAT is 
faster than NTFS in some configurations) on XP Professional SP2 without 
any tuning of file system caching. The CPU is 100% busy.

What CPU and RAM (SIMM, DDR, DDR2) do you have?  I have 2 GByte of fast 
DDR PC400/3200 dual-channel RAM. It seems that you are still not getting 
results within the range others experience when running your code, so I 
suppose it has something to do with the hardware you are using.

> 
>>> Number generator test for 1000000 number ranges
>>>         with a maximum of 3 wildcard digits.
>>> Wed May 17 22:18:17 2006 dictionary population started
>>> Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
>>> Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
>>> Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, 
>>> duration 665.6s
>>> Wed May 17 22:29:33 2006 StorageSQLite population started
>>> Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 
>>> 65.5s
>> As I don't have SQLite installed, it would be interesting to see 
>> whether the factor-of-10 speed difference between BerkeleyDB and 
>> SQLite can be confirmed by someone else.
>> Why is SQLite faster here? I suppose that SQLite first adds all the 
>> records and builds the index afterwards, once all the records are 
>> there (with db.commit()).
> 
> SQLite is way faster because BerkeleyDB always uses a disk file,
> and SQLite is in RAM only.
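
A minimal sketch of that pattern, as I understand it from your 
description - the whole database in RAM, all inserts first, the index 
and a single commit afterwards (the table layout is just an example; on 
Python 2.4 the module is pysqlite2.dbapi2 rather than sqlite3):

import sqlite3    # pysqlite2.dbapi2 on Python 2.4

conn = sqlite3.connect(':memory:')    # whole database lives in RAM
cur = conn.cursor()
cur.execute('CREATE TABLE numbers (key TEXT, value TEXT)')

# insert all records first ...
cur.executemany('INSERT INTO numbers VALUES (?, ?)',
                (('%07d' % i, 'payload') for i in xrange(1000000)))

# ... then build the index and commit once at the end
cur.execute('CREATE INDEX numbers_key_idx ON numbers (key)')
conn.commit()
conn.close()
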
One of the reasons I keep an eye on BerkeleyDB is that it claims to scale 
to huge amounts of data (terabytes), doesn't need as much RAM as a Python 
dictionary, and makes it unnecessary to save/load a pickled version of 
the data (i.e. here the dictionary) to/from RAM in order to work with it.
I guess that in your case BerkeleyDB is, for the named reasons, probably 
the right way to go, unless your data stays small and the Python 
dictionary holding it will always fit into RAM.
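
For comparison, this is roughly what I mean by not having to pickle the 
whole dictionary - the standard bsddb module already behaves like a 
dictionary stored on disk (the filename and key are just examples):

import bsddb

# dict-like object backed by a file; records are read and written on
# demand, so the whole structure never has to be loaded or pickled
d = bsddb.hashopen('numbers.db', 'c')    # 'c' = create if missing
d['0001234'] = 'some value'              # keys and values must be strings
print d['0001234']
d.sync()
d.close()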

Now I am curious to know which path you have decided to take, and why.

Claudio
> 
> Cheers,
> Chris



