Writing huge Sets() to disk

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Mon Jan 17 07:32:27 EST 2005


Duncan Booth wrote:
> Martin MOKREJŠ wrote:
> 
> 
>>Hi,
>>  could someone tell me what does and what doesn't copy
>>references in Python? I have found that my script, after reaching
>>some state and taking say 600MB, pushes its internal dictionaries
>>to hard disk. The for loop consumes another 300MB (as gathered
>>by vmstat) to push the data to dictionaries, then releases
>>a little bit less than 300MB, and the program starts to fill up
>>its internal dictionaries again; when "full" it will do the
>>flush again ...
> 
> 
> Almost anything you do copies references.


But what does this do?:

x = 'xxxxx'
z = 1            # some previously bound int
a = x[2:]
b = z + len(x)
d = {}           # 'd' rather than shadowing the builtin dict
d[a] = b
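
A quick way to answer this sort of question is to test object identity
directly; a minimal sketch:

x = 'xxxxx'
a = x[2:]                 # a new string; x itself is untouched
print a is x              # False

value = [0]
d = {}
d[a] = value              # assignment copies the reference, not the list
value.append(1)
print d[a]                # prints [0, 1]: the dict sees the same list object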


>>  The point here is that this code takes a lot of extra memory.
>>I believe it's the references problem, and I remember complaints
>>from friends facing the same problem. I'm a newbie, yes, but I
>>don't have this problem with Perl. OK, I want to improve my
>>Python knowledge ... :-))
> 
> <long code extract snipped>
> 
>>
>>   The above routine doesn't release the memory back when it
>>exits.
> 
> That's probably because there isn't any memory it can reasonably be 
> expected to release. What memory would *you* expect it to release?

Those 300MB get allocated/reserved when the posted loop gets executed.
When the loop exits, almost all of it is returned/deallocated.
Yes, almost. :(

> 
> The member variables are all still accessible as member variables until you 
> run your loop at the end to clear them, so there is no way Python could 
> release them.

OK, I wanted to know whether some assignment keeps a reference around
that prevents the garbage collector from recycling the memory (for
example, the target dictionary still holding a reference to the
temporary dictionary).
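
That is the pattern I was worried about; a minimal sketch of it (names
made up):

tmp = {'key': 'value'}
holder = {}
holder['snapshot'] = tmp     # holder now keeps the temporary dict alive
del tmp                      # removes only one reference, not the object
holder.clear()               # drops the last reference; now it can be freed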

> 
> Some hints:
> 
> When posting code, try to post complete examples which actually work. I 
> don't know what type the self._dict_on_diskXX variables are supposed to be. 
> It makes a big difference if they are dictionaries (so you are trying to 
> hold everything in memory at one time) or shelve.Shelf objects which would 
> store the values on disc in a reasonably efficient manner.

The self._dict_on_diskXX objects are bsddb files; the self._tmpdictXX are built-in dictionaries.

> 
> Even if they are Shelf objects, I see no reason here why you have to 

I gathered from a previous discussion that it's faster to use bsddb
directly, so no shelve.

> process everything at once. Write a simple function which processes one 
> tmpdict object into one dict_on_disk object and then closes the

That's what I do, but in the for loop ...

> dict_on_disk object. If you want to compare results later then do that by

OK, I got your point.

> reopening the dict_on_disk objects when you have deleted all the tmpdicts.

That's what I do (not shown).

> 
> Extract out everything you want to do into a class which has at most one 
> tmpdict and one dict_on_disk. That way your code will be a lot easier to 
> read.
> 
> Make your code more legible by using fewer underscores.
> 
> What on earth is the point of an explicit call to __add__? If Guido had 
> meant us to use __add__ he wouldn't have created '+'.

To invoke addition directly on the object. It's faster than letting
Python figure out that I'm summing int plus int. It definitely helped
a lot when using Decimal(a) + Decimal(b), where I got rid of thousands
of calls to Decimal.__new__, __init__ and I think a few other methods
of decimal as well; getcontext too, I think.
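
Whether the direct call actually wins for plain ints is worth
measuring; a quick check with timeit (numbers will vary by machine and
Python version):

import timeit

# time the '+' operator against an explicit __add__ call on ints
plus   = timeit.Timer("a + b", "a = 12345; b = 67890").timeit()
dunder = timeit.Timer("a.__add__(b)", "a = 12345; b = 67890").timeit()
print "operator + :", plus
print "__add__    :", dunder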

> What is the purpose of dict_on_disk? Is it for humans to read the data? If 
> not, then don't store everything as a string. Far better to just store a 

It is processed for humans later.

> tuple of your values; then you don't have to use split or cast the strings 

bsddb complains at me that I can store only a string as a key or value.
I'd love to store a tuple.

>>> import bsddb
>>> _words1 = bsddb.btopen('db1.db', 'c')
>>> _words1['a'] = 1    

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.3/bsddb/__init__.py", line 120, in __setitem__
    self.db[key] = value
TypeError: Key and Data values must be of type string or None.

>>>

How can I record a number then?
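
The obvious workaround is to convert on the way in and out, either with
str()/int() or with struct.pack for fixed-width binary records; a
sketch (the file name is made up):

import bsddb
import struct

db = bsddb.btopen('counts.db', 'c')

db['a'] = str(1)                      # store the number as its string form
n = int(db['a'])                      # parse it back on the way out

db['b'] = struct.pack('>q', 12345)    # or pack into an 8-byte binary string
m = struct.unpack('>q', db['b'])[0]

db.close()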


> to integers. If you do want humans to read some final output then produce 
> that separately from the working data files.
> 
> You split out 4 values from dict_on_disk and set three of them to 0. Is 
> that really what you meant, or should you be preserving the previous values?

No, overwrite them, i.e. invalidate them. Originally I recorded only
the first, but computing the latter numbers is so expensive that I have
to store them. As walking through the dictionaries is so slow, I gave
up on the idea of storing just one and then, much later in the program,
walking through the dictionary once again to 'fix' it by computing the
missing values.

> 
> Here is some (untested) code which might help you:
> 
> import shelve

Why shelve? To have the ability to record a tuple? Isn't it cheaper to
convert to a string and back and write to bsddb, compared to shelve's
pickling overhead?
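
For a fixed record of four numbers the manual encoding is trivial; a
sketch, assuming all four fields are integers:

def encode(count, a, b, expected):
    # join the four numbers into one space-separated string for bsddb
    return '%d %d %d %d' % (count, a, b, expected)

def decode(s):
    # split the stored string back into a tuple of ints
    return tuple(map(int, s.split()))

record = encode(10, 0, 0, 0)              # '10 0 0 0', storable in bsddb
count, a, b, expected = decode(record)
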

> 
> def push_to_disc(data, filename):
>     database = shelve.open(filename)
>     try:
>         for key in data:
>             if database.has_key(key):
>                 count, a, b, expected = database[key]
>                 database[key] = count+data[key], a, b, expected
>             else:
>                 database[key] = data[key], 0, 0, 0
>     finally:
>         database.close()
> 
>     data.clear()
> 
> Call that once for each input dictionary and your data will be written out 
> to a disc file and the internal dictionary cleared without any great spike 
> of memory use.
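
So, if I follow, I would call that once per temporary dictionary; a
sketch with hypothetical names:

tmpdict1 = {'word': 5}
tmpdict2 = {'other': 3}

# one push_to_disc() call per in-memory dict, each to its own file
for data, filename in [(tmpdict1, 'words1.db'),
                       (tmpdict2, 'words2.db')]:
    push_to_disc(data, filename)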

Can I use the mmap() feature on bsddb or any .db file? Most of the time
I do updates, not inserts! I don't want to rewrite the whole 300MB file
every time; I want to update it in place. What do I need for that? To
know the maximal length of a string value kept in the .db file? Can I
get rid of locking support in those huge files?
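
As far as I understand, Berkeley DB already updates records in place:
assigning to an existing key rewrites only the affected pages, not the
whole file. A minimal sketch (file name made up):

import bsddb

db = bsddb.btopen('counts.db', 'c')
db['key'] = '10 0 0 0'     # insert
db['key'] = '11 5 7 42'    # update in place; no full-file rewrite
db.sync()                  # flush dirty pages to disk
db.close()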

I can definitely improve my algorithm, but I believe I'll always have
to work with those huge files.
Martin


