Writing huge Sets() to disk
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Mon Jan 17 07:32:27 EST 2005
Duncan Booth wrote:
> Martin MOKREJŠ wrote:
>
>
>>Hi,
>> could someone tell me what does and what doesn't copy
>>references in Python? I have found that my script, after reaching some
>>state and taking say 600MB, pushes its internal dictionaries
>>to hard disk. The for loop consumes another 300MB (as gathered
>>by vmstat) to push the data to dictionaries, then releases
>>a little less than 300MB, and the program starts to fill up
>>its internal dictionaries again; when "full" it will do the
>>flush again ...
>
>
> Almost anything you do copies references.
But what does this do?
x = 'xxxxx'
a = x[2:]
b = z + len(x)
dict[a] = b
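
A minimal sketch of what the four lines above do in terms of references, assuming a current CPython. The value given to z and the name d (used instead of dict, to avoid shadowing the built-in) are assumptions for illustration only; z is not defined in the snippet above.

```python
x = 'xxxxx'
a = x[2:]          # slicing builds a new, shorter string object
z = 10             # hypothetical value; z is undefined in the original snippet
b = z + len(x)     # the addition builds a new int object
d = {}             # stands in for the dictionary called dict above
d[a] = b           # the dictionary stores references to a and b, not copies

alias = d          # plain assignment copies only the reference
print(alias is d)  # True: both names refer to the same dictionary
print(d[a] is b)   # True: the stored value is the very object bound to b
```

So the slice and the sum each allocate one new object, while the assignment into the dictionary only adds references to them.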
>> The point here is that this code takes a lot of extra memory.
>>I believe it's the references problem, and I remember complaints
>>from friends facing the same problem. I'm a newbie, yes, but I don't
>>have this problem with Perl. OK, I want to improve my Python
>>knowledge ... :-))
>>
>>
>>
>>
>
> <long code extract snipped>
>
>>
>> The above routine doesn't release the memory back when it
>>exits.
>
> That's probably because there isn't any memory it can reasonably be
> expected to release. What memory would *you* expect it to release?
Those 300MB; they get allocated/reserved when the posted loop gets
executed. When the loop exits, almost all of it is returned/deallocated.
Yes, almost. :(
>
> The member variables are all still accessible as member variables until you
> run your loop at the end to clear them, so no way could Python release
> them.
OK, I wanted to know if there's some assignment using a reference
which prevents the internal garbage collector from recycling the memory,
as, for example, when the target dictionary still keeps a reference to
the temporary dictionary.
>
> Some hints:
>
> When posting code, try to post complete examples which actually work. I
> don't know what type the self._dict_on_diskXX variables are supposed to be.
> It makes a big difference if they are dictionaries (so you are trying to
> hold everything in memory at one time) or shelve.Shelf objects which would
> store the values on disc in a reasonably efficient manner.
The self._dict_on_diskXX are bsddb files, self._tmpdictXX are builtin dictionaries.
>
> Even if they are Shelf objects, I see no reason here why you have to
I gathered from a previous discussion that it's faster to use bsddb directly,
so no shelve.
> process everything at once. Write a simple function which processes one
> tmpdict object into one dict_on_disk object and then closes the
That's what I do, but in the for loop ...
> dict_on_disk object. If you want to compare results later then do that by
OK, I got your point.
> reopening the dict_on_disk objects when you have deleted all the tmpdicts.
That's what I do (not shown).
>
> Extract out everything you want to do into a class which has at most one
> tmpdict and one dict_on_disk. That way your code will be a lot easier to
> read.
>
> Make your code more legible by using fewer underscores.
>
> What on earth is the point of an explicit call to __add__? If Guido had
> meant us to use __add__ he wouldn't have created '+'.
To invoke addition directly on the object. It's faster than letting
Python figure out that I sum up int() plus int(). It definitely
helped a lot when using Decimal(a) + Decimal(b), where I got rid
of thousands of calls to Decimal(__new__), __init__ and, I think, a few
other methods of decimal as well - I think getcontext too.
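
Whether an explicit __add__ call really beats '+' is easy to measure rather than assume. The sketch below is one way to check it with timeit on a current Python; the operand values and the repeat count are arbitrary assumptions, and the timings will differ between interpreters and versions. It also shows one behavioural difference: calling __add__ directly skips the fallback to the other operand's __radd__.

```python
import timeit

a, b = 3, 4  # arbitrary sample operands

# Time both spellings of the same addition.
names = {'a': a, 'b': b}
plus_time = timeit.timeit('a + b', globals=names, number=200000)
dunder_time = timeit.timeit('a.__add__(b)', globals=names, number=200000)
print('a + b        :', plus_time)
print('a.__add__(b) :', dunder_time)

# The operator falls back to float.__radd__ here; the direct call does not.
mixed_operator = 1 + 1.5
mixed_direct = (1).__add__(1.5)
print(mixed_operator, mixed_direct)
```

For mixed types the direct call returns NotImplemented instead of a number, so it is only a drop-in replacement when both operands are known to be of the same type.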
> What is the purpose of dict_on_disk? Is it for humans to read the data? If
> not, then don't store everything as a string. Far better to just store a
The output for humans is produced later.
> tuple of your values then you don't have to use split or cast the strings
bsddb complains that I can store only a string as a key or value.
I'd love to store a tuple.
>>> import bsddb
>>> _words1 = bsddb.btopen('db1.db', 'c')
>>> _words1['a'] = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.3/bsddb/__init__.py", line 120, in __setitem__
    self.db[key] = value
TypeError: Key and Data values must be of type string or None.
>>>
How can I record a number then?
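
Two common ways to record numbers in a store that accepts only strings are to write them as text or to pack them into a fixed-width binary record. The sketch below illustrates both on a current Python, using the standard dbm.dumb module as a stand-in for bsddb (an assumption, since bsddb is no longer in the standard library). The record layout of four unsigned 32-bit integers and the file name are hypothetical.

```python
import dbm.dumb
import os
import struct
import tempfile

# Hypothetical layout: four unsigned 32-bit integers in 16 bytes.
RECORD = struct.Struct('<4I')

path = os.path.join(tempfile.mkdtemp(), 'words1')
db = dbm.dumb.open(path, 'c')
try:
    db[b'a'] = str(1).encode('ascii')     # simplest: the number as text
    db[b'b'] = RECORD.pack(7, 0, 0, 0)    # a tuple as a fixed-width record
    as_text = int(db[b'a'].decode('ascii'))
    as_tuple = RECORD.unpack(db[b'b'])
finally:
    db.close()

print(as_text)   # 1
print(as_tuple)  # (7, 0, 0, 0)
```

The packed form avoids split and int conversions when reading and keeps every value the same length, at the cost of not being human-readable.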
> to integers. If you do want humans to read some final output then produce
> that separately from the working data files.
>
> You split out 4 values from dict_on_disk and set three of them to 0. Is
> that really what you meant or should you be preserving the previous values?
No, overwrite them, i.e. invalidate them. Originally I recorded only the first,
but computing the latter numbers is so expensive that I have to store them.
As walking through the dictionaries is so slow, I gave up on the idea of
storing just one and, a lot later in the program, walking through the
dictionary once again and 'fixing' it by computing the missing values.
>
> Here is some (untested) code which might help you:
>
> import shelve
Why shelve? To have the ability to record a tuple? Isn't it cheaper
to convert to string and back and write to bsddb, compared to this overhead?
>
> def push_to_disc(data, filename):
>     database = shelve.open(filename)
>     try:
>         for key in data:
>             if database.has_key(key):
>                 count, a, b, expected = database[key]
>                 database[key] = count+data[key], a, b, expected
>             else:
>                 database[key] = data[key], 0, 0, 0
>     finally:
>         database.close()
>
>     data.clear()
>
> Call that once for each input dictionary and your data will be written out
> to a disc file and the internal dictionary cleared without any great spike
> of memory use.
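
On a current Python 3, where has_key no longer exists, the quoted function could be sketched as follows. This is an adaptation under assumptions, not the code from the thread: the temporary file name and the sample batches are hypothetical, and shelve picks whichever dbm backend is available.

```python
import os
import shelve
import tempfile

def push_to_disc(data, filename):
    """Merge the counts in data into the shelf at filename, then empty data."""
    # The with block closes the shelf even if an update fails.
    with shelve.open(filename) as database:
        for key, count in data.items():
            if key in database:
                old_count, a, b, expected = database[key]
                database[key] = (old_count + count, a, b, expected)
            else:
                database[key] = (count, 0, 0, 0)
    data.clear()

# Hypothetical usage: flush two small batches into the same shelf.
filename = os.path.join(tempfile.mkdtemp(), 'counts')
push_to_disc({'abc': 2}, filename)
batch = {'abc': 3, 'xyz': 1}
push_to_disc(batch, filename)

with shelve.open(filename) as database:
    result = dict(database)
print(result)
```

Each call touches only the keys present in the batch, so existing records are updated in place by the backend rather than the whole file being rewritten by this code.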
Can I use the mmap() feature on bsddb or any .db file? Most of the time I do
updates, not inserts! I don't want to rewrite a 300MB file all the time.
I want to update it. What do I need for that? To know the maximal length of a
string value kept in the .db file? Can I get rid of locking support in those
huge files?
Definitely I can improve my algorithm. But I believe I'll always have to work
with those huge files.
Martin