Writing huge Sets() to disk

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Mon Jan 17 05:53:09 EST 2005


Hi,
  could someone tell me what does and what doesn't copy references
in Python? I have found that my script, after reaching some state
and taking, say, 600MB, pushes its internal dictionaries to hard
disk. The for loop consumes another 300MB (as observed with vmstat)
while pushing the data to the on-disk dictionaries, then releases a
bit less than 300MB, and the program starts filling up its internal
dictionaries again; when they are "full", it does the flush
again ...
 
  The point here is that this code takes a lot of extra memory.
I believe it's the references problem, and I remember complaints
from friends facing the same problem. I'm a newbie, yes, but I
don't have this problem with Perl. OK, I want to improve my Python
knowledge ... :-))
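
For example, here is my understanding of what copies and what
doesn't (a minimal sketch, not from my script; the names are made
up):

    a = ['x', 'y']
    b = a            # no copy: b is just another name for the same list
    b.append('z')    # ... so the change is visible through a as well
    c = a[:]         # shallow copy: a new list, same element references

    d = {}
    s = ' '.join(('1', '0', '0', '0'))   # join builds one new string
    d['key'] = s     # the dict stores a reference to s, not a second
                     # copy of the characters

As far as I know, storing into a bsddb-backed "dictionary" is
different: the value is written out as bytes to the database file,
so no in-memory reference to the string is kept.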




    def push_to_disk(self):
        # pair each in-memory dict with its on-disk counterpart;
        # sizes of these tmpdicts range from 10-10000 entries each!
        _tmpdicts = (self._tmpdict1, self._tmpdict2, self._tmpdict3,
                     self._tmpdict4, self._tmpdict5, self._tmpdict6,
                     self._tmpdict7, self._tmpdict8, self._tmpdict9,
                     self._tmpdict10, self._tmpdict11, self._tmpdict12,
                     self._tmpdict13, self._tmpdict14, self._tmpdict15,
                     self._tmpdict16, self._tmpdict17, self._tmpdict18,
                     self._tmpdict19, self._tmpdict20)
        _disk_dicts = (self._dict_on_disk1, self._dict_on_disk2,
                       self._dict_on_disk3, self._dict_on_disk4,
                       self._dict_on_disk5, self._dict_on_disk6,
                       self._dict_on_disk7, self._dict_on_disk8,
                       self._dict_on_disk9, self._dict_on_disk10,
                       self._dict_on_disk11, self._dict_on_disk12,
                       self._dict_on_disk13, self._dict_on_disk14,
                       self._dict_on_disk15, self._dict_on_disk16,
                       self._dict_on_disk17, self._dict_on_disk18,
                       self._dict_on_disk19, self._dict_on_disk20)

        for _tmpdict, _dict_on_disk in zip(_tmpdicts, _disk_dicts):
            if not _tmpdict:
                continue
            for _word, _value in _tmpdict.iteritems():
                try:
                    _string = _dict_on_disk[_word]
                    # only the first field is used; _a, _b and
                    # _expected_freq are discarded -- maybe
                    # _string.find(' ') combined with a slice would
                    # do better?
                    _abs_count, _a, _b, _expected_freq = _string.split()
                    _t = (str(int(_abs_count) + _value), '0', '0', '0')
                except KeyError:
                    _t = (str(_value), '0', '0', '0')

                # this writes a copy to the on-disk dict, right?
                _dict_on_disk[_word] = ' '.join(_t)

        #
        # clear the temporary dictionaries in ourself;
        # I think this works as expected and really does release memory
        #
        for _tmpdict in _tmpdicts:
            _tmpdict.clear()




   The above routine doesn't release the memory back when it
exits.
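
If I understand correctly, this need not even be a reference leak:
CPython's allocator (at least in 2.4 and earlier) pools freed small
objects for reuse instead of returning the pages to the OS, so
vmstat keeps showing the high-water mark. A minimal sketch of what
I mean (the numbers are made up):

    import gc

    d = {}
    for i in xrange(1000000):
        d[str(i)] = '1 0 0 0'

    d.clear()      # the entries are freed as Python objects ...
    gc.collect()   # ... and any cycles are collected, but the
                   # interpreter may keep the freed memory pooled for
                   # reuse rather than hand it back to the OS, so the
                   # process size stays at its peak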



   See, the loop already takes 25 minutes, and it keeps getting
slower; the program is only at about 1/3 or 1/4 of the total input.
The rest of my code is fast in contrast to this (below 1 minute).
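
Regarding the comment in the code above: splitting off just the
first field should indeed be cheaper than unpacking all four (a
small sketch, assuming the absolute count is always the first
space-separated field; _first_field is a made-up helper):

    def _first_field(s):
        # everything before the first space; avoids building four
        # substring objects when only one is needed
        return s.split(' ', 1)[0]

    # e.g. int(_first_field('42 0 0 0')) + 8  ->  50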

-rw-------  1 mmokrejs users 257376256 Jan 17 11:38 diskdict12.db
-rw-------  1 mmokrejs users 267157504 Jan 17 11:35 diskdict11.db
-rw-------  1 mmokrejs users 266534912 Jan 17 11:28 diskdict10.db
-rw-------  1 mmokrejs users 253149184 Jan 17 11:21 diskdict9.db
-rw-------  1 mmokrejs users 250232832 Jan 17 11:14 diskdict8.db
-rw-------  1 mmokrejs users 246349824 Jan 17 11:07 diskdict7.db
-rw-------  1 mmokrejs users 199999488 Jan 17 11:02 diskdict6.db
-rw-------  1 mmokrejs users  66584576 Jan 17 10:59 diskdict5.db
-rw-------  1 mmokrejs users   5750784 Jan 17 10:57 diskdict4.db
-rw-------  1 mmokrejs users    311296 Jan 17 10:57 diskdict3.db
-rw-------  1 mmokrejs users 295895040 Jan 17 10:56 diskdict20.db
-rw-------  1 mmokrejs users 293634048 Jan 17 10:49 diskdict19.db
-rw-------  1 mmokrejs users 299892736 Jan 17 10:43 diskdict18.db
-rw-------  1 mmokrejs users 272334848 Jan 17 10:36 diskdict17.db
-rw-------  1 mmokrejs users 274825216 Jan 17 10:30 diskdict16.db
-rw-------  1 mmokrejs users 273104896 Jan 17 10:23 diskdict15.db
-rw-------  1 mmokrejs users 272678912 Jan 17 10:18 diskdict14.db
-rw-------  1 mmokrejs users 260407296 Jan 17 10:13 diskdict13.db

    Some have spoken about mmap()ed files. Could I take advantage
of that with the bsddb module or gdbm?
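
For instance, the low-level bsddb.db interface can at least enlarge
the shared memory pool Berkeley DB uses as its cache (a sketch,
assuming the pybsddb db module is available; the file names are
made up, and the environment directory must already exist):

    from bsddb import db

    _env = db.DBEnv()
    _env.set_cachesize(0, 64 * 1024 * 1024)   # 64MB memory pool cache
    # initialize only the memory pool: no DB_INIT_LOCK, so no locking
    _env.open('/tmp/dbenv',
              db.DB_CREATE | db.DB_INIT_MPOOL | db.DB_PRIVATE)

    _d = db.DB(_env)
    _d.open('diskdict1.db', None, db.DB_HASH, db.DB_CREATE)
    _d['some word'] = '1 0 0 0'   # DB objects support dict-style access
    _d.close()
    _env.close()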

    Is gdbm better in some ways? Recently you have said dictionary
operations are fast ... Once more: I want to turn off locking
support. I can make the values strings of a fixed size, if mmap()
would be available. The number of keys doesn't grow much over time;
mostly there are only updates.
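
And with gdbm from the stdlib, the database can at least be opened
in "fast" mode so that writes are not synced on every update (a
sketch; whether locking can also be disabled with a 'u' flag
depends on the gdbm version):

    import gdbm

    _d = gdbm.open('diskdict1.gdbm', 'cf')   # 'c' create, 'f' fast:
                                             # no sync per write
    _d['some word'] = '1 0 0 0'
    _d.sync()     # flush explicitly before closing
    _d.close()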

    

Thanks for any ideas.
martin


