Lists, tuples and memory.

Thu Jul 15 15:09:47 EDT 2004

Hi, all!

Here is the problem:
I have a file, which contains a common dictionary - one word per line
(appr. 700KB and 70000 words). I have to read it in memory for future
"spell checking" of the words comming from the customer. The file is
presorted. So here it goes:

lstdict = map(lambda x: x.lower().strip(),
file("D:\\CommonDictionary.txt"))

Works like a charm. It takes on my machine 0.7 seconds to do the trick
and python.exe (from task manager data) was using before this line is
executed
2636K, and after 5520K. The difference is 2884K. Not that bad, taking
into account that in C I'd read the file in memory (700K) scan for
CRs, count them, replace with '\0' and allocate the index vector of
the word beginnigs of the size I found while counting CRs. In this
particular case index vector would be almost 300K. So far so good!

Then I realized, that lstdict as a list is an overkill. Tuple is
enough in my case. So I modified the code:

t = tuple(file("D:\\CommonDictionary.txt"))
lstdict = map(lambda x: x.lower().strip(), t)

This code works a little bit faster: 0.5 sec, but takes 5550K memory.
And maybe this is understandable: after all the first line creates a
list and a tuple and the second another tuple (all of the same size).

But then I decieded to compare this pieces:

t = tuple(file("D:\\CommonDictionary.txt"))        # 1
lstdict = map(lambda x: x.lower().strip(), t)      # 2

lstdict = map(lambda x: x.lower().strip(),
file("D:\\CommonDictionary.txt")) # 3

As expected, after line 2 memory was 5550K, but after line 3 it jumped
to 7996K!!!

The question:

If refference counting is used, why the second assignment to lstdict
(line 3) did not free the memory alocated by the first one (line 2)
and reuse it?

So one more experiment:

t = tuple(file("D:\\CommonDictionary.txt"))        # 1
lstdict = map(lambda x: x.lower().strip(), t)      # 2
del lstdict                                        # 3
lstdict = map(lambda x: x.lower().strip(),
file("D:\\CommonDictionary.txt")) # 4

In this case executing line 4 did not add memory!

By the way, the search speed is very-very good when done by function:

def inDict(lstdict, word):
    try:
        return lstdict[bisect(lstdict, word) - 1] == word
    except:
        return False

500 words are tested in less then 0.03 seconds.