Lists, tuples and memory.

John Lenton jlenton at gmail.com
Fri Jul 16 10:04:13 EDT 2004


On 15 Jul 2004 12:09:47 -0700, Elbert Lev <elbertlev at hotmail.com> wrote:
> Hi, all!
> 
> Here is the problem:
> I have a file which contains a common dictionary - one word per line
> (approx. 700KB and 70000 words). I have to read it into memory for
> future "spell checking" of the words coming from the customer. The
> file is presorted. So here it goes:
> 
> lstdict = map(lambda x: x.lower().strip(),
> file("D:\\CommonDictionary.txt"))
> 
> Works like a charm. It takes 0.7 seconds on my machine to do the
> trick, and python.exe (from Task Manager data) was using 2636K before
> this line executed, and 5520K after. The difference is 2884K. Not that
> bad, taking into account that in C I'd read the file into memory
> (700K), scan for CRs, count them, replace them with '\0', and allocate
> an index vector of the word beginnings of the size found while
> counting CRs. In this particular case the index vector would be almost
> 300K. So far so good!
> 
> Then I realized that lstdict as a list is overkill. A tuple is
> enough in my case. So I modified the code:
> 
> t = tuple(file("D:\\CommonDictionary.txt"))
> lstdict = map(lambda x: x.lower().strip(), t)
> 
> This code works a little bit faster: 0.5 sec, but takes 5550K of
> memory. And maybe this is understandable: after all, the first line
> creates a tuple and the second creates a list of the same size.

any reason you didn't do it as

    lstdict = tuple([x.strip().lower() for x in file("D:\\CommonDictionary.txt")])

?
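
(The comprehension still builds an intermediate list, but that list
becomes garbage as soon as tuple() has copied it, so you end up
holding one full-size sequence instead of two.)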

also, since the file is presorted, you might find the bisect module
speeds up your searching, if you find using a dict too expensive.
That's assuming the bisect module is written in C, which I haven't
checked.
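
Something along these lines, say (a minimal, untested sketch; it
assumes lstdict is the sorted, lowercased tuple built above, and
is_word is just a name I made up):

    import bisect

    def is_word(word, words):
        # binary search on the presorted sequence; O(log n) per lookup
        w = word.lower().strip()
        i = bisect.bisect_left(words, w)
        return i < len(words) and words[i] == w

    # e.g. is_word("Hello", lstdict) -> True if "hello" is in the list

A dict (used as a set) would trade that log-time lookup for roughly
constant time, at the cost of building a hash table over all 70000
words.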

-- 
John Lenton (jlenton at gmail.com) -- Random fortune:
bash: fortune: command not found


