High memory usage - program mistake or Python feature?

Steve Holden sholden at holdenweb.com
Fri May 23 09:33:07 EDT 2003


"Ben S" <bens at replytothegroupplease.com> wrote ..
> I wrote a little CGI script that reads in a file like so:
>
> import re
> import string
>
> def LoadLogFile(filename):
>     """Loads a log file as a collection of lines"""
>     try:
>         logFile = file(filename, 'rU')
>         lines = map(string.strip, logFile.readlines())
>     except IOError:
>         return False
>     return lines
>
> Then it processes it with this function a few times:
>
> def GetLinesContainingCommand(lines, commandName):
>     """Find all the lines containing that command in the logs"""
>     pattern = re.compile(" Log \w+: " + commandName + " ")
>     return [eachLine for eachLine in lines if pattern.search(eachLine)]
>
> The 'problem' was that, when operating on a 50MB file, the memory usage
> (according to ps on Linux) rocketed to just over 150MB. Since there's no
> other significant storage in the script, I can only assume that the
> lines (corresponding to strings of between 40 and 90 ASCII characters)
> are being stored in such a way that their size is inflated to 3x their
> usual size. I've not specified any Unicode usage anywhere, nor does the
> text file in question use any characters above 127, as far as I know.
> The GetLinesContainingCommand function returns a tiny subset (no more
> than 20 or 30 lines out of tens of thousands) so I doubt it's that
> causing the problem.
>
> So I guess my question is whether I've coded this inefficiently in terms
> of memory usage, or whether this type of overhead has to be expected?
> I'm pretty new to Python so the former sounds likely. Luckily I will
> rarely be operating on 50MB files, but I'm interested in knowing for any
> future scripts I write.
>

It seems to me more likely that the program is actually storing more than
you realise. When you execute

    lines = map(string.strip, logFile.readlines())

the first thing that happens is the creation of a list containing every line
in your file (50 MB+). Then map() builds a second list of the stripped
lines, probably close to the same size if all you're stripping is trailing
newlines. You are also doing similar things with the list comprehension,
though I'll take your word that the lists it produces are small.

So there's some list overhead, and since each string carries some object
overhead as well, it doesn't seem unreasonable that your program's memory
usage is so high. Python can't deallocate the list returned by readlines()
until the second list has been completely built, so peak usage must account
for both at once.
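
As a back-of-envelope check (the overhead figures below are rough guesses
for a 32-bit CPython build, not measurements):

    50 MB at ~65 bytes per line                   -> roughly 770,000 lines
    770,000 * (65 chars + ~20 bytes str overhead) -> ~65 MB per list
    two such lists alive at once during map()     -> ~130 MB
    plus each list's pointer array, ~3 MB apiece  -> not far off the 150 MB you saw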

There are techniques you can use to reduce memory: one would be simply to
use xreadlines() rather than readlines(), since that reads the file lazily,
a chunk at a time, rather than slurping it all at once. Another is to
iterate over the file object directly, as in

    for line in logFile:
        ...

which you can do in more recent versions of Python (file objects became
iterable in 2.2).
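
For example, LoadLogFile could build its list straight from the file
iterator, so the intermediate list from readlines() never exists. A sketch
(untested, assuming Python 2.3 for the 'rU' mode you're already using):

    def LoadLogFile(filename):
        """Loads a log file as a collection of stripped lines"""
        try:
            logFile = file(filename, 'rU')
            # strip each line as it is read; the raw
            # readlines() list is never built
            lines = [line.strip() for line in logFile]
        except IOError:
            return False
        return lines

That only halves the peak, though: the stripped list itself is still 50 MB
or so.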

Even better, process the lines without stripping them at all if you can! The
search pattern matches in the middle of the line, so a trailing newline does
no harm.
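
Putting the two ideas together: filter while you iterate, and strip only the
survivors. A sketch (untested; note it now takes a filename, so each command
makes its own pass over the file, trading a little extra I/O for a lot less
memory):

    import re

    def GetLinesContainingCommand(filename, commandName):
        """Find all the lines containing that command in the logs"""
        pattern = re.compile(r" Log \w+: " + commandName + " ")
        try:
            logFile = file(filename, 'rU')
        except IOError:
            return False
        # only the 20-30 matching lines are ever stored
        return [line.strip() for line in logFile if pattern.search(line)]

If re-reading the file a few times bothers you, a single pass that collects
the matches for all your commands at once would also do the job.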

regards
--
Steve Holden                                  http://www.holdenweb.com/
Python Web Programming                 http://pydish.holdenweb.com/pwp/
