Python object overhead?
Bruno Desthuilliers
bruno.42.desthuilliers at wtf.websiteburo.oops.com
Mon Mar 26 05:36:38 EDT 2007
Matt Garman wrote:
> I'm trying to use Python to work with large pipe ('|') delimited data
> files.
Looks like a job for the csv module (in the standard lib).
> The files range in size from 25 MB to 200 MB.
>
> Since each line corresponds to a record, what I'm trying to do is
> create an object from each record. However, it seems that doing this
> causes the memory overhead to go up two or three times.
>
> See the two examples below: running each on the same input file
> results in 3x the memory usage for Example 2. (Memory usage is
> checked using top.)
Just for the record, *everything* in Python is an object - so the
problem is not about 'using objects'. Now of course, a complex object
may eat up more space than a simple one...
Python has two simple types for structured data: tuples (like database
rows) and dicts (associative arrays). You can use the csv module to
parse a csv-like format into either tuples or dicts. If you want to
save memory, tuples are the better choice.
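A quick sketch of the csv-module approach (the sample data here is
made up for illustration; a real script would pass an open file object
instead of the StringIO stand-in):

```python
import csv
import io

# A small pipe-delimited sample standing in for the real data file.
sample = "alice|42|engineer\nbob|7|analyst\n"

# Parse each record into a plain tuple; tuples carry less per-record
# overhead than dicts or per-record class instances.
reader = csv.reader(io.StringIO(sample), delimiter='|')
records = [tuple(row) for row in reader]

print(records[0])  # ('alice', '42', 'engineer')
```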
> This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
> 2.3.4 on CentOS 4.4 (64bit).
>
> Is this "just the way it is" or am I overlooking something obvious?
What are you doing with your records? Do you *really* need to keep the
whole list in memory? Otherwise you can just work line by line:
source = open(sys.argv[1])
for line in source:
    do_something_with(line)
source.close()
This will avoid building a huge in-memory list.
While we're at it, your snippets are definitely unpythonic and
overcomplicated:
(snip)
> filedata = list()
> file = open(sys.argv[1])
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     filedata.append(line)
> file.close()
(snip)
filedata = open(sys.argv[1]).readlines()
> Example 2: read lines into objects:
> # begin readobjects.py
> import sys, time
> class FileRecord:
class FileRecord(object):
>     def __init__(self, line):
>         self.line = line
If this is your real code, I don't see any reason why it should eat up
three times more space than the original version.
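As a side note (not something in the original snippets): if the
per-instance overhead really is the problem, declaring __slots__ on the
class suppresses the per-instance __dict__, which shrinks memory use
considerably when you hold millions of records. A minimal sketch:

```python
class FileRecord(object):
    # __slots__ replaces the per-instance __dict__ with fixed storage
    # for the named attributes, cutting per-record memory overhead.
    __slots__ = ('line',)

    def __init__(self, line):
        self.line = line

rec = FileRecord("a|b|c\n")
print(rec.line)
```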
> records = list()
> file = open(sys.argv[1])
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     rec = FileRecord(line)
>     records.append(rec)
> file.close()
records = map(FileRecord, open(sys.argv[1]).readlines())
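In modern Python, where map returns a lazy iterator, the same idea is
usually written as a list comprehension over the open file, which also
skips the intermediate readlines() list (the StringIO below is just a
stand-in for open(sys.argv[1])):

```python
import io

class FileRecord(object):
    def __init__(self, line):
        self.line = line

# Stand-in for the real file; a script would use open(sys.argv[1]).
source = io.StringIO("first|record\nsecond|record\n")

# One FileRecord per line, built while iterating the file directly.
records = [FileRecord(line) for line in source]

print(len(records))  # 2
```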