Python object overhead?
Bruno Desthuilliers
bruno.42.desthuilliers at wtf.websiteburo.oops.com
Mon Mar 26 05:36:38 EDT 2007
Matt Garman wrote:
> I'm trying to use Python to work with large pipe ('|') delimited data
> files.
Looks like a job for the csv module (in the standard lib).
> The files range in size from 25 MB to 200 MB.
>
> Since each line corresponds to a record, what I'm trying to do is
> create an object from each record. However, it seems that doing this
> causes the memory overhead to go up two or three times.
>
> See the two examples below: running each on the same input file
> results in 3x the memory usage for Example 2. (Memory usage is
> checked using top.)
Just for the record, *everything* in Python is an object - so the
problem is not about 'using objects'. Now of course, a complex object
may eat up more space than a simple one...
Python has two simple types for structured data: tuples (like database
rows) and dicts (associative arrays). You can use the csv module to
parse a csv-like format into either tuples or dicts. If you want to
save memory, tuples are the better choice.
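A quick sketch of the csv-module approach (the sample data here is
made up for illustration; a real script would pass an open file object
instead of the StringIO stand-in):

```python
import csv
import io

# A small pipe-delimited sample standing in for the real data file.
sample = "alice|42|engineer\nbob|7|analyst\n"

# Parse each record into a plain tuple; tuples carry less per-record
# overhead than dicts or per-record class instances.
reader = csv.reader(io.StringIO(sample), delimiter='|')
records = [tuple(row) for row in reader]

print(records[0])  # ('alice', '42', 'engineer')
```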
> This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
> 2.3.4 on CentOS 4.4 (64bit).
>
> Is this "just the way it is" or am I overlooking something obvious?
What are you doing with your records? Do you *really* need to keep the
whole list in memory? Otherwise you can just work line by line:
source = open(sys.argv[1])
for line in source:
    do_something_with(line)
source.close()
This will avoid building a huge in-memory list.
While we're at it, your snippets are definitely unpythonic and
overcomplicated:
(snip)
> filedata = list()
> file = open(sys.argv[1])
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     filedata.append(line)
> file.close()
(snip)
filedata = open(sys.argv[1]).readlines()
> Example 2: read lines into objects:
> # begin readobjects.py
> import sys, time
> class FileRecord:
class FileRecord(object):
>     def __init__(self, line):
>         self.line = line
If this is your real code, I don't see any reason why it should eat up
three times more space than the original version.
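As a side note (not something in the original snippets): if the
per-instance overhead really is the problem, declaring __slots__ on the
class suppresses the per-instance __dict__, which shrinks memory use
considerably when you hold millions of records. A minimal sketch:

```python
class FileRecord(object):
    # __slots__ replaces the per-instance __dict__ with fixed storage
    # for the named attributes, cutting per-record memory overhead.
    __slots__ = ('line',)

    def __init__(self, line):
        self.line = line

rec = FileRecord("a|b|c\n")
print(rec.line)
```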
> records = list()
> file = open(sys.argv[1])
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     rec = FileRecord(line)
>     records.append(rec)
> file.close()
records = map(FileRecord, open(sys.argv[1]).readlines())
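In modern Python, where map returns a lazy iterator, the same idea is
usually written as a list comprehension over the open file, which also
skips the intermediate readlines() list (the StringIO below is just a
stand-in for open(sys.argv[1])):

```python
import io

class FileRecord(object):
    def __init__(self, line):
        self.line = line

# Stand-in for the real file; a script would use open(sys.argv[1]).
source = io.StringIO("first|record\nsecond|record\n")

# One FileRecord per line, built while iterating the file directly.
records = [FileRecord(line) for line in source]

print(len(records))  # 2
```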