Python object overhead?

Jack Diederich jackdied at jackdied.com
Fri Mar 23 18:04:59 EDT 2007


On Fri, Mar 23, 2007 at 03:11:35PM -0600, Matt Garman wrote:
> I'm trying to use Python to work with large pipe ('|') delimited data
> files.  The files range in size from 25 MB to 200 MB.
> 
> Since each line corresponds to a record, what I'm trying to do is
> create an object from each record.  However, it seems that doing this
> causes the memory overhead to go up two or three times.
> 
> See the two examples below: running each on the same input file
> results in 3x the memory usage for Example 2.  (Memory usage is
> checked using top.)
[snip]

When you are just appending all the lines to a big list, your overhead
looks like:

records = []
for line in file_ob:
  records.append(line)

But when you wrap each line in a small class, the overhead is more like:

records = []
for line in file_ob:
  records.append(line) # the actual string
  records.append(object()) # allocation for the object instance
  records.append({}) # dictionary for per instance attributes

For small strings like dictionary words, the second version uses about
5x the memory of the plain list.  Most of that is the per-instance
dictionary.

If you make the record a new-style class (inherit from object) you can
define the __slots__ attribute on the class.  This eliminates the per
instance dictionary in exchange for less flexibility: instances can
only have the attributes named in __slots__.
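
For example (a sketch; the field names here are invented):

class Record(object):
  __slots__ = ('first', 'last', 'phone')  # invented field names
  def __init__(self, line):
    self.first, self.last, self.phone = line.split('|')

rec = Record('John|Doe|555-0100')
# rec has no __dict__; each instance pays only for the slot pointers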

Another solution is to wrap the lines only as they are accessed.
Make one class that holds the collection of raw records and have it
return a fancy wrapper object for each record right before it is used;
the wrapper is discarded afterward.

class RecordCollection(object):
  def __init__(self, raw_records):
    self.raw_records = raw_records   # plain strings, cheap to hold
  def __getitem__(self, i):
    return Record(self.raw_records[i])  # wrap lazily, on demand

If you operate on the records serially, you only pay the class
overhead for one record at any given time.
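
Filling in a hypothetical Record class to make that concrete (the
file name and field handling are invented for illustration):

class Record(object):
  __slots__ = ('fields',)
  def __init__(self, raw):
    self.fields = raw.rstrip('\n').split('|')

raw_records = open('data.txt').readlines()  # hypothetical input file
records = RecordCollection(raw_records)
for rec in records:      # __getitem__ drives iteration; IndexError stops it
  print(rec.fields[0])   # only one Record object is alive at a time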

Hope that helps,

-Jack


