Python object overhead?

John Machin sjmachin at lexicon.net
Sat Mar 24 03:37:13 EDT 2007


On 24/03/2007 8:11 AM, Matt Garman wrote:
> I'm trying to use Python to work with large pipe ('|') delimited data
> files.  The files range in size from 25 MB to 200 MB.
> 
> Since each line corresponds to a record, what I'm trying to do is
> create an object from each record.

An object with only one attribute and no useful methods seems a little 
pointless; I presume you will elaborate on it later.

>  However, it seems that doing this
> causes the memory overhead to go up two or three times.
> 
> See the two examples below: running each on the same input file
> results in 3x the memory usage for Example 2.  (Memory usage is
> checked using top.)
> 
> This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
> 2.3.4 on CentOS 4.4 (64bit).
> 
> Is this "just the way it is" or am I overlooking something obvious?
> 
> Thanks,
> Matt
> 
> 
> Example 1: read lines into list:
> # begin readlines.py

Interesting name for the file :-)
How about using the file.readlines() method?
Why do you want all 200 MB in memory at once anyway?
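
If the whole file does not have to be in memory, iterating over the file 
object yields one line at a time. A minimal sketch, written for Python 3 
(the scripts in this thread are Python 2); count_records and the sample 
data are made up for illustration:

```python
import io

def count_records(f):
    """Count non-blank lines without keeping them in memory."""
    n = 0
    for line in f:          # a file object yields one line per iteration
        if line.strip():
            n += 1
    return n

# io.StringIO stands in for open(sys.argv[1]) so the sketch is self-contained.
sample = io.StringIO("a|b|c\nd|e|f\n\ng|h|i\n")
total = count_records(sample)
print(total)

# If every line really is needed at once, readlines() replaces the
# hand-written while/readline/append loop with a single call.
sample.seek(0)
all_lines = sample.readlines()
print(len(all_lines))
```

The first form keeps only the current line alive, so memory use should 
stay roughly flat regardless of file size.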

> import sys, time
> filedata = list()
> file = open(sys.argv[1])

You have just clobbered the builtin file() function/type. In this case 
it doesn't matter, but you should lose the habit, quickly.
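
A small sketch of how a shadowed builtin can bite later on. It is written 
for Python 3, where file is no longer a builtin, so list plays the victim 
here; describe and its argument are hypothetical names:

```python
def describe(lines):
    list = lines                  # shadows the builtin list inside this function
    try:
        return list("abc")        # intended: the builtin; actual: our own data
    except TypeError as exc:
        return "TypeError: %s" % exc

message = describe(["a|b", "c|d"])
print(message)
```

The call fails because the name now refers to a list instance rather than 
to the type.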

> while True:
>    line = file.readline()
>    if len(line) == 0: break # EOF
>    filedata.append(line)
> file.close()
> print "data read; sleeping 20 seconds..."
> time.sleep(20) # gives time to check top

How about using raw_input('Hit the Any key...') ?

> # end readlines.py
> 
> 
> Example 2: read lines into objects:
> # begin readobjects.py
> import sys, time
> class FileRecord:
>    def __init__(self, line):
>        self.line = line
> records = list()
> file = open(sys.argv[1])
> while True:
>    line = file.readline()
>    if len(line) == 0: break # EOF
>    rec = FileRecord(line)
>    records.append(rec)
> file.close()
> print "data read; sleeping 20 seconds..."
> time.sleep(20) # gives time to check top
> # end readobjects.py

After all that, you still need to split the lines into the more-than-one 
fieldS (plural) that one would expect in a record.
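
The split step on its own could look like the following sketch (Python 3 
syntax; parse_line and the three-field sample are illustrative, not taken 
from Matt's data):

```python
def parse_line(line, delimiter="|"):
    """Split one record into its fields, dropping the trailing newline."""
    return line.rstrip("\r\n").split(delimiter)

fields = parse_line("Biff|Bloggs|biff@example.com\n")
print(fields)
```

Note that str.split keeps empty fields, so "a||c" gives three items, the 
middle one being an empty string.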

A possibly faster alternative to the combination of 
(fastest_line_reader_so_far, line.split('|')) is to use the csv module, 
as in the following example, which also shows one way of making an 
object out of a row of data.

C:\junk>type readpipe.py
import sys, csv

class Contacts(object):
     __slots__ = ['first', 'family', 'email']
     def __init__(self, row):
         for attrname, value in zip(self.__slots__, row):
             setattr(self, attrname, value)

def readpipe(fname):
     if hasattr(fname, 'read'):
         f = fname
     else:
         f = open(fname, 'rb')
     # 'b' is in case you'd like your script to be portable
     reader = csv.reader(
         f,
         delimiter='|',
         quoting=csv.QUOTE_NONE,
         # Set quotechar to a char that you don't expect in your data
         # e.g. the ASCII control char BEL (0x07). This is necessary
         # for Python 2.3, whose csv module used the quoting arg only when
         # writing, otherwise your " characters may get stripped off.
         quotechar='\x07',
         skipinitialspace=True,
         )
     for row in reader:
         if row == ['']: # blank line
             continue
         c = Contacts(row)
         # do something useful with c, e.g.
         print [(x, getattr(c, x)) for x in dir(c)
                 if not x.startswith('_')]

if __name__ == '__main__':
     if sys.argv[1:2]:
         readpipe(sys.argv[1])
     else:
         print '*** Testing ***'
         import cStringIO
         readpipe(cStringIO.StringIO('''\
             Biff|Bloggs|b1ff at aol.com
             Joseph ("Joe")|Blow|jblow at acoy.com
             "Joe"|Blow|jblow at acoy.com

             Santa|Claus|sclaus at northpole.org
             '''))

C:\junk>\python23\python readpipe.py
*** Testing ***
[('email', 'b1ff at aol.com'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', 'jblow at acoy.com'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', 'jblow at acoy.com'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', 'sclaus at northpole.org'), ('family', 'Claus'), ('first', 'Santa')]

C:\junk>\python25\python readpipe.py
*** Testing ***
[('email', 'b1ff at aol.com'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', 'jblow at acoy.com'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', 'jblow at acoy.com'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', 'sclaus at northpole.org'), ('family', 'Claus'), ('first', 'Santa')]

C:\junk>
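
On the original memory question: the __slots__ declaration in Contacts 
means its instances carry no per-instance __dict__, and that dict accounts 
for a good part of the per-object overhead of an ordinary class. A rough 
way to see the difference, sketched for Python 3 (sys.getsizeof did not 
exist in the 2.3 to 2.5 versions used above). PlainRecord, SlottedRecord 
and shallow_size are hypothetical names, and the byte counts vary by 
Python version and platform, so none are quoted here:

```python
import sys

class PlainRecord(object):
    def __init__(self, line):
        self.line = line

class SlottedRecord(object):
    __slots__ = ['line']          # no per-instance __dict__ is created
    def __init__(self, line):
        self.line = line

def shallow_size(obj):
    """Size of the instance plus its __dict__, if any (the string is excluded)."""
    size = sys.getsizeof(obj)
    if hasattr(obj, '__dict__'):
        size += sys.getsizeof(obj.__dict__)
    return size

line = "Biff|Bloggs|biff@example.com\n"
plain_size = shallow_size(PlainRecord(line))
slotted_size = shallow_size(SlottedRecord(line))
print(plain_size, slotted_size)
```

This measures only the objects themselves; top reports the whole process, 
so the two figures will not line up exactly.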

HTH,
John


