Python object overhead?
John Machin
sjmachin at lexicon.net
Sat Mar 24 03:37:13 EDT 2007
On 24/03/2007 8:11 AM, Matt Garman wrote:
> I'm trying to use Python to work with large pipe ('|') delimited data
> files. The files range in size from 25 MB to 200 MB.
>
> Since each line corresponds to a record, what I'm trying to do is
> create an object from each record.
An object with only 1 attribute and no useful methods seems a little
pointless; I presume you will elaborate it later.
> However, it seems that doing this
> causes the memory overhead to go up two or three times.
>
> See the two examples below: running each on the same input file
> results in 3x the memory usage for Example 2. (Memory usage is
> checked using top.)
>
> This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
> 2.3.4 on CentOS 4.4 (64bit).
>
> Is this "just the way it is" or am I overlooking something obvious?
>
> Thanks,
> Matt
>
>
> Example 1: read lines into list:
> # begin readlines.py
Interesting name for the file :-)
How about using the file.readlines() method?
Why do you want all 200 MB in memory at once anyway?
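To illustrate the two suggestions above, here is a minimal sketch written for Python 3 (the thread itself uses Python 2). The function name count_records and the sample data are made up for illustration; an in-memory io.StringIO stands in for a real file.

```python
import io

def count_records(lines):
    """Count non-blank records while holding only one line at a time."""
    count = 0
    for line in lines:  # a file object yields one line per iteration
        if line.strip():
            count += 1
    return count

# In-memory stand-in for a real file (hypothetical data).
sample = io.StringIO("a|b|c\nd|e|f\n\n")
print(count_records(sample))  # prints 2

# By contrast, readlines() builds the complete list of lines in one call.
sample.seek(0)
all_lines = sample.readlines()
print(len(all_lines))  # prints 3 (the blank line is included)
```

Iterating keeps memory use roughly constant regardless of file size, whereas readlines() is the one-call way to get what the while loop in readlines.py builds by hand.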
> import sys, time
> filedata = list()
> file = open(sys.argv[1])
You have just clobbered the builtin file() function/type. In this case
it doesn't matter, but you should lose the habit, quickly.
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     filedata.append(line)
> file.close()
> print "data read; sleeping 20 seconds..."
> time.sleep(20) # gives time to check top
How about using raw_input('Hit the Any key...') ?
> # end readlines.py
>
>
> Example 2: read lines into objects:
> # begin readobjects.py
> import sys, time
> class FileRecord:
>     def __init__(self, line):
>         self.line = line
> records = list()
> file = open(sys.argv[1])
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     rec = FileRecord(line)
>     records.append(rec)
> file.close()
> print "data read; sleeping 20 seconds..."
> time.sleep(20) # gives time to check top
> # end readobjects.py
After all that, you still need to split the lines into the more-than-one
fieldS (plural) that one would expect in a record.
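For reference, the plain split step might look like the sketch below (Python 3 syntax; split_record and the sample line are hypothetical, not from the original post). It assumes the delimiter never appears inside a field.

```python
def split_record(line, delimiter="|"):
    """Return the fields of one delimited line, without the trailing newline."""
    return line.rstrip("\r\n").split(delimiter)

fields = split_record("Biff|Bloggs|biff@example.com\n")
print(fields)       # ['Biff', 'Bloggs', 'biff@example.com']
print(len(fields))  # 3
```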
A possibly faster alternative to (fastest_line_reader_so_far,
line.split('|')) is to use the csv module, as in the following example,
which also shows one way of making an object out of a row of data.
C:\junk>type readpipe.py
import sys, csv

class Contacts(object):
    __slots__ = ['first', 'family', 'email']
    def __init__(self, row):
        for attrname, value in zip(self.__slots__, row):
            setattr(self, attrname, value)

def readpipe(fname):
    if hasattr(fname, 'read'):
        f = fname
    else:
        f = open(fname, 'rb')
        # 'b' is in case you'd like your script to be portable
    reader = csv.reader(
        f,
        delimiter='|',
        quoting=csv.QUOTE_NONE,
        # Set quotechar to a char that you don't expect in your data
        # e.g. the ASCII control char BEL (0x07). This is necessary
        # for Python 2.3, whose csv module used the quoting arg only when
        # writing, otherwise your " characters may get stripped off.
        quotechar='\x07',
        skipinitialspace=True,
        )
    for row in reader:
        if row == ['']: # blank line
            continue
        c = Contacts(row)
        # do something useful with c, e.g.
        print [(x, getattr(c, x)) for x in dir(c)
            if not x.startswith('_')]

if __name__ == '__main__':
    if sys.argv[1:2]:
        readpipe(sys.argv[1])
    else:
        print '*** Testing ***'
        import cStringIO
        readpipe(cStringIO.StringIO('''\
Biff|Bloggs|b1ff at aol.com
Joseph ("Joe")|Blow|jblow at acoy.com
"Joe"|Blow|jblow at acoy.com
Santa|Claus|sclaus at northpole.org
'''))
C:\junk>\python23\python readpipe.py
*** Testing ***
[('email', 'b1ff at aol.com'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', 'jblow at acoy.com'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', 'jblow at acoy.com'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', 'sclaus at northpole.org'), ('family', 'Claus'), ('first', 'Santa')]
C:\junk>\python25\python readpipe.py
*** Testing ***
[('email', 'b1ff at aol.com'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', 'jblow at acoy.com'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', 'jblow at acoy.com'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', 'sclaus at northpole.org'), ('family', 'Claus'), ('first', 'Santa')]
C:\junk>
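The script above is Python 2 (print statement, cStringIO, 'rb' mode). A rough Python 3 adaptation of the same idea could look like the sketch below. The names Contact and read_contacts and the sample addresses are placeholders of mine, and the quotechar workaround is dropped on the assumption that Python 2.3 compatibility is no longer needed.

```python
import csv
import io

class Contact:
    # __slots__ means instances carry no per-instance __dict__.
    __slots__ = ("first", "family", "email")

    def __init__(self, row):
        for attrname, value in zip(self.__slots__, row):
            setattr(self, attrname, value)

def read_contacts(f):
    """Yield one Contact per non-blank row of a pipe-delimited text stream.

    A real file should be opened with open(path, newline=""), as the csv
    module documentation recommends.
    """
    reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE,
                        skipinitialspace=True)
    for row in reader:
        if not row or row == [""]:  # skip blank lines
            continue
        yield Contact(row)

# Hypothetical test data, including a blank line and a quoted name.
sample = io.StringIO(
    'Biff|Bloggs|biff@example.com\n'
    '\n'
    '"Joe"|Blow|joe@example.com\n'
)
contacts = list(read_contacts(sample))
print([(c.first, c.family) for c in contacts])
print(hasattr(contacts[0], "__dict__"))  # prints False
```

Making read_contacts a generator means the caller decides whether to keep every object (as list() does here) or to process records one at a time.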
HTH,
John