efficient data loading with Python, is that possible?

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Wed Dec 12 20:52:53 EST 2007


On Wed, 12 Dec 2007 14:48:03 -0800, igor.tatarinov wrote:

> Hi, I am pretty new to Python and trying to use it for a relatively
> simple problem of loading a 5 million line text file and converting it
> into a few binary files. The text file has a fixed format (like a
> punchcard). The columns contain integer, real, and date values. The
> output files are the same values in binary. I have to parse the values
> and write the binary tuples out into the correct file based on a given
> column. It's a little more involved but that's not important.

I suspect that this actually is important, and that your slowdown has 
everything to do with the stuff you dismiss and nothing to do with 
Python's object model or execution speed.
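
For what it's worth, slicing a fixed-width record and packing it into
binary is cheap if you do it with the struct module. A minimal sketch,
with invented column positions and field types (your real layout will
obviously differ):

    import struct
    import time

    def pack_line(line):
        # Hypothetical punchcard layout: an 8-digit integer, a 12-char
        # real, and a YYYYMMDD date.  The slice positions are made up.
        key = int(line[0:8])
        value = float(line[8:20])
        date = time.mktime(time.strptime(line[20:28], '%Y%m%d'))
        # Pack as a native int and two doubles.
        return struct.pack('=idd', key, value, date)

The per-line work is just a handful of slices and conversions; if each
line is costing you much more than that, the problem is in the
surrounding code, not in Python itself.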


> I have a C++ prototype of the parsing code and it loads a 5 Mline file
> in about a minute. I was expecting the Python version to be 3-4 times
> slower and I can live with that. Unfortunately, it's 20 times slower and
> I don't see how I can fix that.

I've run a quick test on my machine with a mere 1GB of RAM, reading the 
entire file into memory at once, and then doing some quick processing on 
each line:


>>> def make_big_file(name, size=5000000):
...     fp = open(name, 'w')
...     for i in xrange(size):
...             fp.write('here is a bunch of text with a newline\n')
...     fp.close()
...
>>> make_big_file('BIG')
>>> 
>>> def test(name):
...     import time
...     start = time.time()
...     fp = open(name, 'r')
...     for line in fp.readlines():
...             line = line.strip()
...             words = line.split()
...     fp.close()
...     return time.time() - start
...
>>> test('BIG')
22.53150200843811

Twenty-two seconds to read five million lines and split them into words. 
I suggest the other nineteen minutes and forty-odd seconds your code is 
taking has something to do with your code and not Python's execution 
speed.
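
If you want to see where the time is actually going, the cProfile 
module will tell you. Something along these lines, assuming your entry 
point is a function called main():

    import cProfile, pstats

    cProfile.run('main()', 'loader.prof')
    pstats.Stats('loader.prof').sort_stats('cumulative').print_stats(20)

That lists the twenty most expensive calls by cumulative time, which is 
usually enough to find the culprit.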

Of course, I wouldn't normally read all 5M lines into memory in one big 
chunk. Replace the code 

    for line in fp.readlines():

with

    for line in fp:

and the time drops from 22 seconds to 16.
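
That is, the measuring function becomes (unchanged apart from the loop):

    def test(name):
        import time
        start = time.time()
        fp = open(name, 'r')
        for line in fp:          # lines are read lazily, one at a time
            line = line.strip()
            words = line.split()
        fp.close()
        return time.time() - start

Iterating over the file object also keeps memory use flat instead of 
holding all five million lines at once.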



-- 
Steven


