efficient data loading with Python, is that possible?

John Machin sjmachin at lexicon.net
Wed Dec 12 19:03:17 EST 2007


On Dec 13, 9:48 am, igor.tatari... at gmail.com wrote:
> Hi, I am pretty new to Python and trying to use it for a relatively
> simple problem of loading a 5 million line text file and converting it
> into a few binary files. The text file has a fixed format (like a
> punchcard). The columns contain integer, real, and date values. The
> output files are the same values in binary. I have to parse the values
> and write the binary tuples out into the correct file based on a given
> column. It's a little more involved but that's not important.
>
> I have a C++ prototype of the parsing code and it loads a 5 Mline file
> in about a minute. I was expecting the Python version to be 3-4 times
> slower and I can live with that. Unfortunately, it's 20 times slower
> and I don't see how I can fix that.
>
> The fundamental difference is that in C++, I create a single object (a
> line buffer) that's reused for each input line and column values are
> extracted straight from that buffer without creating new string
> objects. In Python, new objects must be created and destroyed by the
> million which must incur serious memory management overhead.

Don't stress out about it; the core devs have put in a few neat
optimisations in the last approx 17 years :-)

> I hope I am missing something.

You probably are: there is a multitude of possible reasons why newbie
code in any language runs slowly. Twenty minutes to process 5M lines
does seem excessive. However, without seeing your code we can't help
much.

    int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))

can be improved by looking up those two functions in the time module
once per run, rather than doing two attribute lookups for every date
field. Inside your function [you are doing all this inside a function,
not at global level in a script, aren't you?], do this:

    from time import mktime, strptime # do this ONCE
    ...
    blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))

It would help if you told us what platform, what version of Python,
how much memory, how much swap space, ...
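One possible way to collect the platform and Python version for such a
report is sketched below. The function name describe_environment is an
assumption for illustration. Memory and swap figures are not covered,
since the standard library has no portable call for them; those would
have to come from the operating system's own tools.

```python
import platform
import sys


def describe_environment():
    """Return basic platform details worth quoting in a help request.

    describe_environment is a hypothetical helper name.
    """
    return {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "executable": sys.executable or "(unknown)",
    }


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print(key + ": " + value)
```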

Cheers,
John


