efficient data loading with Python, is that possible?

George Sakkis george.sakkis at gmail.com
Wed Dec 12 18:50:43 EST 2007


On Dec 12, 5:48 pm, igor.tatari... at gmail.com wrote:
> Hi, I am pretty new to Python and trying to use it for a relatively
> simple problem of loading a 5 million line text file and converting it
> into a few binary files. The text file has a fixed format (like a
> punchcard). The columns contain integer, real, and date values. The
> output files are the same values in binary. I have to parse the values
> and write the binary tuples out into the correct file based on a given
> column. It's a little more involved but that's not important.
>
> I have a C++ prototype of the parsing code and it loads a 5 Mline file
> in about a minute. I was expecting the Python version to be 3-4 times
> slower and I can live with that. Unfortunately, it's 20 times slower
> and I don't see how I can fix that.
>
> The fundamental difference is that in C++, I create a single object (a
> line buffer) that's reused for each input line and column values are
> extracted straight from that buffer without creating new string
> objects. In Python, new objects must be created and destroyed by the
> million, which must incur serious memory management overhead.
>
> Correct me if I am wrong but
>
> 1) for line in file: ...
> will create a new string object for every input line
>
> 2) line[start:end]
> will create a new string object as well
>
> 3) int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))
> will create 10 objects (since struct_time has 8 fields)
>
> 4) a simple test: line[i:j] + line[m:n] in hash
> creates 3 strings and there is no way to avoid that.
>
> I thought arrays would help but I can't load an array without creating
> a string first: ar(line, start, end) is not supported.
>
> I hope I am missing something. I really like Python but if there is no
> way to process data efficiently, that seems to be a problem.
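
On points 1 to 4: the per-line and per-slice string objects are real, but
the struct module can split a fixed-width record in one call, and the date
field can be converted without going through time.strptime. A rough sketch
follows, written for current Python 3 and reading lines as bytes. The
column layout (6-character integer, 8-character real, 12-character
mmddyyHHMMSS timestamp), the output format, and the names RECORD, OUT,
parse_timestamp, and convert_line are all assumptions made for
illustration, not the real file format.

```python
import struct
import time

# Hypothetical record layout (an assumption, not the poster's real format):
# columns 0-5 integer id, 6-13 real value, 14-25 timestamp as mmddyyHHMMSS.
RECORD = struct.Struct("6s8s12s")
# Hypothetical binary output: little-endian int32, float64, int32.
OUT = struct.Struct("<idi")


def parse_timestamp(field):
    """Convert mmddyyHHMMSS (bytes) to epoch seconds without time.strptime."""
    month, day, yy = int(field[0:2]), int(field[2:4]), int(field[4:6])
    hour, minute, second = int(field[6:8]), int(field[8:10]), int(field[10:12])
    # Same two-digit-year pivot that %y uses: 69-99 -> 19xx, 00-68 -> 20xx.
    year = 2000 + yy if yy < 69 else 1900 + yy
    # Weekday and day-of-year are ignored by mktime; -1 lets it decide on DST.
    return int(time.mktime((year, month, day, hour, minute, second, 0, 1, -1)))


def convert_line(line):
    """Turn one fixed-width text record (bytes) into one packed binary record."""
    raw_id, raw_value, raw_stamp = RECORD.unpack_from(line)
    return OUT.pack(int(raw_id), float(raw_value), parse_timestamp(raw_stamp))
```

Used as "for line in open(path, 'rb'): out.write(convert_line(line))",
this still creates a handful of short-lived objects per line, so whether it
closes the gap has to be measured on the real data.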

20 times slower because of garbage collection sounds kinda fishy.
Posting some actual code usually helps; it's hard to tell for sure
otherwise.
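
If posting the code is not practical, a profile of a run over a sample of
the file would at least show where the time goes. A minimal sketch with
the standard cProfile and pstats modules, written for current Python 3;
profile_call and slice_fields are hypothetical names, and slice_fields is
only a stand-in for the real parsing loop.

```python
import cProfile
import io
import pstats


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slice_fields(lines):
    """Hypothetical stand-in for the real loop: slice and convert one column."""
    return sum(int(line[0:6]) for line in lines)
```

Calling profile_call(slice_fields, lines) and printing the report text
lists the most expensive functions first, which is usually enough to tell
whether object creation or something else (strptime, say) dominates.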

George


