efficient data loading with Python, is that possible?

Thu Dec 13 08:05:28 EST 2007

igor:
> The fundamental difference is that in C++, I create a single object (a
> line buffer) that's reused for each input line and column values are
> extracted straight from that buffer without creating new string
> objects. In python, new objects must be created and destroyed by the
> million which must incur serious memory management overhead.

Python does indeed create many objects (as I think Tim once said, "it
allocates memory at a ferocious rate"), but its memory management
is quite efficient. You may also use the Psyco JIT (currently
1000 times more useful than PyPy, despite sadly no longer being
developed), which in some situations avoids copying data (for example,
in slices). Python is designed for string processing, and in my
experience string-processing programs run under Psyco can be faster than
similar not-optimized-to-death C++/D programs (you can see this with
manually crafted code, or with ShedSkin, which is often slower than Psyco
at string processing). But in every language I know, to gain performance
you need to know the language, and Python isn't C++, so other kinds of
tricks are necessary.
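
As a rough illustration of the kind of trick I mean (only a sketch: the
tab-separated format, the column choice and the name load_columns are my
assumptions, not igor's actual data), let the built-in split do the
per-line work in C and keep just the columns you need:

```python
import io

def load_columns(lines, wanted=(0, 2), sep="\t"):
    """Return a list of tuples holding only the wanted columns of each line."""
    rows = []
    append = rows.append          # bind the method once, outside the loop
    for line in lines:
        # One built-in call splits the whole line; no per-character Python code.
        fields = line.rstrip("\n").split(sep)
        append(tuple(fields[i] for i in wanted))
    return rows

# Hypothetical sample input standing in for a real file object.
sample = io.StringIO("a\t1\tx\nb\t2\ty\n")
rows = load_columns(sample)
```

New string objects are still created for every field, but the loop body
stays small and most of the time is spent inside built-ins.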

The following advice is useful too:

DouhetSukd:
>Bottom line:  Python built-in data objects, such as dictionaries and
sets, are very much optimized.  Relying on them, rather than writing a
lot of ifs and doing weird data structure manipulations in Python
itself, is a good approach to try.  Try to build those objects outside
of your main processing loops.<
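
A minimal sketch of what that can look like, with made-up names
(STATUS_CODES and classify are hypothetical): build the dictionary once,
before the loop, and let a lookup replace the chain of ifs.

```python
# Hypothetical lookup table, built once outside the processing loop.
STATUS_CODES = {"ok": 0, "warn": 1, "fail": 2}

def classify(words, table=STATUS_CODES, default=-1):
    """Map each word to its code with a dict lookup instead of if/elif tests."""
    get = table.get               # bind the bound method once
    return [get(w, default) for w in words]

codes = classify(["ok", "fail", "unknown", "warn"])
```

Each word costs one hash lookup, however many keys the table holds.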

Bye,
bearophile