Using Python for processing of large datasets (convincing management)

Matt Gerrans mgerrans at mindspring.com
Sat Jul 6 15:37:07 EDT 2002


> One of the next development tasks is rewriting the nightly processing
> job which is having problems with our ~100mb database (it is written in
> Borland C++, but absolutely not optimized for speed!).

BCB is great -- and you can still use it for the performance-critical areas
by creating COM Automation servers which are a snap to call from Python.
One of the great things about Python is that it works well with C/C++ so
that you can eat your cake and have it too.

> The goals of the rewritten piece of software would be:
> * Improved speed

Python is not going to help in this area, unfortunately, unless you are
talking about improved speed of development!  ;-)

> * Improved scalability - parallel processing on multiple machines/CPUs

This might be more easily accomplished with Java, depending on exactly how
you intend to implement it.   Java is probably the best tool for distributed
processing; in particular JINI is ideal for this kind of thing.

> * Improved scalability - ability to handle greater databases (>1gb)

This is probably more dependent on your design than the language or platform
you choose.

> * Ability to calculate only a subset of the data

Also dependent more on your design.

> Now, instead of rewriting the job in C++, I'd (of course) like to use
> Python.

Naturally!

> However the CEO (small company, told you :-), made a couple of somewhat
> valid points against it.
> 1) He was worried about getting a replacement developer in case I left.

I don't think that is a problem at all, these days.   I think Python
developers are becoming pretty ubiquitous.   On top of that, any experienced
programmer can learn Python in a snap -- it is so engaging that it is fun
and quick to learn.

In fact, you can show him some Python and the equivalent C++ as a
demonstration of how much simpler and more elegant Python is.   For instance,
can you imagine writing a small program in C++ which will recurse directories
doing a search-and-replace operation with regular-expression support?   It
is a big task in C++, but it is pretty trivial in Python.
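
To give an idea of the scale involved, here is a minimal sketch of that
recursive search-and-replace using only the standard library.   The function
name replace_in_tree and the suffix filter are my own choices for
illustration, and it assumes the files are UTF-8 text that fits in memory.

```python
import os
import re


def replace_in_tree(root, pattern, replacement, suffix=".txt"):
    """Apply a regex substitution to every file under root ending in suffix.

    Returns the list of paths whose contents actually changed.
    Hypothetical helper for illustration; assumes UTF-8 text files.
    """
    regex = re.compile(pattern)
    changed = []
    # os.walk visits root and every directory beneath it.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as handle:
                text = handle.read()
            # subn returns the new text and the number of substitutions made.
            new_text, count = regex.subn(replacement, text)
            if count:
                with open(path, "w", encoding="utf-8") as handle:
                    handle.write(new_text)
                changed.append(path)
    return changed
```

Calling replace_in_tree("src", r"colour", "color") would rewrite every .txt
file under a (hypothetical) src directory.   Try it on a copy first, since it
overwrites files in place.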

> 2) He said, "Name 3 companies using Python for key functions"

I'd bet *every* company in the Fortune 500 uses Python for one thing or
another, whether they know it or not.   Many are probably using it for very
important functions; they just don't advertise it.   Why should they --
their business is not about explaining how they accomplish every task, it is
about doing it.   I have developed Python code for one of the largest of
them that is very key to their business, but I doubt that the CEO would know
of it or that the company would tout this fact -- what they care about is
creating and selling their products.

> 3) He was worried about the stability/reliability of python in our
> production environment (you know, 99.999 % and all that)

As long as you are not using it for GUI development, in my experience, it is
extremely solid.

> I was hoping someone in this group could help with some really
> compelling arguments, as I'd really like to use Python for this job.

I think the most compelling argument you can come up with is to write a demo
in Python that works on a subset of the data, as you mentioned above.    The
speed with which you can develop and the quality of the code you develop
will be the biggest selling factor.
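
As a rough starting point for such a demo, something like the following
might do.   I don't know your schema, so the CSV layout, the column names
(region, amount) and the sample rows are all made up; the point is only that
records are read one at a time and filtered down to a subset before any
calculation happens.

```python
import csv
import io
import time


def read_records(stream):
    """Yield one row at a time as a dict, so memory use stays flat."""
    yield from csv.DictReader(stream)


def total_for_subset(records, keep):
    """Sum the hypothetical 'amount' column over rows that keep() accepts."""
    return sum(float(row["amount"]) for row in records if keep(row))


# Made-up sample data standing in for an export of the real database.
SAMPLE = "region,amount\nnorth,10.5\nsouth,4.0\nnorth,2.5\n"

start = time.perf_counter()
north_total = total_for_subset(
    read_records(io.StringIO(SAMPLE)),
    lambda row: row["region"] == "north",  # the subset to calculate
)
elapsed = time.perf_counter() - start
print(f"north total: {north_total} (took {elapsed:.6f}s)")
```

Swapping io.StringIO(SAMPLE) for an open file on a real export would let you
time the same code against a realistic slice of the data.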

Be aware that your demo could just as well convince you that Python is not
the right tool for the job.   Python is a great tool, but it is not the best
tool for *every* task.




