Python good for data mining?

Jens j3nsby at gmail.com
Mon Nov 5 08:51:33 EST 2007


On 5 Nov., 04:42, "D.Hering" <vel.ac... at gmail.com> wrote:
> On Nov 3, 9:02 pm, Jens <j3n... at gmail.com> wrote:
>
>
>
> > I'm starting a project in data mining, and I'm considering Python and
> > Java as possible platforms.
>
> > I'm concerned about performance. Most benchmarks report that Java is
> > about 10-15 times faster than Python, and my own experiments confirm
> > this. I can imagine this becoming a problem for very large
> > datasets.
>
> > How good is the integration with MySQL in Python?
>
> > What about user interfaces? How easy is it to use Tkinter for
> > developing a user interface without an IDE? And with an IDE? (which
> > IDE?)
>
> > What if I were to use my Python libraries with a web site written in
> > PHP, Perl or Java - how do I integrate with Python?
>
> > I really like Python for a number of reasons, and would like to avoid
> > Java.
>
> > Sorry - lot of questions here - but I look forward to your replies!
>
> All of my programming is data centric. Data mining is foundational
> therein. I started learning computer science via Python in 2003. I
> too was concerned about its performance, especially considering my
> need for literally trillions of iterations over financial data tables
> with mathematical algorithms.
>
> I then learned C and then C++. I am now coming home to Python, realizing
> after my self-education that programming in Python is truly a pleasure
> and that performance is not the concern I first considered it to be.
> Here's why:
>
> Python is very easily extended to near C speed. The idea that FINALLY
> sank in was that I should first program my ideas in Python WITHOUT
> CONCERN FOR PERFORMANCE. Then, profile the application to find the
> "bottlenecks" and move those blocks of code to C or C++. Cython,
> Pyrex and SIP are my preferred Python extension frameworks.
>
> NumPy/SciPy are excellent libraries for optimized mathematical
> operations. PyTables is my preferred Python database because of
> its excellent API to the acclaimed HDF5 format (used by very many
> scientists and government organizations).
>
> As for a GUI framework, I have studied Qt intensely and would therefore
> very highly recommend PyQt.
>
> After four years of intense study, I can say without a doubt that
> Python is most certainly the way to go. I personally don't understand
> why, generally, there is any attraction to Java, though I have yet to
> study it further.

Thanks a lot! I agree, Python is a pleasure to program in.

So what you're saying is: don't worry about performance when you start
coding, then profile and optimize the bottlenecks in C/C++. Sounds
reasonable. It's been 10 years since I've done any programming in C++,
so I guess I'll have to pick that up again soon.
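
For what it's worth, here is a minimal sketch of what the "profile first"
step might look like with the standard library's cProfile module. The
function slow_distance_sum and the helper profile_call are hypothetical
names made up for illustration; they just stand in for whatever part of
a real program turns out to be slow.

```python
import cProfile
import io
import pstats


def slow_distance_sum(points):
    """Hypothetical hot spot: sum of pairwise squared distances in pure Python."""
    total = 0.0
    for (x1, y1) in points:
        for (x2, y2) in points:
            total += (x1 - x2) ** 2 + (y1 - y2) ** 2
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show only the top five entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


# Made-up sample data, just to have something to measure.
points = [(float(i), float(i % 7)) for i in range(200)]
result, report = profile_call(slow_distance_sum, points)
print(report)
```

The functions at the top of the report would be the candidates for
rewriting with NumPy or in C/C++; everything else can stay in plain
Python.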

I've used NumPy for improving my K-Means algorithm, and it now runs
33% faster than "pure" Python. I guess it could be improved upon
further.
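
As a rough sketch of the kind of thing I mean (not my actual code), the
two inner steps of K-Means can be written with NumPy broadcasting
instead of Python loops over points. The names assign_clusters and
update_centroids are made up for illustration, and the sketch assumes
squared Euclidean distance and that no cluster ends up empty.

```python
import numpy as np


def assign_clusters(points, centroids):
    """Return the index of the nearest centroid for each point."""
    # points is (n, d), centroids is (k, d); broadcasting gives (n, k, d).
    diff = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)  # (n, k) squared distances
    return np.argmin(sq_dist, axis=1)


def update_centroids(points, labels, k):
    """Recompute each centroid as the mean of its assigned points.

    Assumes every cluster has at least one point assigned to it.
    """
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])


# Tiny made-up data set with two obvious clusters.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

labels = assign_clusters(points, centroids)
new_centroids = update_centroids(points, labels, 2)
print(labels)
print(new_centroids)
```

The distance computation builds an (n, k, d) temporary array, so for
very large datasets it would probably have to be done in chunks.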

I will have a look at PyQt!



