Python good for data mining?

Mon Nov 5 21:09:10 EST 2007

On Nov 5, 10:29 am, Maarten <maarten.sn... at knmi.nl> wrote:
> On Nov 5, 1:51 pm, Jens <j3n... at gmail.com> wrote:
>
>
>
> > On 5 Nov., 04:42, "D.Hering" <vel.ac... at gmail.com> wrote:
>
> > > On Nov 3, 9:02 pm, Jens <j3n... at gmail.com> wrote:
>
> > > I then learned C and then C++. I am now coming home to Python,
> > > realizing after my self-education that programming in Python is
> > > truly a pleasure and that performance is not the concern I first
> > > considered it to be.
> > > Here's why:
>
> > > Python is very easily extended to near C speed. The idea that FINALLY
> > > sank in was that I should first program my ideas in Python WITHOUT
> > > CONCERN FOR PERFORMANCE. Then, profile the application to find the
> > > "bottlenecks" and rewrite those blocks of code in C or C++. Cython/
> > > Pyrex/SIP are my preferred Python extension frameworks.
>
> > > NumPy/SciPy are excellent libraries for optimized mathematical
> > > operations. PyTables is my preferred Python database because of
> > > its excellent API to the acclaimed HDF5 database (used by very many
> > > scientists and government organizations).
>
> > So what you're saying is, don't worry about performance when you start
> > coding, but use profiling and then optimize in C/C++. Sounds
> > reasonable. It's been 10 years since I've done any programming in
> > C++, so I have to pick up on that soon I guess.
>
> "Premature optimization is the root of all evil", to quote a famous
> person. And he's right, as most people working on larger codebases
> will confirm.
>
> As for PyTables: it is the most elegant programming interface for HDF
> on any platform that I've encountered so far. Most other platforms
> stay close to the HDF5 library C interface, which is low-level and
> quite complex. PyTables was written with the end user in mind, and it
> shows. One correction though: PyTables is not a database: it is
> storage for (large) arrays, datablocks that you don't want in a
> database. Use a database for the metadata to find the right file and
> field within that file. Keep in mind though that I mostly work with
> externally created HDF5 files, not with files created in PyTables.
> PyTables Pro has an indexing feature which may be helpful for data
> mining (if you write the HDF5 files from Python).
>
> Maarten

Hi Maarten,

I respectfully disagree that HDF5 is not a DB. It's true that HDF5 is,
prima facie, not relational but rather hierarchical.

Hierarchical is truly a much more natural/elegant[1] design from my
perspective. HDF has always had metadata capabilities, and the new
1.8 beta version extends them with 'references/links', which allow for
purely or partially relational datasets, groups, and files, as well as
for storing self-implemented indexing.

The C API is obviously much lower-level, and PyTables does not yet
support these new features.
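
To make the hierarchy-plus-metadata point concrete, here is a minimal
sketch of how groups and per-node attributes look from PyTables. It is
only an illustration: it assumes a recent PyTables 3.x release (with
snake_case names such as open_file and create_group; older releases
spelled these openFile and createGroup), and the file name, group names,
and attribute values are all made up for the example.

```python
import os
import tempfile

import numpy as np
import tables  # PyTables; a 3.x release is assumed here


def build_demo_file(path):
    """Write a tiny HDF5 file with nested groups and metadata attributes."""
    with tables.open_file(path, mode="w", title="hierarchy demo") as h5:
        # Groups nest like directories: /station_a/y2007 (hypothetical names)
        station = h5.create_group("/", "station_a", title="hypothetical station")
        year = h5.create_group(station, "y2007")

        # The array is a leaf node hanging off the innermost group.
        temps = h5.create_array(year, "temperature", np.array([11.5, 12.0, 9.8]))

        # Metadata is stored on the nodes themselves.
        temps.attrs.units = "degC"
        station._v_attrs.operator = "example operator"


def list_arrays(path):
    """Return (node path, units attribute) for every array in the file."""
    with tables.open_file(path, mode="r") as h5:
        return [
            (node._v_pathname, node.attrs.units)
            for node in h5.walk_nodes("/", classname="Array")
        ]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
    build_demo_file(demo_path)
    for node_path, units in list_arrays(demo_path):
        print(node_path, units)
```

Walking the tree and reading attributes like this is a plain traversal,
not an indexed query; it only shows that the metadata sits alongside the
data inside the hierarchy.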

[1] Anything/everything that is physical/virtual, or that can be
conceived, is hierarchical... if the system itself is not
random/chaotic. That's a lovely revelation I've had... EVERYTHING is
hierarchical. If it has context, it has hierarchy.