Python good for data mining?

Fri Nov 9 13:09:41 EST 2007

[I've just seen this thread.  Although it might be a bit late, let me
add a couple of clarifications]

On 6 Nov, 03:09, "D.Hering" <vel.ac... at gmail.com> wrote:
> On Nov 5, 10:29 am, Maarten <maarten.sn... at knmi.nl> wrote:
>
> > As for pytables: it is the most elegant programming interface for HDF
> > on any platform that I've encountered so far. Most other platforms
> > stay close to the HDF5 library C-interface, which is low-level, and quite
> > complex. PyTables was written with the end-user in mind, and it shows.
> > One correction though: PyTables is not a database: it is a storage for
> > (large) arrays, datablocks that you don't want in a database. Use a
> > database for the metadata to find the right file and field within that
> > file. Keep in mind though that I mostly work with externally created
> > HDF-5 files, not with files created in pytables. PyTables Pro has an
> > indexing feature which may be helpful for datamining (if you write the
> > hdf-5 files from python).
>
> > Maarten
>
> Hi Maarten,
>
> I respectfully disagree that HDF5 is not a DB. It's true that HDF5 on
> its prima facie is not relational but rather hierarchical.

Yeah.  This largely depends on what we understand by a DB.  Lately,
RDBMSs are used everywhere, and we tend to believe that they are the
only entities that can truly be called DBs.  However, in a less
restrictive view, even a text file can be considered a DB (in fact,
many DBs have been implemented using text files as a base).  So, I
wouldn't say that HDF5 is not a DB, just that it is not an RDBMS ;)

> Hierarchical is truly a much more natural/elegant[1] design from my
> perspective. HDF has always had meta-data capabilities and with the
> new 1.8beta version available, it is increasing its ability with
> 'references/links' allowing for pure/partial relational datasets,
> groups, and files as well as storing self implemented indexing.
>
> The C API is obviously much more low level, and Pytables does not yet
> support these new features.

That's correct.  And although it is quite possible that we, the
PyTables developers, will end up incorporating some relational
features into it, we also recognize that PyTables is not intended
(and was never planned) to be a competitor to a pure RDBMS, but rather
a *teammate* (as is clearly stated on the www.pytables.org home page).

In our opinion, PyTables opens the door to a series of capabilities
that are not found in a typical RDBMS, like hierarchical classification,
multidimensional datasets, powerful table entities that are able to
deal with multidimensional columns or nested records, and, most
especially, the ability to work with extremely large amounts of data in
a very easy way, without having to give up first-class speed.
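
As a rough illustration of two of the capabilities listed above
(hierarchical groups and a table with a multidimensional column), here
is a minimal sketch.  It assumes a PyTables 3.x release, whose
snake_case names (open_file, create_table) differ from the camelCase
API current in 2007.  The Particle layout, the run1 group and the
build_and_query helper are hypothetical names chosen for this example;
nested records are left out to keep it short.

```python
import os
import tempfile

import tables  # PyTables; this sketch assumes the 3.x snake_case API


class Particle(tables.IsDescription):
    """Row layout for a hypothetical 'particles' table."""
    name = tables.StringCol(16)                # fixed-width byte string
    energy = tables.Float64Col()               # scalar column
    position = tables.Float64Col(shape=(3,))   # multidimensional column


def build_and_query(path):
    """Write a small hierarchical file, then select rows by condition."""
    with tables.open_file(path, mode="w", title="demo") as h5:
        # Groups give the file its hierarchy, much like directories.
        run = h5.create_group("/", "run1", "First (hypothetical) run")
        table = h5.create_table(run, "particles", Particle)

        row = table.row
        for i in range(5):
            row["name"] = f"p{i}".encode()
            row["energy"] = float(i)
            row["position"] = (i, i * 2.0, i * 3.0)
            row.append()
        table.flush()

    with tables.open_file(path, mode="r") as h5:
        table = h5.root.run1.particles
        # The condition string is evaluated inside PyTables rather than
        # in an explicit Python loop over the rows.
        hits = table.read_where("energy > 2.5")
        return [(r["name"].decode(), r["position"].tolist()) for r in hits]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, position in build_and_query(os.path.join(tmp, "demo.h5")):
            print(name, position)
```

The table lives at /run1/particles inside the file, and each row
carries a three-element position vector in a single column, something
a plain relational column would not hold directly.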

> [1] Anything/everything that is physical/virtual, or can be conceived
> is hierarchical... if the system itself is not random/chaotic. That's a
> lovely revelation I've had... EVERYTHING is hierarchical. If it has
> context it has hierarchy.

While I agree that there is some truth in this sentence, it is also
known that a lot of things in the universe (perhaps many more than we
think) fall squarely within the domain of the random/chaotic ;)

IMO, the wisest path is to recognize the strengths (and weaknesses)
of each approach and use whatever best fits your needs.  If you need
the best of both, then go ahead and choose an RDBMS in combination
with a hierarchical DB, and use the powerful capabilities of Python
to get the most out of them.
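
One way to picture that combination is a small relational catalog that
records where each dataset lives inside the HDF5 files, as Maarten
suggested earlier in the thread.  The sketch below is only an
illustration under stated assumptions: it uses the standard-library
sqlite3 module as the RDBMS, and the datasets table, its columns, the
file names and the node paths are all hypothetical.  Opening the
returned files and nodes with an HDF5 library is left out.

```python
import sqlite3


def build_catalog(conn):
    """Create a hypothetical metadata catalog and fill it with sample rows."""
    conn.execute(
        "CREATE TABLE datasets ("
        " id INTEGER PRIMARY KEY,"
        " instrument TEXT NOT NULL,"
        " day TEXT NOT NULL,"        # ISO date, so text comparison sorts correctly
        " h5_file TEXT NOT NULL,"    # which HDF5 file holds the data
        " h5_node TEXT NOT NULL)"    # path of the node inside that file
    )
    rows = [
        ("sensor_a", "2007-11-05", "archive/2007_11.h5", "/sensor_a/day05"),
        ("sensor_a", "2007-11-06", "archive/2007_11.h5", "/sensor_a/day06"),
        ("sensor_b", "2007-11-05", "archive/2007_11.h5", "/sensor_b/day05"),
        ("sensor_a", "2007-12-01", "archive/2007_12.h5", "/sensor_a/day01"),
    ]
    conn.executemany(
        "INSERT INTO datasets (instrument, day, h5_file, h5_node)"
        " VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()


def locate(conn, instrument, first_day, last_day):
    """Return (file, node) pairs for one instrument within a date range."""
    cur = conn.execute(
        "SELECT h5_file, h5_node FROM datasets"
        " WHERE instrument = ? AND day BETWEEN ? AND ?"
        " ORDER BY day",
        (instrument, first_day, last_day),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_catalog(conn)
    for h5_file, h5_node in locate(conn, "sensor_a", "2007-11-01", "2007-11-30"):
        # A real program would now open h5_file and read h5_node from it.
        print(h5_file, h5_node)
```

The relational side answers "which file and which node?", while the
bulk arrays stay in the hierarchical files, so each tool does the job
it is better at.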

Cheers,

Francesc Altet