[SciPy-User] Pylab - standard packages

Sat Sep 22 11:04:47 EDT 2012

On Sat, Sep 22, 2012 at 3:56 AM, Thomas Kluyver <takowl at gmail.com> wrote:

> Andrew: Thanks for the info about h5py. As I don't use HDF5 myself,
> can someone describe, as impartially as possible, the differences
> between PyTables and h5py: how do the APIs differ, any speed
> difference, how well known are they, what do they depend on, and what
> depends on them (e.g. I think pandas can use PyTables?). If it's
> sensible to include both, we can do so, but I'd like to get a feel for
> what they each are.

I'm certainly not unbiased, but while we're waiting for others to
rejoin the discussion I can give my perspective on this question.  I
never saw h5py and PyTables as direct competitors; they have different
design goals.  To me the basic difference is that PyTables is both a
way to talk to HDF5 and a really great database-like interface with
things like indexing, searching, etc. (both NumExpr and Blosc came out
of work on PyTables, I believe).  In contrast, h5py arose by asking
"how can we map the basic HDF5 abstractions to Python in a direct but
still Pythonic way".

The API for h5py has both a high-level and low-level component; like
PyTables, the high-level component is oriented around files, datasets
and groups, allows iteration over elements in the file, etc. The
emphasis in h5py is to use existing objects and abstractions from
NumPy; for example, datasets have .dtype and .shape attributes and can
be sliced like NumPy arrays.  Groups are treated like dictionaries,
are iterable, have .keys() and .iteritems() and friends, etc.
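
A minimal sketch of the high-level behaviour described above, assuming a
recent h5py on Python 3 (where .items() takes the place of the
.iteritems() mentioned above).  The file, group and dataset names are
made up for illustration, and the in-memory "core" driver is used only
so that nothing is written to disk.

```python
import numpy as np
import h5py

# In-memory HDF5 file: the "core" driver with backing_store=False keeps
# everything in RAM. The file name is arbitrary (illustrative only).
with h5py.File("highlevel_sketch.h5", "w", driver="core",
               backing_store=False) as f:
    # Groups behave much like dictionaries of named members.
    grp = f.create_group("run1")
    dset = grp.create_dataset(
        "temps", data=np.arange(12, dtype="f8").reshape(3, 4)
    )

    # Datasets expose NumPy-style attributes and slicing.
    shape = dset.shape
    dtype = dset.dtype
    first_row = dset[0, :]          # reads just this slice into a NumPy array

    # Dictionary-style access and iteration over groups.
    group_names = list(f.keys())
    member_names = [name for name in grp]
    same_data = f["run1"]["temps"][...]
```

Slicing the dataset returns an ordinary NumPy array, so the result can be
passed straight to other NumPy-based code.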

The "main" high-level interface in h5py also rests on a huge low-level
interface written in Cython
(http://h5py.alfven.org/docs/low/index.html), which exposes the
majority of the HDF5 C API in a Pythonic, object-oriented way.  The
goal here is that anything you can do with HDF5 in C, you can also do
in Python.
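
As a rough illustration of how the two layers relate, the sketch below
reaches from a high-level dataset down to its low-level identifier
object and calls a few methods that wrap the HDF5 C API.  It assumes a
recent h5py; the file and dataset names are hypothetical, and this is
only one small corner of the low-level interface.

```python
import numpy as np
import h5py

# In-memory file again, so the sketch leaves nothing on disk.
with h5py.File("lowlevel_sketch.h5", "w", driver="core",
               backing_store=False) as f:
    dset = f.create_dataset("counts", data=np.zeros((2, 5), dtype="i4"))

    # Every high-level object carries a low-level identifier in .id.
    dsid = dset.id                          # an h5py.h5d.DatasetID
    is_low_level = isinstance(dsid, h5py.h5d.DatasetID)

    # These methods wrap the corresponding HDF5 C calls
    # (H5Dget_space, H5Sget_simple_extent_dims, H5Dget_type, H5Tget_size).
    space = dsid.get_space()
    dims = space.get_simple_extent_dims()
    type_size = dsid.get_type().get_size()  # element size in bytes
```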

h5py has no dependencies beyond NumPy and Python itself; I will let
others chime in about specific projects which depend on h5py.  As a
rough proxy for popularity, h5py has had about 30k downloads over the
life of the project (10k in the past year).

I have never benchmarked PyTables against h5py, but I strongly suspect
PyTables is faster.  Most of the development effort that has recently
gone into h5py has been focused on other areas like API coverage,
Python 3 support, Unicode, and thread safety; we've never done careful
performance testing.

I am eager to hear other perspectives, especially from the PyTables team.

Andrew