PyTables 0.5 released

Francesc Alted falted@openlc.org
Sat, 10 May 2003 13:38:25 +0200


Announcing PyTables 0.5
-----------------------

This is the second public beta release. In this release you will find
a 20% I/O speed improvement over the previous one (0.4), several bug
fixes, support for two new compression libraries (LZO and UCL),
and... a long-awaited Windows version is finally available!

In more detail:

What's new
----------

- As a consequence of some tweaking, write/read performance has
  improved by 20% overall. One particular case where performance has
  increased greatly (0.5 is up to 6 times faster than 0.4) is when
  column elements are unidimensional arrays. This impressive speed-up
  is mainly due to the recent performance improvements in numarray 0.5
  (good work, folks!). With that, the reading speed is
  reaching its theoretical maximum (at least when using the current
  data access schema).

- When reading a Table object, if the user fetches column elements
  that are unidimensional arrays, a copy of the array from the I/O
  buffer is now delivered automatically, so there is no need to call
  the .copy() method of the numarray arrays anymore. I think this is
  more comfortable for the user.

- Compression was enabled by default in version 0.4, despite what
  was stated in the documentation. This has now been corrected
  and compression is *disabled* by default.

- Support for two new compression libraries: LZO and UCL
  (http://www.oberhumer.com/opensource/). These libraries are made by
  Markus F.X.J. Oberhumer, and they stand out for allowing *very* fast
  decompression. Now, if your data is compressible, you can obtain
  better reading speed than when not using compression at all! The
  improvement is even more noticeable if you are dealing with
  extremely large (and compressible) data sets. Read the online
  documentation for more info about that:
  http://pytables.sourceforge.net/html-doc/usersguide-html3.html#subsection3.4.1

- A couple of memory leaks have been isolated and fixed (it was
  hard, but I finally did it!).

- A bug with the column ordering of tables that happened in some
  special situations has been fixed (thanks to Stan Heckman for
  reporting this and suggesting the patch).

- The File class now has an 'isopen' attribute to check whether a
  file is open or not.

- Updated documentation, especially to give advice about the use of
  the new compression libraries. See the "Compression issues"
  subsection (also on the web:
  http://pytables.sourceforge.net/html-doc/usersguide-html.html).

- Added more unit tests (up to 218 now!)

- PyTables has been tested against the newest numarray 0.5 and it
  works just fine. It even works well with Python 2.3b1.

- And last, but not least, a Windows version is available! Thanks to
  Alan McIntyre for porting it! There is even a binary ready to
  click and install.
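
As a rough illustration of the compression point above: the speed-up
comes from trading disk I/O for CPU work, so it only pays off when the
data really shrinks. The sketch below is not PyTables code and does not
use LZO or UCL. It uses zlib from the Python standard library as a
stand-in, on made-up sample data, just to show one way to measure how
compressible a buffer is before deciding whether to enable compression.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 1) -> float:
    """Return original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, level))


# Hypothetical sample data: a highly repetitive buffer and a random one.
repetitive = b"0123456789" * 100_000
random_bytes = os.urandom(1_000_000)

print("repetitive: %.1fx" % compression_ratio(repetitive))
print("random:     %.2fx" % compression_ratio(random_bytes))

# Decompression must give back exactly the original bytes.
assert zlib.decompress(zlib.compress(repetitive, 1)) == repetitive
```

A ratio close to 1 (as with the random buffer) means compression would
only add CPU cost; a large ratio suggests there is less data to move
from disk, which is where a fast decompressor can help.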


What it is
----------

In short, PyTables provides a powerful and very Pythonic interface to
process and organize your table and array data on disk.

Its goal is to enable the end user to easily manipulate scientific
data tables and Numerical and numarray Python objects in a persistent
hierarchical structure. The foundation of the underlying hierarchical
data organization is the excellent HDF5 library
(http://hdf.ncsa.uiuc.edu/HDF5).

A table is defined as a collection of records whose values are stored
in fixed-length fields. All records have the same structure and all
values in each field have the same data type. The terms
"fixed-length" and "strict data types" may seem like strange
requirements for an interpreted language like Python, but they serve a
useful function if the goal is to save very large quantities of data
(such as is generated by many scientific applications, for example) in
an efficient manner that reduces demand on CPU time and I/O resources.
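
To make the fixed-length idea concrete, here is a minimal sketch using
only the struct module from the Python standard library. It is not
PyTables code, and the record layout (a 16-byte name, a 32-bit integer
and a 64-bit float) is a made-up example. Because every record has the
same size, record i sits at byte offset i * record_size and can be read
without scanning anything that comes before it.

```python
import struct

# Hypothetical record layout: 16-byte name, int32 counter, float64 energy.
# "<" means little-endian with no padding, so the size is 16 + 4 + 8 bytes.
RECORD = struct.Struct("<16sid")


def pack_records(rows):
    """Serialize (name, counter, energy) rows into one contiguous buffer."""
    return b"".join(
        RECORD.pack(name.encode("ascii"), counter, energy)
        for name, counter, energy in rows
    )


def read_record(buf, i):
    """Fetch record i directly by its byte offset."""
    name, counter, energy = RECORD.unpack_from(buf, i * RECORD.size)
    # Short names are padded with NUL bytes by struct; strip them again.
    return name.rstrip(b"\x00").decode("ascii"), counter, energy


rows = [("particle%d" % i, i, i * 0.5) for i in range(1000)]
buf = pack_records(rows)
print(RECORD.size, len(buf))
print(read_record(buf, 742))
```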

Quite a bit of effort has been invested to make browsing the
hierarchical data structure a pleasant experience. PyTables implements
just two (orthogonal) easy-to-use methods for browsing.

What is HDF5?
-------------

For those who know nothing about HDF5, it is a general-purpose
library and file format for storing scientific data, developed at
NCSA. HDF5 can store two primary objects: datasets and groups. A
dataset is essentially a multidimensional array of data elements, and
a group is a structure for organizing objects in an HDF5 file. Using
these two basic constructs, one can create and store almost any kind of
scientific data structure, such as images, arrays of vectors, and
structured and unstructured grids. You can also mix and match them in
HDF5 files according to your needs.
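
The group/dataset idea can be pictured with a toy model in plain
Python. This is only an analogy and uses neither HDF5 nor PyTables:
here a group is a dict of named children, a dataset is a nested list
standing in for a multidimensional array, and all the names are made
up for illustration.

```python
# Toy model only: dicts play the role of groups, lists the role of datasets.
root = {
    "detector": {                      # group
        "image": [[0, 1], [2, 3]],     # 2-D dataset
        "calibration": {               # nested group
            "gains": [1.0, 1.1, 0.9],  # 1-D dataset
        },
    },
    "notes": [42],                     # dataset directly under the root
}


def lookup(group, path):
    """Resolve a '/'-separated path, much like a file system path."""
    node = group
    for name in path.strip("/").split("/"):
        node = node[name]
    return node


def walk(group, prefix=""):
    """Yield (path, kind) for every object below a group."""
    for name, child in group.items():
        path = prefix + "/" + name
        if isinstance(child, dict):
            yield path, "group"
            yield from walk(child, path)
        else:
            yield path, "dataset"


print(lookup(root, "/detector/calibration/gains"))
for path, kind in walk(root):
    print(kind, path)
```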

Platforms
---------

I'm using Linux as the main development platform, but PyTables should
be easy to compile/install on other UNIX machines. This package has
also passed all the tests on an UltraSparc platform with Solaris 7 and
Solaris 8. It also compiles and passes all the tests on an SGI
Origin2000 with MIPS R12000 processors and running IRIX 6.5.

On Windows, PyTables has been tested with Windows 2000 Professional SP1
and Windows XP, but it should also work with other flavors.

An example?
-----------

For online code examples, have a look at

http://pytables.sourceforge.net/tut/tutorial1-1.html

and

http://pytables.sourceforge.net/tut/tutorial1-2.html


Web site
--------

Go to the PyTables web site for more details:

http://pytables.sourceforge.net/

Share your experience
---------------------

Let me know of any bugs, suggestions, gripes, kudos, etc. you may
have.

Have fun!

-- Francesc Alted