ANN: PyTables 0.7 released

Fri, 1 Aug 2003 01:20:56 +0200

Announcing PyTables 0.7
-----------------------

PyTables is a hierarchical database package designed to efficiently
manage very large amounts of data. PyTables is built on top of the
HDF5 library and the numarray package and features an object-oriented
interface that, combined with C code generated from Pyrex sources,
makes it a fast yet extremely easy-to-use tool for interactively saving
and retrieving large amounts of data.

Release 0.7 is the third public beta release. Version 0.6 was
internal and will never be released.

In this release you will find:
       - new AttributeSet class
       - 25% I/O speed improvement
       - full support for multidimensional table cells
       - new column descriptors
       - row deletion in tables is finally here
       - much more!

In more detail:

What's new
----------

- A new AttributeSet class has been added. It allows you to add and
  delete generic attributes (any scalar type plus any Python object
  supported by Pickle) as easily as this:

  table.attrs.date = "2003/07/28 10:32"     # Attach a string to table
  group._v_attrs.tempShift = 1.2            # Attach a float to group
  array.attrs.detectorList = [1,2,3,4]      # Attach a list to array
  del array.attrs.detectorList        # Detach detectorList attr from array

- PyTables now fully supports multidimensional table cells. This
  has been made possible in part by the implementation of
  multidimensional cells in the numarray.records.RecArray object.
  Thanks to the numarray crew, and especially to Jin-chung Hsu, for
  willingly agreeing to do that, and also for including some cache
  improvements in RecArray.

- New column descriptors added: IntCol, Int8Col, UInt8Col, Int16Col,
  UInt16Col, Int32Col, UInt32Col, Int64Col, UInt64Col, FloatCol,
  Float32Col, Float64Col and StringCol. I think they are more explicit
  and easier to use than the now deprecated (but still supported)
  Col() descriptor. All the examples and the user's manual have been
  updated accordingly.

- The new Table.removeRows(start, stop) function allows you to remove 
  rows from tables. This feature was requested a long time ago. There 
  are still limitations, however: you cannot delete rows in extremely 
  large Tables (as the remaining rows after the stop parameter 
  are stored in memory). Nor is the performance optimized.  These issues 
  will hopefully be addressed in future releases.

- Added iterators to File, Group and Table (they now support the special
  __iter__() method). They make these objects much more user-friendly,
  especially in interactive mode. See the documentation for usage
  examples.

- Added a __getitem__() method to Table that works more or less like
  read(), but with support for extended slices.

- As a consequence of rewriting table iterators in C (with the help of
  Pyrex, of course), table read performance has improved by between
  20% and 30%. Data selections in PyTables are now starting to
  beat powerful relational databases like SQLite, even compared to
  in-core selects (!). I think there is still room for another 20% or
  30% speed improvement, so stay tuned.

- A checksum is now added automatically when using LZO (but not with
  UCL, where I'm having some difficulties implementing that
  capability). The Adler32 algorithm has been chosen because of its
  speed. With that, the compressing/decompressing speed has dropped by
  1% or 2%, which is hardly noticeable. I think this addition will allow
  the cautious user to be a bit more confident about this excellent
  compressor. Code has been added to read files created without this
  checksum (so you can be confident that you will be able to read your
  existing files compressed with LZO and UCL).

- Recursion has been removed from PyTables. Before, this limited the
  maximum tree depth to less than the Python recursion limit (which
  depends on the implementation, but is around 900, at least on
  Linux). Now, the limit has been set (somewhat arbitrarily) at
  2048. Thanks to John Nielsen for implementing the new iterative
  method!

- A new rootUEP parameter to openFile() has been added. You can now 
  define the root from which you want to start to build the object tree. 
  Thanks to John Nielsen for the suggestion and a first implementation.

- A small bug has been fixed that appeared when dealing with non-native
  PyTables files and prevented the use of the "classname" filter during
  a listNodes() call. Thanks to Jeff Robbins for reporting that.

- Some (non-serious) bugs were discovered and fixed.

- Updated documentation to explain all these new bells and whistles. It 
  is also available on the web:
  http://pytables.sourceforge.net/html-doc/usersguide-html.html

- Added more unit tests (more than 350 now!)

- PyTables 0.7 *needs* numarray 0.6 or higher and HDF5 1.6.0 or higher
  to compile and work. It has been tested with Python 2.2 and 2.3 and
  should work fine on both versions.
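
As an illustration of the recursion removal mentioned in the list above,
here is a minimal standalone sketch of walking a deep tree with an
explicit stack instead of recursive calls. It is not PyTables code: the
dict-based tree and the names build_chain() and max_depth_iterative()
are hypothetical, chosen only to show why an iterative walk is not bound
by the interpreter's recursion limit.

```python
def build_chain(depth):
    """Build a nested tree `depth` levels deep (hypothetical dict layout)."""
    root = {"name": "/", "children": []}
    node = root
    for level in range(depth):
        child = {"name": "group%d" % level, "children": []}
        node["children"].append(child)
        node = child
    return root


def max_depth_iterative(root):
    """Return the depth of the deepest node, using an explicit stack."""
    deepest = 0
    stack = [(root, 0)]  # (node, depth) pairs still to visit
    while stack:
        node, depth = stack.pop()
        deepest = max(deepest, depth)
        for child in node["children"]:
            stack.append((child, depth + 1))
    return deepest


# 2048 levels is the depth limit quoted above; a recursive walk of a
# tree this deep would typically exceed the default recursion limit.
deep_tree = build_chain(2048)
print(max_depth_iterative(deep_tree))
```

The stack grows on the heap rather than on the interpreter's call
stack, so the only practical bound is available memory.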

What is a table?
----------------

A table is defined as a collection of records whose values are stored
in fixed-length fields. All records have the same structure and all
values in each field have the same data type.  The terms
"fixed-length" and "strict data types" may seem like strange
requirements for a language like Python, which supports dynamic data
types, but they serve a useful function if the goal is to save very
large quantities of data (such as those generated by many scientific
applications, for example) in an efficient manner that reduces demand
on CPU time and I/O resources.
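
To make the benefit of fixed-length fields concrete, here is a small
sketch using only Python's standard struct module. It is not how
PyTables stores data internally; the record layout (a 16-byte name, a
32-bit integer and a 64-bit float) and the helper names pack_rows() and
read_row() are assumptions made for illustration. Because every record
has the same size, the position of any row can be computed directly
instead of scanning the preceding rows.

```python
import struct

# Hypothetical layout: 16-byte string, 32-bit int, 64-bit float,
# little-endian with no padding between fields.
RECORD = struct.Struct("<16sid")


def pack_rows(rows):
    """Pack (name, count, value) tuples into one contiguous buffer."""
    return b"".join(
        RECORD.pack(name.encode("ascii"), count, value)
        for name, count, value in rows
    )


def read_row(buffer, index):
    """Read row `index` directly; its offset is index * RECORD.size."""
    name, count, value = RECORD.unpack_from(buffer, index * RECORD.size)
    return name.rstrip(b"\x00").decode("ascii"), count, value


rows = [("alpha", 1, 1.5), ("beta", 2, 2.5), ("gamma", 3, 3.5)]
data = pack_rows(rows)
print(RECORD.size)         # bytes occupied by each record
print(read_row(data, 2))   # jumps straight to the third record
```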

What is HDF5?
-------------

For those people who know nothing about HDF5, it is a general
purpose library and file format for storing scientific data, developed
at NCSA. HDF5 can store two primary objects: datasets and groups. A
dataset is essentially a multidimensional array of data elements, and
a group is a structure for organizing objects in an HDF5 file. Using
these two basic constructs, one can create and store almost any kind of
scientific data structure, such as images, arrays of vectors, and
structured and unstructured grids. You can also mix and match them in
HDF5 files according to your needs.

Platforms
---------

I'm using Linux as the main development platform, but PyTables should
be easy to compile/install on other UNIX machines. This package has
also passed all the tests on an UltraSparc platform with Solaris 7 and
Solaris 8. It also compiles and passes all the tests on an SGI
Origin2000 with MIPS R12000 processors running IRIX 6.5.

Regarding Windows platforms, PyTables has been tested with Windows
2000 and Windows XP, but it should also work with other flavors.

An example?
-----------

For online code examples, have a look at

http://pytables.sourceforge.net/tut/tutorial1-1.html

and 

http://pytables.sourceforge.net/tut/tutorial1-2.html

Web site
--------

Go to the PyTables web site for more details:

http://pytables.sourceforge.net/

Share your experience
---------------------

Let me know of any bugs, suggestions, gripes, kudos, etc. you may
have.

Have fun!

-- Francesc Alted
falted@openlc.org