[ANN] PyTables 0.8 is out!

Francesc Alted falted@pytables.org
Mon, 8 Mar 2004 21:52:06 +0100


[Oops, I should have published this here last week, but I made a mistake]

I'm happy to announce the availability of PyTables 0.8.

PyTables is a hierarchical database package designed to efficiently
manage very large amounts of data. PyTables is built on top of the HDF5
library and the numarray package. It features an object-oriented
interface that, combined with natural naming and C-code generated from
Pyrex sources, makes it a fast, yet extremely easy-to-use tool for
interactively saving and retrieving different kinds of datasets. It also
provides flexible indexed access on disk to anywhere in the data.

In this release you will find:
	- Variable Length Arrays (VLA's) for saving datasets in which
	  each row holds a variable number of elements.
	- Enlargeable Arrays (EA's) for enlarging homogeneous
	  datasets on disk.
	- Powerful replication capabilities, ranging from single leaves
	  up to complete hierarchies.
	- With the introduction of the UnImplemented class, greatly
	  improved HDF5 native file import capabilities.
	- Two new useful utilities: ptdump & ptrepack.
	- Improved documentation (with the help of Scott Prater).
	- Enhanced platform support. New platforms: MacOSX, FreeBSD,
	  Linux64, IRIX64 (yes, a clean 64-bit port is there) and
	  probably more.
	- More unit tests (now exceeding 800).
	- Many other minor improvements.

Besides, a new record in data size has been achieved using PyTables 0.8:
5.6 TB (~1000 DVDs) in a single file(!). See the gory details in:
http://pytables.sf.net/html/StressTests.html.

In more detail:

What's new
----------

	- The new VLArray class enables you to store large lists of rows
	  containing a variable number of elements. The elements can
	  be scalars or fully multidimensional objects, in the PyTables
	  tradition. This class supports two special objects as rows:
	  Unicode strings (UTF-8 encoding is used internally) and
	  generic Python objects (through the use of cPickle).

	- The new EArray class allows you to enlarge already existing
	  multidimensional homogeneous data objects. Consider it
	  an extension of the existing Array class, but with more
	  functionality. Online compression or other filters can be
	  applied to EArray instances, for example.

	  Another nice feature of EA's is their support for fully
	  multidimensional data selection with extended slices. You can
	  write "earray[1,2:3,...,4:200]", for example, to get the
	  desired dataset slice from the disk. This is implemented using
	  the powerful selection capabilities of the HDF5 library, which
	  results in highly efficient I/O operations. The same
	  functionality has been added to Array objects as well.
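
	  As a rough illustration, here is a minimal sketch of how the
	  two new classes are meant to be used. The creation calls
	  follow the createVLArray()/createEArray() API described in
	  the User's Manual, but exact argument names may vary between
	  versions, so take this as orientation, not as a reference:

import tables

fileh = tables.openFile("new_classes.h5", mode="w")

# A variable-length array: every row may hold a different number of ints.
vla = fileh.createVLArray(fileh.root, "vla", tables.Int32Atom(),
                          "rows with a variable number of elements")
vla.append([1, 2, 3])      # first row: three elements
vla.append([4, 5])         # second row: two elements

# An enlargeable array: rows can be appended along the dimension that is
# declared with a 0 in the atom shape (an assumption of this sketch).
ea = fileh.createEArray(fileh.root, "ea", tables.Int32Atom(shape=(0, 200)),
                        "an enlargeable homogeneous dataset")
ea.append([[0] * 200])     # add one row of 200 integers
piece = ea[0, 4:20]        # extended slicing reads just this chunk from disk

fileh.close()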

	- New UnImplemented class. If a dataset contains unsupported
	  datatypes, it will be associated with an UnImplemented
	  instance, then inserted into the object tree as usual.
	  This allows you to continue to work with supported objects
	  while retaining access to attributes of unsupported
	  datasets. This has changed from previous versions, where a
	  RuntimeError occurred when an unsupported object was
	  encountered.

	  The combination of the new UnImplemented class with the
	  support for new datatypes will enable PyTables to greatly
	  increase the number of types of native HDF5 files that can
	  be read and modified.

	- Boolean support has been added for all the Leaf objects.

	- The Table class now has an append() method that allows you
	  to save large buffers of data in one go (i.e. bypassing the
	  Row accessor). This can greatly improve data gathering
	  speed.
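
	  For instance (a hedged sketch; the record layout and file
	  name below are made up for illustration), a whole buffer of
	  rows can be written in a single call:

import tables

class Reading(tables.IsDescription):    # hypothetical record layout
    idx   = tables.Int32Col()
    name  = tables.StringCol(16)
    value = tables.Float64Col()

fileh = tables.openFile("bulk.h5", mode="w")
tbl = fileh.createTable(fileh.root, "readings", Reading, "bulk append demo")

# Append 10000 rows in one go, bypassing the Row accessor.
tbl.append([(i, "sensor-%d" % i, float(i)) for i in range(10000)])
tbl.flush()
fileh.close()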

	- The standard HDF5 shuffle filter (to further enhance the
          compression level) is supported.

	- The standard HDF5 fletcher32 checksum filter is supported.

	- As the supported number of filters is growing (and may be
          further increased in the future), a Filters() class has been
          introduced to handle filters more easily. In order to add
          support for this class, it was necessary to make a change in
          the createTable() method that is not backwards compatible:
          the "compress" and "complib" parameters are deprecated now
          and the "filters" parameter should be used in their
          place. You will be able to continue using the old parameters
          (only a Deprecation warning will be issued) for the next few
          releases, but you should migrate to the new version as soon
          as possible. In general, migrating old code is easy; a call
          like

tbl = fileh.createTable(group, 'table', Test, '', complevel, complib)

	  should be replaced by

tbl = fileh.createTable(group, 'table', Test, '', Filters(complevel, complib))
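
	  For instance, the new shuffle and fletcher32 filters described
	  above can be combined with zlib compression through a single
	  Filters instance (a sketch based on the keyword names used in
	  the User's Manual):

filters = Filters(complevel=5, complib="zlib", shuffle=1, fletcher32=1)
tbl = fileh.createTable(group, 'table', Test, '', filters=filters)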

	- A copy() method that supports slicing and modification of
	  filtering capabilities has been added for all the Leaf
	  objects. See the User's Manual for more information.

	- A couple of new methods, namely copyFile() and copyChilds(),
	  have been added to the File class to permit easy replication
	  of complete hierarchies or sub-hierarchies, even to other
	  files. You can change filters during the copy process as
	  well.
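
	  A hedged sketch of how these replication calls look (node
	  names are made up; the exact signatures and the extra options
	  they accept are described in the User's Manual):

tbl2 = fileh.root.table.copy(fileh.root, "table2")   # replicate a single leaf
fileh.copyFile("backup.h5")                          # clone the whole hierarchy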

	- Two new utilities have been added: ptdump and ptrepack. The
	  ptdump utility allows the user to examine the contents of
	  PyTables files (both metadata and actual data). The powerful
	  ptrepack utility lets you selectively copy (portions of)
	  hierarchies to specific locations in other files. It can also
	  be used as an importer for generic HDF5 files.

	- The meaning of the stop parameter in read() methods has
	  changed. Now a value of None means the last row, and a value
	  of 0 (zero) means the first row. This is more consistent with
	  the range() function in Python and the __getitem__() special
	  method in numarray.

	- The method Table.removeRows() is no longer limited by table
	  size. You can now delete rows regardless of the size of the
	  table.

	- The "numarray" value has been added to the flavor parameter
 =A0=A0=A0=A0=A0=A0=A0=A0=A0in the Table.read() method for completeness.

	- The attributes (the .attrs instance variable) are now Python
	  properties. Access to their values is no longer lazy,
	  i.e. you will be able to see both system and user attributes
	  from the command line using the tab-completion capability of
	  your Python console (if enabled).
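
	  For example, taking the tbl object from the snippet above (or
	  any other Leaf node), a tiny sketch:

tbl.attrs.creator = "DAS station 3"    # set a user attribute
creator = tbl.attrs.creator            # read it back; tab-completion now
                                       # shows it alongside system attributes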

	- Documentation has been greatly improved to explain all the
	  new functionality. In particular, the internal format of
	  PyTables is now fully described. You can now build "native"
	  PyTables files using any generic HDF5 software by just
	  duplicating their format.

	- Many new tests have been added, not only to check new
	  functionality but also to more stringently check existing
	  functionality. There are more than 800 different tests now
	  (and the number is increasing :).

	- New platforms supported: PyTables has been compiled and tested
	  under GNU/Linux32 (Intel), GNU/Linux64 (AMD Opteron and
	  Alpha), Win32 (Intel), MacOSX (PowerPC), FreeBSD (Intel),
	  Solaris (6, 7, 8 and 9 with UltraSparc), IRIX64 (IRIX 6.5 with
	  R12000), and it probably works on many more architectures. In
	  particular, release 0.8 is the first one that provides a
	  relatively clean port to 64-bit platforms.

	- As always, some bugs have been solved (especially bugs that
	  occur when deleting and/or overwriting attributes).

	- And last, but definitely not least, a new donations section
	  has been added to the PyTables web site
	  (http://sourceforge.net/projects/pytables, then follow the
	  "Donations" tag). If you like PyTables and want this effort
	  to continue, please, donate!


Where can PyTables be applied?
------------------------------

PyTables is not designed to work as a relational database competitor,
but rather as a teammate. If you want to work with large datasets of
multidimensional data (for example, for multidimensional analysis), or
just provide a categorized structure for some portions of your cluttered
RDBMS, then give PyTables a try. It works well for storing data from data
acquisition systems (DAS), simulation software, network data monitoring
systems (for example, traffic measurements of IP packets on routers),
working with very large XML files or as a centralized repository for
system logs, to name only a few possible uses.

What is a table?
----------------

A table is defined as a collection of records whose values are stored in
fixed-length fields. All records have the same structure and all values
in each field have the same data type. The terms "fixed-length" and
"strict data types" may seem an odd requirement for a language like
Python, which supports dynamic data types, but they serve a useful
function if the goal is to save very large quantities of data (such as
that generated by many scientific applications) in an efficient manner
that reduces demand on CPU time and I/O resources.
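
A minimal sketch of what such a table looks like in practice (the record
layout and file name are, of course, made up):

import tables

class Particle(tables.IsDescription):
    grid  = tables.Int32Col()      # 32-bit signed integer field
    name  = tables.StringCol(16)   # fixed-length 16-character string field
    value = tables.Float64Col()    # double-precision float field

fileh = tables.openFile("example.h5", mode="w")
table = fileh.createTable(fileh.root, "particles", Particle, "demo table")

row = table.row                    # the Row accessor
for i in range(1000):
    row['grid']  = i
    row['name']  = 'particle-%d' % i
    row['value'] = i * 0.5
    row.append()                   # every record has the same fixed structure
table.flush()
fileh.close()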

What is HDF5?
-------------

For those people who know nothing about HDF5, it is a general-purpose
library and file format, developed at NCSA, for storing scientific data. HDF5
can store two primary objects: datasets and groups. A dataset is
essentially a multidimensional array of data elements, and a group is a
structure for organizing objects in an HDF5 file. Using these two basic
constructs, one can create and store almost any kind of scientific data
structure, such as images, arrays of vectors, and structured and
unstructured grids. You can also mix and match them in HDF5 files
according to your needs.
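
In PyTables these two constructs map directly onto the object tree; a
small sketch (names are made up):

import tables

fileh = tables.openFile("layout.h5", mode="w")
detector = fileh.createGroup(fileh.root, "detector")        # an HDF5 group
fileh.createArray(detector, "frame", [[1, 2], [3, 4]],      # an HDF5 dataset
                  "a 2x2 integer array")
fileh.close()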

Platforms
---------

I'm using Linux (Intel 32-bit) as the main development platform, but
PyTables should be easy to compile/install on many other UNIX
machines. This package has also passed all the tests on an UltraSparc
platform with Solaris 7 and Solaris 8. It also compiles and passes all
the tests on a SGI Origin2000 with MIPS R12000 processors, with the
MIPSPro compiler and running IRIX 6.5. It also runs fine on Linux 64-bit
platforms, like an AMD Opteron running SuSe Linux Enterprise Server. It
has also been tested on MacOSX platforms (10.2, but it should also work
on newer versions).

Regarding Windows platforms, PyTables has been tested with Windows
2000 and Windows XP (using the Microsoft Visual C compiler), but it
should work with other flavors as well.

An example?
-----------

For online code examples, have a look at

http://pytables.sourceforge.net/html/tut/tutorial1-1.html

and, for the newly introduced Variable Length Arrays:

http://pytables.sourceforge.net/html/tut/vlarray2.html

Web site
--------

Go to the PyTables web site for more details:

http://pytables.sourceforge.net/


Share your experience
---------------------

Let me know of any bugs, suggestions, gripes, kudos, etc. you may
have.

Have fun!

-- Francesc Alted
falted@pytables.org