ANN: PyTables 1.1 released

Francesc Altet faltet at carabos.com
Fri Jul 15 13:26:50 CEST 2005


=========================
 Announcing PyTables 1.1
=========================

After quite a few testing iterations, the PyTables development team is
happy to announce the availability of PyTables 1.1.

In this version you will find support for a nice set of new features,
like nested datatypes, enumerated datatypes, nested iterators (for
reading only), support for native HDF5 multidimensional attributes, a
new object for dealing with compressed, non-enlargeable arrays
(CArray), bzip2 compression support and more. Many bugs have been
addressed as well.

Go to the PyTables web site to download the new beast:
http://pytables.sourceforge.net/

or keep reading for more info about the new features and bugs fixed.


Changes in more depth
=====================

Improvements:

- Support for nested datatypes is in place. You can now make table
  columns that host other columns to an unlimited depth (well,
  theoretically; in practice, until the Python recursion limit is
  reached). Convenient NestedRecArray objects have been implemented as
  data containers. The Cols and Description accessors have been
  improved so you can navigate the type hierarchy very easily (natural
  naming has been implemented for the task).

- ``Table``, ``EArray`` and ``VLArray`` objects now support enumerated
  types.  ``Array`` objects support opening existing HDF5 enumerated
  arrays.  Enumerated types are restricted sets of ``(name, value)``
  pairs.  Use the ``Enum`` class to easily define new enumerations
  that will be saved along with your data.

- Now, the HDF5 library is responsible for doing data conversions when
  datasets are written on a machine with a different byte ordering
  than the machine that reads them. With this, all the data is
  converted on the fly and you always get native datatypes in
  memory. I think this approach is more convenient in terms of CPU
  consumption when using these datasets. Right now, this only works
  for tables, though.

- Added support for native HDF5 multidimensional attributes. Now, you
  can load native HDF5 files that contain fully multidimensional
  attributes; these attributes will be mapped to NumArray
  objects. Also, when you save NumArray objects as attributes, they
  get saved as native HDF5 attributes (before, NumArray attributes
  were pickled).

- A brand-new class, called CArray, has been introduced. It is much
  like the Array class (i.e. non-enlargeable), but with compression
  capabilities enabled. The existence of CArray also allows PyTables
  to read native HDF5 chunked, non-enlargeable datasets.

- The Bzip2 compressor is supported. This support was already in
  PyTables 1.0, but we forgot to announce it.

- The new LZO2 compressor
  (http://www.oberhumer.com/opensource/lzo/lzonews.php) is supported.
  The installer now recognizes whether LZO1 or LZO2 is installed, and
  adapts automatically. If both are installed on your system, then
  LZO2 is chosen. LZO2 claims to be fully compatible (both backward
  and forward) with LZO1, so you should not experience any problems
  during this transition.

- The old limit of 256 columns in a table has been removed. Now, you
  can have tables with any number of columns, although if you use
  too many (i.e. > 1024), you will start to consume a lot of system
  resources. You have been warned!

- The limit on the length of column names has been removed as well.

- Nested iterators for reading in tables are supported now.

- A new tutorial section on how to modify values in tables and arrays
  has been added to the User's Manual.
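
As a rough illustration of the byte-ordering item above, the following
sketch uses only Python's standard ``struct`` module, not PyTables or
HDF5. It shows what converting a value stored in big-endian order into
the reader's native order amounts to. The value 1025 and all variable
names are made up for the example.

```python
import struct

# A 4-byte signed integer as a big-endian machine would store it.
stored_big_endian = struct.pack(">i", 1025)

# Interpreting those bytes with the wrong byte order gives a wrong number.
misread = struct.unpack("<i", stored_big_endian)[0]

# Decoding with the byte order the data was written in recovers the value.
value = struct.unpack(">i", stored_big_endian)[0]

# Re-pack in this machine's native order ("=" selects native byte order
# with standard sizes); this is the form the value takes in memory.
native_bytes = struct.pack("=i", value)

print(value, misread)
```

In PyTables 1.1 this kind of conversion is delegated to the HDF5
library for tables, so user code does not need to do it by hand.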

Backward-incompatible changes:

- None.

Bug fixes:

- VLArray now correctly updates its internal row counter when opening
  an existing VLArray object. Now you can add new rows to existing
  VLArrays without problems.

- Tuple flavor for VLArrays now works as intended, i.e. reading
  VLArray objects will always return tuples even in the case of
  multidimensional Atoms. Before, these operations returned a mix of
  tuples and lists.

- If a column cannot be indexed because it has too few entries, then
  _whereInRange is called instead of _whereIndexed. Fixes #1203202.

- You can now call Row.append() in the middle of Table iterators without
  resetting loop counters. Fixes #1205588.

- PyTables used to give a segmentation fault when removing the last
  row of a table with the table.removeRows() method. This is due to a
  limitation in the HDF5 library. Until this gets fixed in HDF5, a
  NotImplemented error is raised when trying to do that. Addresses
  #1201023.

- You can safely break a loop over an iterator returned by
  Table.where(). Fixes #1234637.

- When removing a Group with hidden child groups, those are
  effectively closed now.

- Now, there is a distinction between shapes 1 and (1,) in tables. The
  former represents a scalar, and the latter a 1-D array with just one
  element. That follows the numarray convention for records, and makes
  more sense as well. Before 1.1, shapes 1 and (1,) were both
  represented by a scalar on disk.

Known bugs:

- Classes inheriting from IsDescription subclasses do not
  inherit columns defined in the super-class. See SF bug #1207732 for
  more info.

- Time datatypes are non-portable between big-endian and little-endian
  architectures. This is ultimately a consequence of an HDF5
  limitation. See SF bug #1234709 for more info.


Important note for MacOSX users
===============================

The UCL compressor seems to work badly on MacOSX platforms. Until the
problem is isolated and eventually solved, UCL will not be compiled by
default on MacOSX platforms, even if the installer finds it on the
system. However, if you still want UCL support on MacOSX, you can use
the --force-ucl flag in setup.py.


Important note for Python 2.4 and Windows users
===============================================

If you want to use PyTables with Python 2.4 on Windows platforms, you
will need to get the HDF5 library compiled for MSVC 7.1, aka .NET
2003.  It can be found at:
ftp://ftp.ncsa.uiuc.edu/HDF/HDF5/current/bin/windows/5-164-win-net.ZIP

Users of Python 2.3 on Windows will have to download the version of
HDF5 compiled with MSVC 6.0, available at:
ftp://ftp.ncsa.uiuc.edu/HDF/HDF5/current/bin/windows/5-164-win.ZIP


What it is
==========

**PyTables** is a package for managing hierarchical datasets,
designed to efficiently cope with extremely large amounts of data
(with support for full 64-bit file addressing).  It features an
object-oriented interface that, combined with C extensions for the
performance-critical parts of the code, makes it a very easy-to-use
tool for high performance data storage and retrieval.

Perhaps its most interesting feature is that it optimizes memory and
disk resources so that data take much less space (by a factor of 3 to
5, and more if the data is compressible) than in other solutions, such
as relational or object-oriented databases.

Besides, PyTables I/O for table objects is buffered, implemented in C
and carefully tuned so that you can reach much better performance with
PyTables than with your own home-grown wrappings to the HDF5
library. PyTables sports indexing capabilities as well, allowing
selections in tables exceeding one billion rows in just seconds.


Where can PyTables be applied?
==============================

PyTables is not designed to work as a relational database competitor,
but rather as a teammate.  If you want to work with large datasets of
multidimensional data (for example, for multidimensional analysis), or
just provide a categorized structure for some portions of your
cluttered RDBS, then give PyTables a try.  It works well for storing
data from data acquisition systems (DAS), simulation software, network
data monitoring systems (for example, traffic measurements of IP
packets on routers), very large XML files, or for creating a
centralized repository for system logs, to name only a few possible
uses.


What is a table?
================

A table is defined as a collection of records whose values are stored
in fixed-length fields.  All records have the same structure and all
values in each field have the same data type.  The requirements of
"fixed-length" fields and "strict data types" may seem strange for a
language like Python that supports dynamic data types, but they serve
a useful function if the goal is to save very large quantities of data
(such as is generated by many scientific applications, for example) in
an efficient manner that reduces demand on CPU time and I/O resources.
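
To make the benefit of fixed-length records concrete, here is a small
sketch that uses Python's standard ``struct`` module rather than
PyTables itself. The record layout (a 16-byte name, a 32-bit integer
and a 64-bit float) and the helper ``read_row`` are hypothetical,
chosen only for illustration. Because every record has the same size,
the position of any row can be computed directly instead of scanning
the rows before it.

```python
import struct

# Hypothetical record layout: 16-byte name, 32-bit int, 64-bit float.
# "<" fixes the byte order and disables padding, so every record has
# exactly the same size.
RECORD = struct.Struct("<16sid")

rows = [(b"alpha", 1, 1.5), (b"beta", 2, 2.5), (b"gamma", 3, 3.5)]

# Pack all rows into one contiguous buffer, as they might sit on disk.
buffer = b"".join(RECORD.pack(*row) for row in rows)

def read_row(buf, index):
    """Return row `index` by computing its byte offset directly."""
    name, number, value = RECORD.unpack_from(buf, index * RECORD.size)
    # Fixed-length string fields are padded with NUL bytes; strip them.
    return name.rstrip(b"\x00"), number, value

print(RECORD.size)           # bytes per record
print(read_row(buffer, 2))   # third record, read without touching the others
```

The same property is what lets a table store large numbers of rows
compactly: no per-value type information or length prefix is needed.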


What is HDF5?
=============

For those who know nothing about HDF5, it is a general purpose
library and file format for storing scientific data, developed at NCSA.
HDF5 can store two primary objects: datasets and groups.  A dataset is
essentially a multidimensional array of data elements, and a group is
a structure for organizing objects in an HDF5 file.  Using these two
basic constructs, one can create and store almost any kind of
scientific data structure, such as images, arrays of vectors, and
structured and unstructured grids.  You can also mix and match them in
HDF5 files according to your needs.
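
The group/dataset hierarchy described above can be pictured with a toy
model in plain Python. This is only a sketch of the structure, not of
HDF5 storage or typing: groups are modelled as dictionaries, datasets
as nested lists, and the names ``root``, ``get_node`` and ``is_group``
are invented for the example.

```python
# Groups are modelled as dicts and datasets as (nested) lists.
# Assumption: this mimics only the hierarchy, nothing else about HDF5.
root = {
    "detector": {
        "images": [[0, 1], [2, 3]],   # a 2-D dataset
        "calibration": [0.5, 1.5],    # a 1-D dataset
    },
    "log": [10, 20, 30],              # a dataset directly under the root
}

def get_node(group, path):
    """Follow a '/'-separated path from `group` to a group or dataset."""
    node = group
    for name in path.strip("/").split("/"):
        if name:                      # an empty name means the root itself
            node = node[name]
    return node

def is_group(node):
    """In this toy model, a group is anything that can hold named children."""
    return isinstance(node, dict)

print(get_node(root, "/detector/images"))
print(is_group(get_node(root, "/detector")))
```

Addressing objects by path in this way is similar in spirit to how
nodes in an HDF5 file are located within nested groups.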


Platforms
=========

We are using Linux on top of Intel32 as the main development platform,
but PyTables should be easy to compile/install on other UNIX machines.
This package has also been successfully compiled and tested on
FreeBSD 5.4 with Opteron64 processors, an UltraSparc platform with
Solaris 7 and Solaris 8, an SGI Origin3000 with Itanium processors
running IRIX 6.5 (using the gcc compiler), Microsoft Windows and
MacOSX (10.2, although 10.3 should work fine as well). In particular,
it has been thoroughly tested on 64-bit platforms, like Linux-64 on
top of an Intel Itanium, AMD Opteron (in 64-bit mode) or PowerPC G5
(in 64-bit mode), where all the tests pass successfully.

Regarding Windows platforms, PyTables has been tested with Windows
2000 and Windows XP (using the Microsoft Visual C compiler), but it
should work with other flavors as well.


Web site
========

Go to the PyTables web site for more details:

http://pytables.sourceforge.net/

To learn more about the company behind the PyTables development, see:

http://www.carabos.com/


Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.


----

  **Enjoy data!**

  -- The PyTables Team


