[Numpy-discussion] [ANN] PyTables 0.8
Francesc Alted
falted at pytables.org
Wed Mar 3 08:01:04 EST 2004
I'm happy to announce the availability of PyTables 0.8.
PyTables is a hierarchical database package designed to efficiently
manage very large amounts of data. PyTables is built on top of the
HDF5 library and the numarray package. It features an object-oriented
interface that, combined with natural naming and C-code generated from
Pyrex sources, makes it a fast, yet extremely easy-to-use tool for
interactively saving and retrieving very large amounts of data. It also
provides flexible indexed access on disk to anywhere in the data.
PyTables is not designed to work as a relational database competitor,
but rather as a teammate. If you want to work with large datasets of
multidimensional data (for example, for multidimensional analysis), or
just provide a categorized structure for some portions of your cluttered
RDBS, then give PyTables a try. It works well for storing data from data
acquisition systems (DAS), simulation software, network data monitoring
systems (for example, traffic measurements of IP packets on routers),
for working with very large XML files, or as a centralized repository for
system logs, to name only a few possible uses.
In this release you will find:
- Variable Length Arrays (VLAs) for saving a variable-length
collection of elements in each row of an array.
- Extensible Arrays (EAs) for extending homogeneous
datasets on disk.
- Powerful replication capabilities, ranging from single leaves
up to complete hierarchies.
- With the introduction of the UnImplemented class, greatly
improved HDF5 native file import capabilities.
- Two new useful utilities: ptdump & ptrepack.
- Improved documentation (with the help of Scott Prater).
- New record on data size achieved: 5.6 TB (!) in one single
file.
- Enhanced platform support. New platforms: MacOSX, FreeBSD,
Linux64, IRIX64 (yes, a clean 64-bit port is there) and
probably more.
- More unit tests (now exceeding 800).
- Many other minor improvements.
In more detail:
What's new
-----------
- The new VLArray class enables you to store large lists of rows
containing variable numbers of elements. The elements can
be scalars or fully multidimensional objects, in the PyTables
tradition. This class supports two special objects as rows:
Unicode strings (UTF-8 encoding is used internally) and
generic Python objects (through the use of cPickle).
- The new EArray class allows you to enlarge already existing
multidimensional homogeneous data objects. Consider it
an extension of the already existing Array class, but
with more functionality. Online compression or other filters
can be applied to EArray instances, for example.
Another nice feature of EAs is their support for fully
multidimensional data selection with extended slices. You
can write "earray[1,2:3,...,4:200]", for example, to get the
desired dataset slice from the disk. This is implemented
using the powerful selection capabilities of the HDF5
library, which results in highly efficient I/O
operations. The same functionality has been added to Array
objects as well.
- New UnImplemented class. If a dataset contains unsupported
datatypes, it will be associated with an UnImplemented
instance, then inserted into the object tree as usual.
This allows you to continue to work with supported objects
while retaining access to attributes of unsupported
datasets. This has changed from previous versions, where a
RuntimeError occurred when an unsupported object was
encountered.
The combination of the new UnImplemented class with the
support for new datatypes will enable PyTables to greatly
increase the number of types of native HDF5 files that can
be read and modified.
- Boolean support has been added for all the Leaf objects.
- The Table class now has an append() method that allows you
to save large buffers of data in one go (i.e. bypassing the
Row accessor). This can greatly improve data gathering
speed.
- The standard HDF5 shuffle filter (to further enhance the
compression level) is supported.
- The standard HDF5 fletcher32 checksum filter is supported.
- As the supported number of filters is growing (and may be
further increased in the future), a Filters() class has been
introduced to handle filters more easily. In order to add
support for this class, it was necessary to make a change in
the createTable() method that is not backwards compatible:
the "compress" and "complib" parameters are deprecated now
and the "filters" parameter should be used in their
place. You will be able to continue using the old parameters
(only a Deprecation warning will be issued) for the next few
releases, but you should migrate to the new version as soon
as possible. In general, old code is easy to migrate. A call
such as:
table = fileh.createTable(group, 'table', Test, '',
complevel, complib)
should be replaced by
table = fileh.createTable(group, 'table', Test, '',
Filters(complevel, complib))
- A copy() method that supports slicing and modification of
filtering capabilities has been added for all the Leaf
objects. See the User's Manual for more information.
- A couple of new methods, namely copyFile() and copyChilds(),
have been added to File class, to permit easy replication
of complete hierarchies or sub-hierarchies, even to
other files. You can change filters during the copy
process as well.
- Two new utilities have been added: ptdump and
ptrepack. The utility ptdump allows the user to examine
the contents of PyTables files (both metadata and actual
data). The powerful ptrepack utility lets you
selectively copy (portions of) hierarchies to specific
locations in other files. It can also be used as an
importer for generic HDF5 files.
- The meaning of the stop parameter in read() methods has
changed. Now a value of 'None' means the last row, and a
value of 0 (zero) means the first row. This is more
consistent with the range() function in Python and the
__getitem__() special method in numarray.
- The method Table.removeRows() is no longer limited by table
size. You can now delete rows regardless of the size of the
table.
- The "numarray" value has been added to the flavor parameter
in the Table.read() method for completeness.
- The attributes (.attr instance variable) are Python
properties now. Access to their values is no longer
lazy, i.e. you will be able to see both system and user
attributes from the command line using the tab-completion
capability of your Python console (if enabled).
- Documentation has been greatly improved to explain all the
new functionality. In particular, the internal format of
PyTables is now fully described. You can now build
"native" PyTables files using any generic HDF5 software
by just duplicating their format.
- Many new tests have been added, not only to check new
functionality but also to more stringently check
existing functionality. There are more than 800 different
tests now (and the number is increasing :).
- PyTables has set a new record for the amount of data stored in
one single file: more than 5 TB (yes, more than 5000 GB), which
takes up 11 GB once compressed. The file was created on an AMD
Opteron machine running Linux-64 (the 64-bit version of the
Linux kernel). See the gory details in:
http://pytables.sf.net/html/HowFast.html.
- New platforms supported: PyTables has been compiled and tested
under Linux32 (Intel), Linux64 (AMD Opteron and Alpha), Win32
(Intel), MacOSX (PowerPC), FreeBSD (Intel), Solaris (6, 7, 8
and 9 with UltraSparc), IRIX64 (IRIX 6.5 with R12000) and it
probably works on many more architectures. In particular,
release 0.8 is the first one that provides a relatively clean
port to 64-bit platforms.
- As always, some bugs have been fixed (especially bugs that
occur when deleting and/or overwriting attributes).
- And last, but definitely not least, a new donations section
has been added to the PyTables web site
(http://sourceforge.net/projects/pytables, then follow the
"Donations" tag). If you like PyTables and want this effort
to continue, please, donate!
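
As a side note on the extended-slice notation mentioned for EArray and
Array objects: the sketch below uses plain Python only (no PyTables) to
show what a container actually receives when you write
"earray[1,2:3,...,4:200]". The KeyRecorder class is a hypothetical
stand-in, not part of PyTables, and the dimension length of 100 is an
assumption made for illustration. How PyTables maps such a key onto an
HDF5 selection is not shown here.

```python
class KeyRecorder:
    """Hypothetical stand-in for an array-like object (not a PyTables class)."""

    def __getitem__(self, key):
        # Python passes everything between the brackets as a single key.
        return key


earray = KeyRecorder()
key = earray[1, 2:3, ..., 4:200]

# The key is a tuple: an integer, a slice, Ellipsis and another slice.
for item in key:
    print(repr(item))

# A slice can be clipped to a real dimension length with slice.indices().
# Here the last dimension is assumed (for illustration) to have 100 elements.
start, stop, step = key[-1].indices(100)
```

Ellipsis stands for "all the dimensions in between", so the same
expression keeps working regardless of how many dimensions the array
has.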
What is a table?
----------------
A table is defined as a collection of records whose values are stored
in fixed-length fields. All records have the same structure and all
values in each field have the same data type. "Fixed-length" fields and
"strict data types" may seem strange requirements for a language like
Python that supports dynamic data types, but they serve a useful
function if the goal is to save very
large quantities of data (such as is generated by many scientific
applications, for example) in an efficient manner that reduces demand
on CPU time and I/O resources.
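
To make the fixed-length idea concrete, here is a minimal sketch in
plain Python that uses only the standard struct module. It is not
PyTables code: the record layout (a 16-byte name, a 32-bit integer and
a 64-bit float) and the names RECORD, pack_record and read_record are
hypothetical. Because every record occupies the same number of bytes,
record number i can be located by simple arithmetic, with no scanning.

```python
import struct

# Hypothetical layout: 16-byte name, 32-bit int, 64-bit float.
# "<" means little-endian with no padding, so the size is exactly 16 + 4 + 8.
RECORD = struct.Struct("<16sid")


def pack_record(name, count, value):
    """Serialize one record into exactly RECORD.size bytes."""
    # struct pads short byte strings with NUL bytes up to the field width.
    return RECORD.pack(name.encode("ascii"), count, value)


def read_record(buffer, index):
    """Fetch record number `index` directly from its byte offset."""
    raw_name, count, value = RECORD.unpack_from(buffer, index * RECORD.size)
    return raw_name.rstrip(b"\x00").decode("ascii"), count, value


# Build an in-memory "table" of 1000 records with identical structure.
table = b"".join(
    pack_record("particle%d" % i, i, i * 0.5) for i in range(1000)
)

# Random access: jump straight to record 42 without reading the others.
print(read_record(table, 42))
```

The price of this layout is that a name longer than 16 bytes would be
truncated, which is exactly the kind of trade-off that fixed-length
fields impose.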
What is HDF5?
-------------
For those people who know nothing about HDF5, it is a general
purpose library and file format for storing scientific data, developed at
NCSA. HDF5 can store two primary objects: datasets and groups. A
dataset is essentially a multidimensional array of data elements, and
a group is a structure for organizing objects in an HDF5 file. Using
these two basic constructs, one can create and store almost any kind of
scientific data structure, such as images, arrays of vectors, and
structured and unstructured grids. You can also mix and match them in
HDF5 files according to your needs.
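
The group/dataset model can be pictured with a toy sketch in plain
Python. It does not use the HDF5 library or PyTables; the Group and
Dataset classes and the node names ("detector", "image", "vectors",
"log") are hypothetical and exist only to show how the two constructs
nest into a hierarchy addressed by paths.

```python
class Dataset:
    """Toy stand-in for an HDF5 dataset: a shape plus flat element storage."""

    def __init__(self, shape, data):
        expected = 1
        for dim in shape:
            expected *= dim
        if len(data) != expected:
            raise ValueError("data does not match shape")
        self.shape = tuple(shape)
        self.data = list(data)


class Group:
    """Toy stand-in for an HDF5 group: a named container of child nodes."""

    def __init__(self):
        self.children = {}

    def add(self, name, node):
        self.children[name] = node
        return node

    def walk(self, prefix=""):
        """Yield (path, node) pairs for every node below this group."""
        for name in sorted(self.children):
            node = self.children[name]
            path = prefix + "/" + name
            yield path, node
            if isinstance(node, Group):
                yield from node.walk(path)


# Mix groups and datasets freely, as in an HDF5 file.
root = Group()
detector = root.add("detector", Group())
detector.add("image", Dataset((2, 3), range(6)))
detector.add("vectors", Dataset((2, 2), [0.0, 1.0, 1.0, 0.0]))
root.add("log", Dataset((3,), ["start", "run", "stop"]))

paths = [path for path, _ in root.walk()]
print(paths)
```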
Platforms
---------
I'm using Linux (Intel 32-bit) as the main development platform, but
PyTables should be easy to compile/install on many other UNIX
machines. This package has also passed all the tests on an UltraSparc
platform with Solaris 7 and Solaris 8. It also compiles and passes all
the tests on a SGI Origin2000 with MIPS R12000 processors, with the
MIPSPro compiler and running IRIX 6.5. It also runs fine on Linux 64-bit
platforms, like an AMD Opteron running SuSe Linux Enterprise Server. It
has also been tested on MacOSX platforms (10.2, but it should also work on
newer versions).
Regarding Windows platforms, PyTables has been tested with Windows
2000 and Windows XP (using the Microsoft Visual C compiler), but it
should work with other flavors as well.
An example?
-----------
For online code examples, have a look at
http://pytables.sourceforge.net/html/tut/tutorial1-1.html
and, for the newly introduced Variable Length Arrays:
http://pytables.sourceforge.net/html/tut/vlarray2.html
Web site
--------
Go to the PyTables web site for more details:
http://pytables.sourceforge.net/
Share your experience
---------------------
Let me know of any bugs, suggestions, gripes, kudos, etc. you may
have.
Have fun!
-- Francesc Alted
falted at pytables.org