ANN: Febrl-0.2.1

Peter Christen peter.christen@anu.edu.au
Thu, 26 Jun 2003 16:56:36 +1000


The ANU Data Mining Group is pleased to announce the release of
Febrl 0.2.1, prototype software intended to make probabilistic
record linkage easier, faster and more accurate for biomedical and
other researchers.

The programs, known collectively as "Febrl" - Freely Extensible
Biomedical Record Linkage - address the data cleaning and
standardisation tasks which are essential first steps for most
record linkage projects, and provide routines for probabilistic
record linkage and record deduplication.

Since its initial release (Version 0.1) the Febrl system has
undergone a major redesign resulting in an object-oriented approach
which allows easier configuration and is more extensible.

The main features of Febrl Version 0.2.1 are:

- Probabilistic and rules-based cleaning and standardisation
   routines for names, addresses and dates.

- A variety of supplied look-up and frequency tables for names and
   addresses.

- Various comparison functions for names, addresses, dates and
   localities, including approximate string comparisons, phonetic
   encodings, geographical distance comparisons, and time and age
   comparisons.

- Several blocking (indexing) methods, including the traditional
   compound key blocking used in many record linkage programs.

- Probabilistic record linkage routines based on the classical
   Fellegi and Sunter approach, as well as a 'flexible classifier'
   that lets users define how the matching weights are calculated.

- Process indicators that give estimations of remaining processing
   times.

- Access methods for fixed format and comma-separated value (CSV)
   text files, as well as SQL databases.

- An efficient temporary data set with direct random access, based
   on the Berkeley database library.

- One-to-one assignment procedure for linked record pairs based on
   the 'Auction' algorithm.

- Parallel processing support, giving higher performance on parallel
   platforms. It is based on MPI (Message Passing Interface), a
   standard for parallel programming, and Pypar, an efficient and
   easy-to-use module that allows Python programs to run in parallel
   on multiple processors and communicate using MPI.

- A database generator which creates data sets of randomly generated
   records (containing names, addresses and dates), optionally
   including duplicate records with randomly introduced
   modifications. This allows for easy testing and evaluation of
   linkage (deduplication) processes.

- Example project modules and example data sets that allow Febrl
   projects to be run without any modifications.

- An extensive 136-page manual.
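
For readers new to the techniques listed above, compound key blocking
and Fellegi and Sunter style weights can be illustrated with a minimal
Python sketch. This is not Febrl's API: the function names, the field
names, the choice of blocking key and the m- and u-probabilities are
all assumptions made up for this illustration.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical per-field probabilities (illustrative values, not from Febrl):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "postcode": (0.85, 0.10),
}


def blocking_key(record):
    """Compound key: first letter of the surname plus the postcode (assumed choice)."""
    return record["surname"][:1].lower() + record["postcode"]


def candidate_pairs(records):
    """Yield index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    for members in blocks.values():
        # Only records inside the same block are compared with each other.
        yield from combinations(members, 2)


def match_weight(rec_a, rec_b):
    """Sum of log2 agreement/disagreement weights over all fields (exact comparison)."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


# Made-up example records.
records = [
    {"surname": "smith", "given_name": "john", "postcode": "2600"},
    {"surname": "smith", "given_name": "jon", "postcode": "2600"},
    {"surname": "miller", "given_name": "anna", "postcode": "2913"},
]

for i, j in candidate_pairs(records):
    print(i, j, round(match_weight(records[i], records[j]), 2))
```

In this sketch only the two 'smith' records fall into the same block, so
only that pair is compared. A real system would replace the exact
equality test with approximate string and phonetic comparisons, and
would classify pairs by comparing the summed weight against thresholds.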

Febrl, which is written in the free, open source Python programming
language, is itself available under a free, open source license, which
we hope will encourage others to contribute to its further development
and support. Contact details, background information, documentation
and, of course, the program code are all available from the project
Web site at

         http://datamining.anu.edu.au/linkage.html

as well as from 'sourceforge.net' at

         http://sourceforge.net/projects/febrl

We would like to stress that the programs are still in the early
stages of development, and we do not yet recommend them for production
use, but we encourage you to try them and to provide us with feedback.

We particularly welcome bug reports and ideas for future development.
There are many ways to help with the project: testing, programming and
software engineering, testing, documentation and technical writing,
testing, translation, testing, provision of (anonymous, non-confidential)
training and example data sets, and testing (did we mention that already?).

We look forward to hearing from you.

Peter Christen and Tim Churches
Principal Developers of Febrl