Febrl-0.3 released

Peter Christen peter.christen at anu.edu.au
Fri Apr 8 00:56:08 CEST 2005


Canberra, 7 April 2005

The ANU Data Mining Group is pleased to announce the release of
Febrl 0.3, a prototype open source record linkage, deduplication
and geocoding system intended to make probabilistic record linkage
easier, faster and more accurate for biomedical and other
researchers.

The programs, known collectively as "Febrl" - Freely Extensible
Biomedical Record Linkage - address the data cleaning and
standardisation tasks which are essential first steps for most
record linkage projects, and provide routines for probabilistic
record linkage and record deduplication, as well as geocode
matching based on the Australian G-NAF (Geocoded National Address
File, www.g-naf.com.au) database.

This fifth release, Febrl Version 0.3, has been updated to Python
2.4 (it also runs on Python 2.3). We would like to thank everybody
who sent us bug reports or other comments.

The main features of the current release are:

* Probabilistic and rules-based cleaning and standardisation
  routines for names, addresses, dates and telephone numbers.

* A geocoding matching system based on the Australian G-NAF
  (Geocoded National Address File) database.

* A variety of supplied look-up and frequency tables for names
   and addresses.

* Various comparison functions for names, addresses, dates and
   localities, including approximate string comparisons, phonetic
   encodings, geographical distance comparisons, and time and age
   comparisons. Two new approximate string comparison methods (bag
   distance and compression based) have been added in this release.

* Several blocking (indexing) methods, including the traditional
   compound key blocking used in many record linkage programs.

* Probabilistic record linkage routines based on the classical
   Fellegi and Sunter approach, as well as a 'flexible classifier'
   that allows a flexible definition of the weight calculation.

* Process indicators that give estimations of remaining processing
   times.

* Access methods for fixed format and comma-separated value (CSV)
   text files, as well as SQL databases (MySQL and, new in this
   release, PostgreSQL).

* An efficient temporary data set with direct random access, based
   on the Berkeley database library.

* The option to save linkage and deduplication results into a
   comma-separated value (CSV) text file (new).

* One-to-one assignment procedure for linked record pairs based on
   the 'Auction' algorithm.

* Support for parallelism for higher performance on parallel
   platforms, based on MPI (Message Passing Interface), a standard
   for parallel programming, and Pypar, an efficient and
   easy-to-use module that allows Python programs to run in
   parallel on multiple processors and communicate using MPI.

* A data set generator which allows the creation of data sets of
   randomly generated records (containing names, addresses, dates,
   and phone and identifier numbers), with the possibility to
   include duplicate records with randomly introduced
   modifications. This allows for easy testing and evaluation of
   linkage (deduplication) processes.

* Example project modules and example data sets allowing simple
   running of Febrl projects without any modifications needed.

* An extensive 185-page manual.
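
To give a feel for how three of the features listed above fit
together (blocking, approximate string comparison with bag distance,
and Fellegi and Sunter style weights), here is a minimal sketch in
modern Python 3. It is not Febrl code and does not use the Febrl API:
all function names, the example records, the m and u probabilities
and the 0.75 agreement threshold are assumptions made up for
illustration only.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def bag_distance(s1, s2):
    """Bag distance: treat both strings as multisets of characters and
    take the larger of the two multiset differences."""
    bag1, bag2 = Counter(s1), Counter(s2)
    return max(sum((bag1 - bag2).values()), sum((bag2 - bag1).values()))


def bag_similarity(s1, s2):
    """Normalise the bag distance into a similarity between 0 and 1."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - bag_distance(s1, s2) / longest


def field_weight(agrees, m, u):
    """Fellegi and Sunter style weight for one field, where m is the
    probability of agreement among true matches and u the probability
    of agreement among non-matches."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def candidate_pairs(records, key):
    """Blocking: only records sharing the same blocking key are paired."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


# Hypothetical example records and m/u values (assumptions, not Febrl data).
records = {
    1: {"given": "peter", "surname": "miller", "postcode": "2600"},
    2: {"given": "pete", "surname": "miller", "postcode": "2600"},
    3: {"given": "mary", "surname": "smith", "postcode": "2601"},
}
M_U = {"given": (0.9, 0.1), "surname": (0.95, 0.05)}


def score(rec_a, rec_b, threshold=0.75):
    """Sum the field weights; a field 'agrees' if it is similar enough."""
    total = 0.0
    for field, (m, u) in M_U.items():
        agrees = bag_similarity(rec_a[field], rec_b[field]) >= threshold
        total += field_weight(agrees, m, u)
    return total


# Block on postcode, then score each candidate pair.
pairs = list(candidate_pairs(records, key=lambda r: r["postcode"]))
scores = {pair: score(records[pair[0]], records[pair[1]]) for pair in pairs}
print(scores)
```

Blocking on postcode means record 3 is never compared with records
1 and 2, which is the point of blocking: the number of comparisons
stays manageable at the cost of missing matches whose blocking key
differs. A real project would estimate m and u from data rather than
fix them by hand.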

Febrl, which is written in the free open source Python programming
language, is itself available under a free, open source license,
which we hope will encourage others to contribute to its further
development and support. Contact details, background information,
documentation and, of course, the program code are all available
from the project Web site at

         http://datamining.anu.edu.au/linkage.html

as well as from 'sourceforge.net' at

         http://sourceforge.net/projects/febrl

We would like to stress that the programs are still in the early
stages of development, and we do not yet recommend them for
production use, but we encourage you to try them and to provide us
with feedback.

We particularly welcome bug reports and ideas for future
development. There are many ways to help with the project:
testing, programming and software engineering, documentation and
technical writing, translation, and provision of (anonymous,
non-confidential) training and example data sets.


For the Febrl team,
Peter Christen

=================================================
Dr Peter Christen
Lecturer / Graduate Advisor
Department of Computer Science
Faculty of Engineering and Information Technology
CSIT Building (108), North Road
The Australian National University
Canberra ACT 0200 Australia

T: +61 2 6125 5690
F: +61 2 6125 0010
W: http://cs.anu.edu.au/~Peter.Christen

CRICOS Provider #00120C

