Fuzzy matching of postal addresses

Tim Churches tchur at optushome.com.au
Fri Feb 18 01:43:19 EST 2005


McBooCzech wrote:

>Tim,
>
>do you think Ferbel can parse properly with non English data-sets?
>
The official name for the project is "Febrl" (freely-extensible 
biomedical record linkage) but perhaps "Furball" would be a better name, 
given its focus on fuzziness (if that is not a contradiction in terms). 
My cat could cough up a suitable graphical logo, no doubt.

>I mean do you think it will work properly with data they include non
>English characters as well? As we live in Europe, we have to solve such
>a problems here :)))) If the software needs some changes, I am ready,
>according to your suggestions, to try to do it.
>
The underlying method (hidden Markov models) should work with any kind 
of data. If the non-English characters can be read as normal Python 
strings then it will definitely work. If you need to use Unicode 
strings, it may work, but that is not something we have tested - 
although it is on the TO-DO list. Peter Christen could comment further 
on that (off-line). You would need to supply your own tagging look-up 
tables of course, and bootstrap your own models, as it is unlikely that 
the sample models for Australian addresses will work well for Czech 
addresses. But none of that should be too hard.
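
To make the two pieces mentioned above a bit more concrete (a tagging 
look-up table, and a hidden Markov model that assigns address fields to 
the resulting tags), here is a minimal sketch. It is not Febrl's code or 
API: the table entries, tag names, state names and all probabilities are 
made up for illustration, and a real model would be trained 
(bootstrapped) from your own data. It also assumes Python 3, where every 
str is a Unicode string.

```python
import unicodedata

# Hypothetical tagging look-up table for Czech addresses (illustrative only).
TAG_TABLE = {
    unicodedata.normalize("NFC", word): tag
    for word, tag in {
        "ulice": "WT",     # wayfare type: street
        "náměstí": "WT",   # wayfare type: square
        "praha": "LOC",    # locality
        "brno": "LOC",
    }.items()
}

def tag_tokens(address):
    """Normalise, lower-case and split an address, then tag each token."""
    text = unicodedata.normalize("NFC", address).lower().replace(",", " ")
    tokens = text.split()
    tags = []
    for tok in tokens:
        if tok in TAG_TABLE:
            tags.append(TAG_TABLE[tok])
        elif tok.isdigit():
            tags.append("NUM")
        else:
            tags.append("UNK")   # unknown word, e.g. a street name
    return tokens, tags

# Hand-set toy HMM; these numbers are assumptions, not trained values.
STATES = ["wayfare_name", "wayfare_type", "number", "locality"]
START = {"wayfare_name": 0.4, "wayfare_type": 0.4, "number": 0.1, "locality": 0.1}
TRANS = {
    "wayfare_name": {"wayfare_name": 0.2, "wayfare_type": 0.3, "number": 0.4, "locality": 0.1},
    "wayfare_type": {"wayfare_name": 0.5, "wayfare_type": 0.05, "number": 0.35, "locality": 0.1},
    "number":       {"wayfare_name": 0.1, "wayfare_type": 0.1, "number": 0.1, "locality": 0.7},
    "locality":     {"wayfare_name": 0.1, "wayfare_type": 0.1, "number": 0.1, "locality": 0.7},
}
EMIT = {
    "wayfare_name": {"WT": 0.05, "UNK": 0.8, "NUM": 0.05, "LOC": 0.1},
    "wayfare_type": {"WT": 0.9, "UNK": 0.05, "NUM": 0.025, "LOC": 0.025},
    "number":       {"WT": 0.01, "UNK": 0.04, "NUM": 0.94, "LOC": 0.01},
    "locality":     {"WT": 0.02, "UNK": 0.28, "NUM": 0.02, "LOC": 0.68},
}

def viterbi(observations):
    """Return the most probable state sequence for a list of tags."""
    if not observations:
        return []
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (START[s] * EMIT[s][observations[0]], None) for s in STATES}]
    for obs in observations[1:]:
        column = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p][0] * TRANS[p][s])
            column[s] = (best[-1][prev][0] * TRANS[prev][s] * EMIT[s][obs], prev)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path

tokens, tags = tag_tokens("Náměstí Míru 12, Praha")
fields = viterbi(tags)
print(list(zip(tokens, tags, fields)))
```

With these toy numbers the tags for the sample address come out as WT, 
UNK, NUM, LOC and the model labels them wayfare_type, wayfare_name, 
number, locality. The point is only that the look-up table is the 
language-specific part; the HMM itself sees tags, not characters.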

Tim C



