Python's re module and genealogy problem

Michael Torrie torriem at gmail.com
Wed Jun 11 11:34:56 EDT 2014


On 06/11/2014 06:23 AM, BrJohan wrote:
> For some genealogical purposes I consider using Python's re module.
> 
> Rather many names can be spelled in a number of similar ways, and in 
> order to match names even if they are spelled differently, I will build 
> regular expressions, each of which is supposed to match  a number of 
> similar names.
> 
> I guess that there will be a few hundred such regular expressions 
> covering most popular names.
> 
> Now, my problem: Is there a way to decide whether any two - or more - of 
> those regular expressions will match the same string?
> 
> Or, stated a little differently:
> 
> Can it, for a pair of regular expressions be decided whether at least 
> one string matching both of those regular expressions, can be constructed?
> 
> If it is possible to make such a decision, then how? Anyone aware of an 
> algorithm for this?

You might want to search for fuzzy matching algorithms. Years ago, there
was an algorithm called soundex that would generate fuzzy fingerprints
for words that would hide differences in spelling, etc.  Unfortunately
such an algorithm would be language dependent.  The problem you are
trying to solve is one of those very hard problems in computers and math.




More information about the Python-list mailing list