Python's re module and genealogy problem

Dan Sommers dan at tombstonezero.net
Sat Jun 14 01:14:50 EDT 2014


On Fri, 13 Jun 2014 17:17:06 +0200, BrJohan wrote:

> Or to put the namevariants in some sequence of sets having elements
> like:  ("Kristina", "Christina", "Cristine", "Kristine")

> Matching is then just applying the 'in' operator.

That's definitely a better approach, for the reasons you mentioned.
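
For what it's worth, here is a minimal sketch of that set-based idea.
The names NAME_VARIANTS and same_name are made up for illustration, and
the second group is a hypothetical example that is not from the thread:

```python
# Each frozenset holds spellings that should be treated as one name.
# The first group is the one quoted above; the second is hypothetical.
NAME_VARIANTS = [
    frozenset({"Kristina", "Christina", "Cristine", "Kristine"}),
    frozenset({"Johan", "Johann", "John"}),
]


def same_name(a, b, groups=NAME_VARIANTS):
    """Return True if a and b are identical or share a variant group."""
    if a == b:
        return True
    # Matching is just the 'in' operator applied to each group.
    return any(a in group and b in group for group in groups)
```

This does a linear scan over the groups, which is fine for a short
list; a dict mapping each spelling to its group would scale better.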

> Comments?

A Soundex (or similar phonetic) algorithm will be better in the long run
for the less common but more often misspelled names.  It's fairly simple to
guess at a number of common spellings for names that *you* think are
common now, but what about names that run in families that aren't yours,
or aren't that common outside of that family, or were wildly popular a
couple of hundred years ago but have fallen out of favor now?
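
To make that concrete, here is a rough sketch of the classic American
Soundex rules as I understand them (keep the first letter, map
consonants to digits, collapse adjacent equal codes, pad to four
characters).  The function name soundex and the choice to silently drop
non-ASCII letters are assumptions of this sketch, not a standard API:

```python
# Build the letter-to-digit table used by American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return a four-character Soundex code, or "" if name has no A-Z letters."""
    # Assumption: anything outside A-Z (after upper-casing) is ignored.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = _CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            result.append(code)
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return "".join(result)[:4].ljust(4, "0")
```

With this sketch, "Kristina" and "Kristine" get the same code, but
"Kristina" and "Christina" do not, because Soundex keeps the first
letter as-is.  That limitation is one reason to look at the "or
similar" algorithms as well.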

My wife's ancestors (she's the genealogist, I just get to hear the
horror stories) are notorious for being somewhat illiterate; for
changing their names, on purpose, after a feud, in order to "distance"
themselves from their relatives; and also for using not-common-now (or
even not-so-common-then) names.  Add in somewhat illiterate record
keepers and hospital workers (or midwives or neighbors), not to mention
bad copies of bad copies of centuries-old smudged documents, and you
have an instant soup of names that sound alike but are spelled
differently in ways you cannot guess ahead of time.

Your users will appreciate *some* sort of fuzzy matching, or runtime
extensibility, atop the "obvious" spellings you take the time to include
in your software.  And that's *not* a comment on your abilities; it's a
comment on the abilities and creativity of their ancestors.
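
As one possible shape for "fuzzy matching plus runtime extensibility,"
here is a small sketch built on the standard library's
difflib.get_close_matches.  The NameIndex class, its method names, and
the 0.75 cutoff are all assumptions of mine, chosen only for
illustration:

```python
import difflib


class NameIndex:
    """A set of known spellings that users can extend at runtime."""

    def __init__(self, names=()):
        self._names = set(names)

    def add(self, name):
        """Register a spelling the software did not ship with."""
        self._names.add(name)

    def candidates(self, query, cutoff=0.75, limit=5):
        """Return known names similar to query, best match first."""
        # Compare case-insensitively, but return the stored spelling.
        lowered = {n.lower(): n for n in self._names}
        hits = difflib.get_close_matches(
            query.lower(), list(lowered), n=limit, cutoff=cutoff)
        return [lowered[h] for h in hits]
```

Usage might look like this:

```python
index = NameIndex(["Kristina", "Johan"])
index.add("Cristine")                    # added by the user at runtime
matches = index.candidates("Christina")  # similar known spellings
```

The cutoff is a judgment call: lower values catch more creative
spellings at the cost of more false matches, so it is probably worth
exposing to the user rather than hard-coding.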

Dan


