Simple distributed example for learning purposes?

Mon Dec 28 07:59:54 EST 2009

On Dec 27, 2009, at 1:23 PM, Lie Ryan wrote:
> 
> IMHO, that's a poor example. Rather than writing a fuzzy search algorithm, it's easier to write a normalizer before entering data to the index (or before comparing the search string with the corpus' string).

It does seem like that at first, but it turns out that you can't normalize this data, for many reasons.

With address data:
	one address may have suite data and the other might not
	the same city may have multiple zip codes
	incoming addresses may be missing information
	typos are common
	sometimes "Route 35" is the same road as "Convery Boulevard"
	etc. etc. etc.
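
To make the address case concrete, here is a minimal sketch using only the standard library's difflib. The field names (street, suite, city, zip), the sample records, and the helper names are all made up for illustration; a real matcher would be far more involved. It scores two records field by field and skips any field that is missing on either side, so absent suite data doesn't count against a match:

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Return a 0..1 similarity, or None if either value is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def address_score(rec1, rec2, fields=("street", "suite", "city", "zip")):
    """Average the similarity over the fields present in both records."""
    scores = []
    for name in fields:
        s = field_similarity(rec1.get(name), rec2.get(name))
        if s is not None:
            scores.append(s)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical records: one has a suite, the other has a typo and a different zip.
a = {"street": "120 Main Street", "suite": "Suite 4",
     "city": "Springfield", "zip": "12345"}
b = {"street": "120 Main Stret", "city": "Springfield", "zip": "12346"}

print(address_score(a, b))
print(field_similarity("Route 35", "Convery Boulevard"))
```

The first score comes out high despite the typo, the missing suite, and the differing zip. The second comes out low, because string similarity can't know that two road names refer to the same road; that needs an alias table or some other outside data, which is part of why a normalizer alone doesn't get you there.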

With names:
	you have to compare with and without the middle name
	compare with and without the title (Mrs., Dr., Mr., Ms.)
	compare with and without the suffix (PhD., Sr., Junior, III, etc.)
	typos are VERY common
	what if John Henry Smith goes by "Henry Smith"?
	what if Xu Wang goes by "John Wang"? (happens all the time)
	maiden name versus married name
	etc. etc. etc.
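
A rough sketch of the "with and without" comparisons for names, again using only difflib. The title and suffix lists and the function names are assumptions for illustration, not a complete or production approach. It generates a few plausible forms of each name and takes the best similarity across all pairs:

```python
from difflib import SequenceMatcher

# Illustrative lists only; real data needs much longer ones.
TITLES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"phd", "sr", "jr", "junior", "ii", "iii"}

def tokens(name):
    """Lowercase the name and strip periods/commas from each word."""
    return [t.strip(".,").lower() for t in name.split() if t.strip(".,")]

def variants(name):
    """Return forms of a name without title/suffix, with and without middle names."""
    parts = tokens(name)
    if parts and parts[0] in TITLES:
        parts = parts[1:]
    while parts and parts[-1] in SUFFIXES:
        parts = parts[:-1]
    forms = {" ".join(parts)}
    if len(parts) > 2:
        forms.add(parts[0] + " " + parts[-1])  # first + last, middle dropped
        forms.add(" ".join(parts[1:]))         # person goes by the middle name
    return forms

def name_score(a, b):
    """Best similarity over every pair of variants of the two names."""
    return max(SequenceMatcher(None, x, y).ratio()
               for x in variants(a) for y in variants(b))

print(name_score("Dr. John Henry Smith, PhD.", "Henry Smith"))
print(name_score("Jon Smith", "John Smith"))
print(name_score("Xu Wang", "John Wang"))
```

This handles the title, suffix, middle-name, and typo cases reasonably well, but the last pair still scores poorly. Adopted first names and maiden versus married names can't be recovered from the strings themselves; you need other evidence (address, date of birth, and so on) to link those records.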

This is a major, real-world problem that remains unsolved, and companies that do a decent job at it make millions of dollars a year from their clients. One of my former employers made tens of millions a year (and growing FAST) in the medical industry alone.

Shawn