clean data [was: Simple distributed example for learning purposes?]

Tue Dec 29 16:07:24 EST 2009

On 12/29/2009 9:14 AM, Ethan Furman wrote:
> Lie Ryan wrote:
>> On 12/28/2009 11:59 PM, Shawn Milochik wrote:
>>> With address data:
>>> one address may have suite data and the other might not
>>> the same city may have multiple zip codes
>>
>> why is that even a problem? You do put suite data and zipcode into
>> different database fields right?
>
> The issue here is not proper database design, the issue is users -- not
> one user, not two users, but millions of users, with no consistency
> amongst them. They bring you their nice tiny list of 10,000 names and
> addresses and want you to correct/normalize/mail them, and you have to
> be able to break down what they gave you into something usable.
>
> To rephrase, the issue that Shawn is referring to is the huge amount of
> data *already out there*, not brand new data.
>
> ~Ethan~

The way Shawn describes the problem, it sounds like he has a problem
with "searching the database". As I said, given a good index, searching
should never be the problem. To build that good index, you have a
slightly different but easier problem to tackle: parsing. I realize that
parsing inconsistent data isn't as easy as talking about it, but it's
generally much easier than trying to fuzzily grep through the raw data.
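
A rough sketch of what I mean by parsing first and indexing afterwards.
Everything here is made up for illustration: the regex, the field names,
and the helpers (normalize, parse_address, build_index) are assumptions,
and the pattern only handles a US-style "street [suite], city, ST zip"
line. Real data would need many more patterns; anything that fails to
parse lands in a reject list for a human to look at.

```python
import re
from collections import defaultdict

# Hypothetical pattern for "<street> [Suite N], <city>, <ST> <zip>[-plus4]".
# This is an illustration, not a complete address grammar.
ADDRESS_RE = re.compile(
    r"""^\s*(?P<street>[^,]+?)
        (?:[,\s]+(?:suite|ste|apt|unit|\#)\.?\s*(?P<suite>[\w-]+))?
        \s*,\s*(?P<city>[^,]+?)
        \s*,\s*(?P<state>[A-Za-z]{2})
        \s+(?P<zip>\d{5})(?:-\d{4})?\s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def normalize(text):
    """Drop punctuation, upper-case, and collapse runs of whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text).upper().split())


def parse_address(raw):
    """Split one raw line into normalized fields, or return None."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    # A missing suite becomes an empty string, so both kinds of record
    # end up with the same set of fields.
    return {name: normalize(value) if value else ""
            for name, value in match.groupdict().items()}


def build_index(raw_addresses):
    """Index parsed records by (zip, street); collect unparseable lines."""
    index = defaultdict(list)
    rejects = []
    for raw in raw_addresses:
        record = parse_address(raw)
        if record is None:
            rejects.append(raw)
        else:
            index[(record["zip"], record["street"])].append(record)
    return index, rejects


raw = [
    "123 Main St Suite 4, Springfield, IL 62704",
    "123 main st., springfield, il 62704-1234",
    "garbage",
]
index, rejects = build_index(raw)
```

Once the suite and zip code live in their own fields, the two
differently written Springfield lines land under the same key, and the
lookup is an ordinary dictionary (or database index) hit instead of a
fuzzy search over raw text.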