clean data [was: Simple distributed example for learning purposes?]

Ethan Furman ethan at stoneleaf.us
Mon Dec 28 17:14:09 EST 2009


Lie Ryan wrote:
> On 12/28/2009 11:59 PM, Shawn Milochik wrote:
>> With address data:
>>     one address may have suite data and the other might not
>>     the same city may have multiple zip codes
> 
> why is that even a problem?  You do put suite data and zipcode into 
> different database fields right?

The issue here is not proper database design; the issue is users -- not 
one user, not two users, but millions of users, with no consistency 
amongst them.  They bring you their nice tiny list of 10,000 names and 
addresses and want you to correct/normalize/mail them, and you have to 
be able to break down what they gave you into something usable.

To rephrase, the issue that Shawn is referring to is the huge amount of 
data *already out there*, not brand new data.
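
As a rough illustration of what "breaking down" such data can involve, here 
is a minimal sketch that pulls suite and ZIP data out of a free-form address 
line.  The function name split_address, both regular expressions, and the 
sample addresses are assumptions made up for this example; this is not a 
method anyone in the thread described, and real-world lists need far more 
rules than this.

```python
import re

# Hypothetical patterns, for illustration only.
# A unit designator (Suite, Ste, Apt, Unit) or "#", followed by an identifier.
SUITE_RE = re.compile(
    r"\b(?:suite|ste|apt|unit)\b\.?\s*([\w-]+)|#\s*([\w-]+)",
    re.IGNORECASE,
)
# A 5-digit ZIP code, optionally followed by -NNNN, at the end of the line.
ZIP_RE = re.compile(r"\b(\d{5})(?:-(\d{4}))?\s*$")


def split_address(raw):
    """Split one free-form address line into rough parts.

    Returns a dict with the suite, ZIP, and ZIP+4 fields (None when absent)
    and whatever is left over in "rest".
    """
    text = " ".join(raw.split())  # collapse runs of whitespace

    zip_code = plus4 = None
    m = ZIP_RE.search(text)
    if m:
        zip_code, plus4 = m.group(1), m.group(2)
        text = text[:m.start()].rstrip(" ,")

    suite = None
    m = SUITE_RE.search(text)
    if m:
        suite = m.group(1) or m.group(2)
        # Cut the suite out, then tidy the spacing and commas left behind.
        text = text[:m.start()] + text[m.end():]
        text = re.sub(r"\s+,", ",", text)
        text = " ".join(text.split()).strip(" ,")

    return {"rest": text, "suite": suite, "zip": zip_code, "plus4": plus4}


if __name__ == "__main__":
    samples = [
        "123 Main St Suite 200, Springfield, IL 62704",
        "456 Oak Ave, Springfield IL 62701",
        "  789 elm st #4B  springfield il 62704-1234 ",
    ]
    for line in samples:
        print(split_address(line))
```

Even these three made-up lines differ in capitalization, punctuation, 
spacing, and whether a suite is present at all, and the sketch still leaves 
street, city, and state lumped together in "rest".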

~Ethan~


