Is Python Suitable for Large Find & Replace Operations?

Thu Jun 16 09:21:42 EDT 2005

On Tue, 2005-06-14 at 19:51 +0200, Gilles Lenfant wrote:
> rbt wrote:
> > Here's the scenario:
> > 
> > You have many hundred gigabytes of data... possibly even a terabyte or 
> > two. Within this data, you have private, sensitive information (US 
> > social security numbers) about your company's clients. Your company has 
> > generated its own unique ID numbers to replace the social security numbers.
> > 
> > Now, management would like the IT guys to go through the old data and 
> > replace as many SSNs with the new ID numbers as possible. You have a tab 
> > delimited txt file that maps the SSNs to the new ID numbers. There are 
> > 500,000 of these number pairs. What is the most efficient way to 
> > approach this? I have done small-scale find and replace programs before, 
> > but the scale of this is larger than what I'm accustomed to.
> > 
> > Any suggestions on how to approach this are much appreciated.
> 
> Is this huge amount of data to search/replace stored in an RDBMS or in 
> flat file(s) with markup (XML, CSV, ...)?
> 
> --
> Gilles

I apologize that it has taken me so long to respond. I had a hard drive
crash that I am still recovering from. If emails to
rbt at athop1.ath.vt.edu bounced, that is the reason why.

The data is in files, mostly Word documents and Excel spreadsheets. The
SSN map I have is a plain text file with a format like this:

ssn-xx-xxxx     new-id-xxxx
ssn-xx-xxxx     new-id-xxxx
etc.
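
A map in this format could be read into a Python dictionary along the
following lines. This is only a minimal sketch: it assumes one pair per
line with the two fields separated by whitespace (a tab, per the earlier
description), and the function name load_ssn_map and the sample values
are made up for illustration.

```python
import io

def load_ssn_map(lines):
    """Build a dict mapping each SSN string to its replacement ID.

    `lines` is any iterable of text lines, such as an open file.
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        ssn, new_id = fields
        mapping[ssn] = new_id
    return mapping

# Hypothetical sample data standing in for the real map file.
sample = io.StringIO("123-45-6789\tID-0001\n987-65-4321\tID-0002\n\n")
ssn_map = load_ssn_map(sample)
```

With the real file, the same function could be handed an open file
object instead of the in-memory sample.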

There are slightly more than 500,000 of these pairs.
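
With that many pairs, one possible approach is to scan the text once for
anything shaped like an SSN and look each match up in a dictionary,
rather than running a separate search for every pair. The sketch below
assumes SSNs appear as ddd-dd-dddd and works on plain text only; getting
text out of Word and Excel files is a separate problem it does not
address. The names SSN_PATTERN and replace_ssns, and the sample values,
are hypothetical.

```python
import re

# Assumed SSN shape: three digits, two digits, four digits, dash-separated.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def replace_ssns(text, ssn_map):
    """Replace every mapped SSN in `text`; leave unmapped matches unchanged."""
    def substitute(match):
        ssn = match.group(0)
        # Fall back to the original string when the SSN is not in the map.
        return ssn_map.get(ssn, ssn)
    return SSN_PATTERN.sub(substitute, text)

# Hypothetical one-entry map and sample text.
ssn_map = {"123-45-6789": "ID-0001"}
result = replace_ssns("Client 123-45-6789, other 111-22-3333.", ssn_map)
```

Each candidate costs a single dictionary lookup, so the work grows with
the amount of text rather than with the number of pairs.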

Thank you,
rbt