Is Python Suitable for Large Find & Replace Operations?

Roy Smith roy at panix.com
Mon Jun 13 14:24:09 EDT 2005


rbt  <rbt at athop1.ath.vt.edu> wrote:
>You have many hundred gigabytes of data... possibly even a terabyte or 
>two. Within this data, you have private, sensitive information (US 
>social security numbers) about your company's clients. Your company has 
>generated its own unique ID numbers to replace the social security numbers.
>
>Now, management would like the IT guys to go thru the old data and 
>replace as many SSNs with the new ID numbers as possible.

The first thing I would do is a quick sanity check.  Write a simple
Python script which copies its input to its output with minimal
processing.  Something along the lines of:

import sys

# Read each line from stdin and split it, to approximate the real per-line work.
for line in sys.stdin:
    words = line.split()
    print words

Now, run this on a gig of your data on your system and time it.
Multiply the timing by 1000 (assuming a terabyte of real data), and
now you've got a rough lower bound on how long the real job will take.
If you find this will take weeks of processing time, you might want to
re-think the plan.  Better to discover that now than a few weeks into
the project :-)
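
If it helps, here's a rough sketch of timing that sample pass from inside
the script itself (the names are arbitrary, and the shell's time command
would do just as well):

import sys
import time

start = time.time()
count = 0
for line in sys.stdin:
    # Split each line just to approximate the per-line work of the real job.
    words = line.split()
    count += 1
elapsed = time.time() - start

# Report on stderr so it stays separate from anything written to stdout.
sys.stderr.write("%d lines in %.1f seconds\n" % (count, elapsed))

Feed it a known-size sample and scale the elapsed time up to your real
data volume.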

>You have a tab 
>delimited txt file that maps the SSNs to the new ID numbers. There are 
>500,000 of these number pairs. What is the most efficient way  to 
>approach this? I have done small-scale find and replace programs before, 
>but the scale of this is larger than what I'm accustomed to.

The obvious first thing is to read the SSN/ID map file and build a
dictionary where the keys are the SSNs and the values are the IDs.
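
Something along these lines would do it, assuming the map file really is
one tab-separated SSN/ID pair per line (the file name here is made up):

ssnMap = {}
for line in open("ssn_map.txt"):            # made-up file name
    ssn, newId = line.rstrip().split("\t")  # tab-delimited pair per line
    ssnMap[ssn] = newId

Half a million pairs is nothing for a dict; it will fit in memory with
plenty of room to spare.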

The next step depends on your data.  Is it in text files?  An SQL
database?  Some other format?  One way or another you'll have to read
it in, do something along the lines of "newSSN = ssnMap[oldSSN]", and
write it back out.
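
If it does turn out to be plain text, a filter roughly like this is what
I have in mind.  The 123-45-6789 pattern is only a guess at how the SSNs
are written in your files, the map file name is made up as before, and
numbers that aren't in the map are left alone rather than guessed at:

import re
import sys

# Build the SSN -> new ID map as in the sketch above (made-up file name).
ssnMap = {}
for line in open("ssn_map.txt"):
    ssn, newId = line.rstrip().split("\t")
    ssnMap[ssn] = newId

# A guess at the layout: 123-45-6789.  Adjust the pattern if the data
# stores SSNs without dashes.
ssnPattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def replaceSSN(match):
    oldSSN = match.group(0)
    # Leave numbers that aren't in the map untouched.
    return ssnMap.get(oldSSN, oldSSN)

for line in sys.stdin:
    sys.stdout.write(ssnPattern.sub(replaceSSN, line))

Run it as a pipe over each file and compare the elapsed time against the
estimate from the sanity check above.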


