Is Python Suitable for Large Find & Replace Operations?

Robert Kern rkern at ucsd.edu
Mon Jun 13 14:22:01 EDT 2005


rbt wrote:
> Here's the scenario:
> 
> You have many hundred gigabytes of data... possibly even a terabyte or 
> two. Within this data, you have private, sensitive information (US 
> social security numbers) about your company's clients. Your company has 
> generated its own unique ID numbers to replace the social security numbers.
> 
> Now, management would like the IT guys to go through the old data and 
> replace as many SSNs with the new ID numbers as possible. You have a 
> tab-delimited text file that maps the SSNs to the new ID numbers. There 
> are 500,000 of these number pairs. What is the most efficient way to 
> approach this? I have done small-scale find and replace programs before, 
> but the scale of this is larger than what I'm accustomed to.
> 
> Any suggestions on how to approach this are much appreciated.

I'm just tossing this out, and I don't really have much experience with 
the algorithm, but when searching for many targets at once, an 
Aho-Corasick automaton might be appropriate: it looks for all of the 
targets in a single pass over the text.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
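
To make the idea concrete, here is a minimal pure-Python sketch of an 
Aho-Corasick automaton used for find and replace. It does not use the 
module linked above. The names build_automaton and replace_all are made 
up for this illustration, the SSN and ID values are invented sample 
data, and the mapping is assumed to be already loaded into a dict.

```python
from collections import deque


def build_automaton(mapping):
    """Build an Aho-Corasick automaton from a dict {target: replacement}."""
    goto = [{}]    # goto[state][char] -> next state (state 0 is the root)
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    out = [None]   # out[state] -> longest target ending at this state, if any

    # Phase 1: insert every target into a trie.
    for word in mapping:
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append(None)
            state = nxt
        out[state] = word

    # Phase 2: breadth-first pass to compute the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit a match from the fallback state if this state has none.
            if out[nxt] is None:
                out[nxt] = out[fail[nxt]]
    return goto, fail, out


def replace_all(text, mapping, automaton):
    """Replace non-overlapping occurrences of the targets in one pass."""
    goto, fail, out = automaton
    pieces = []
    last_end = 0   # index just past the previous replacement
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        word = out[state]
        if word is not None:
            start = i - len(word) + 1
            pieces.append(text[last_end:start])
            pieces.append(mapping[word])
            last_end = i + 1
            state = 0  # restart so that replacements never overlap
    pieces.append(text[last_end:])
    return "".join(pieces)


# Invented sample data, standing in for the 500,000-pair mapping file.
mapping = {"123-45-6789": "ID000001", "987-65-4321": "ID000002"}
automaton = build_automaton(mapping)
print(replace_all("Client 123-45-6789 referred 987-65-4321.", mapping, automaton))
# prints: Client ID000001 referred ID000002.
```

For data on the scale described, one would presumably build the 
automaton once and stream each file through it in chunks, taking care 
with matches that straddle a chunk boundary. That bookkeeping is left 
out of the sketch, as is any check that a match is not part of a longer 
digit string.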

Good luck!

-- 
Robert Kern
rkern at ucsd.edu

"In the fields of hell where the grass grows high
  Are the graves of dreams allowed to die."
   -- Richard Harter



