Is Python Suitable for Large Find & Replace Operations?

rbt rbt at athop1.ath.vt.edu
Thu Jun 16 09:27:37 EDT 2005


On Tue, 2005-06-14 at 11:34 +1000, John Machin wrote:
> rbt wrote:
> > Here's the scenario:
> > 
> > You have many hundred gigabytes of data... possibly even a terabyte or
> > two. Within this data, you have private, sensitive information (US 
> > social security numbers) about your company's clients. Your company has 
> > generated its own unique ID numbers to replace the social security numbers.
> > 
> > Now, management would like the IT guys to go thru the old data and 
> > replace as many SSNs with the new ID numbers as possible.
> 
> This question is grossly OT; it's nothing at all to do with Python. 
> However ....
> 
> (0) Is this homework?

No, it is not.

> 
> (1) What to do with an SSN that's not in the map?

Leave it be.

> 
> (2) How will a user of the system tell the difference between "new ID 
> numbers" and SSNs?

We have documentation. The two sets of numbers (SSNs and new_ids) are
exclusive of each other.

> 
> (3) Has the company really been using the SSN as a customer ID instead 
> of an account number, or have they been merely recording the SSN as a 
> data item? Will the "new ID numbers" be used in communication with the 
> clients? Will they be advised of the new numbers? How will you handle 
> the inevitable cases where the advice doesn't get through?

My task is purely technical.

> 
> (4) Under what circumstances will it not be possible to replace *ALL* 
> the SSNs?

I do not understand this question.

> 
> (5) For how long can the data be off-line while it's being transformed?

The data is on file servers that are unused on weekends and nights.

> 
> 
> > You have a tab 
> > delimited txt file that maps the SSNs to the new ID numbers. There are 
> > 500,000 of these number pairs.
> 
> And what is the source of the SSNs in this file??? Have they been 
> extracted from the data? How?

That is irrelevant.

> 
> > What is the most efficient way to
> > approach this? I have done small-scale find and replace programs before, 
> > but the scale of this is larger than what I'm accustomed to.
> > 
> > Any suggestions on how to approach this are much appreciated.
> 
> A sensible answer will depend on how the data is structured:
> 
> 1. If it's in a database with tables some of which have a column for 
> SSN, then there's a solution involving SQL.
> 
> 2. If it's in semi-free-text files where the SSNs are marked somehow:
> 
> ---client header---
> surname: Doe first: John initial: Q SSN:123456789 blah blah
> or
> <ssn>123456789</ssn>
> 
> then there's another solution which involves finding the markers ...
> 
> 3. If it's really free text, like
> """
> File note: Today John Q. Doe telephoned to advise that his Social 
> Security # is 123456789  not 987654321 (which is his wife's) and the soc 
> sec numbers of his kids Bob & Carol are ....
> """
> then you might be in some difficulty ... google("TREC")
> 
> 
> AND however you do it, you need to be very aware of the possibility 
> (especially with really free text) of changing some string of digits 
> that's NOT an SSN.

That's possible, but I think not probable.
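
For the free-text case, one way to keep that risk down is to replace only
standalone nine-digit runs that actually appear in the map. What follows is
a minimal sketch, not a tested solution. It assumes the SSNs are stored as
nine consecutive digits (as in the examples quoted above) and that the map
file holds one "ssn<TAB>new_id" pair per line. The names load_map,
replace_ssns and convert are made up for illustration.

```python
import re

# Assumption: an SSN is exactly nine consecutive digits with no separators.
# The lookarounds reject nine-digit windows inside a longer run of digits.
SSN_RE = re.compile(r"(?<!\d)\d{9}(?!\d)")


def load_map(lines):
    """Build {ssn: new_id} from tab-delimited 'ssn<TAB>new_id' lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        ssn, new_id = line.split("\t")
        mapping[ssn] = new_id
    return mapping


def replace_ssns(text, mapping):
    """Swap mapped SSNs for new IDs; anything not in the map is left alone."""
    return SSN_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


def convert(src, dst, mapping):
    """Copy src to dst one line at a time, so memory use stays flat."""
    for line in src:
        dst.write(replace_ssns(line, mapping))
```

The 500,000 pairs fit comfortably in a dict, so each candidate costs one
lookup, and each file is read once regardless of the size of the map. A
nine-digit string that is not an SSN is only changed if it happens to
collide with a key in the map. SSNs written with dashes or spaces would
need a looser pattern, which this sketch does not attempt.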
