Is Python Suitable for Large Find & Replace Operations?
rbt
rbt at athop1.ath.vt.edu
Thu Jun 16 09:27:37 EDT 2005
On Tue, 2005-06-14 at 11:34 +1000, John Machin wrote:
> rbt wrote:
> > Here's the scenario:
> >
> > You have many hundreds of gigabytes of data... possibly even a terabyte or
> > two. Within this data, you have private, sensitive information (US
> > social security numbers) about your company's clients. Your company has
> > generated its own unique ID numbers to replace the social security numbers.
> >
> > Now, management would like the IT guys to go thru the old data and
> > replace as many SSNs with the new ID numbers as possible.
>
> This question is grossly OT; it's nothing at all to do with Python.
> However ....
>
> (0) Is this homework?
No, it is not.
>
> (1) What to do with an SSN that's not in the map?
Leave it be.
>
> (2) How will a user of the system tell the difference between "new ID
> numbers" and SSNs?
We have documentation. The two sets of numbers (SSNs and new_ids) are
exclusive of each other.
>
> (3) Has the company really been using the SSN as a customer ID instead
> of an account number, or have they been merely recording the SSN as a
> data item? Will the "new ID numbers" be used in communication with the
> clients? Will they be advised of the new numbers? How will you handle
> the inevitable cases where the advice doesn't get through?
My task is purely technical.
>
> (4) Under what circumstances will it not be possible to replace *ALL*
> the SSNs?
I do not understand this question.
>
> (5) For how long can the data be off-line while it's being transformed?
The data is on file servers that are unused on weekends and nights.
>
>
> > You have a tab
> > delimited txt file that maps the SSNs to the new ID numbers. There are
> > 500,000 of these number pairs.
>
> And what is the source of the SSNs in this file??? Have they been
> extracted from the data? How?
That is irrelevant.
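
For reference, 500,000 pairs fit comfortably in an in-memory dict. A
minimal sketch of loading the map, assuming each line is exactly
"SSN<TAB>new_id" with no header row (the function name load_id_map and
that layout are my own assumptions, not a description of the real file):

```python
import csv

def load_id_map(path):
    """Return a dict mapping SSN strings to new ID strings.

    Assumes one "SSN<TAB>new_id" pair per line and no header row.
    """
    id_map = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            ssn, new_id = row[0].strip(), row[1].strip()
            id_map[ssn] = new_id
    return id_map
```

Keeping both columns as strings avoids losing leading zeros, and a dict
lookup per candidate costs about the same regardless of how many pairs
are loaded.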
>
> > What is the most efficient way to
> > approach this? I have done small-scale find and replace programs before,
> > but the scale of this is larger than what I'm accustomed to.
> >
> > Any suggestions on how to approach this are much appreciated.
>
> A sensible answer will depend on how the data is structured:
>
> 1. If it's in a database with tables some of which have a column for
> SSN, then there's a solution involving SQL.
>
> 2. If it's in semi-free-text files where the SSNs are marked somehow:
>
> ---client header---
> surname: Doe first: John initial: Q SSN:123456789 blah blah
> or
> <ssn>123456789</ssn>
>
> then there's another solution which involves finding the markers ...
>
> 3. If it's really free text, like
> """
> File note: Today John Q. Doe telephoned to advise that his Social
> Security # is 123456789 not 987654321 (which is his wife's) and the soc
> sec numbers of his kids Bob & Carol are ....
> """
> then you might be in some difficulty ... google("TREC")
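
For case 2, a rough sketch of the marker approach, assuming the only
markers are the two shown above ("SSN:" and "<ssn>"); the real data may
well use others. The pattern and the function name replace_marked are
hypothetical. SSNs missing from the map are left as they are:

```python
import re

# Assumed markers, copied from the two examples above; adjust to the data.
# (?!\d) stops the match from taking the first 9 digits of a longer number.
MARKED_SSN = re.compile(r"(SSN:\s*|<ssn>)(\d{9})(?!\d)")

def replace_marked(text, id_map):
    """Replace 9-digit numbers that directly follow a known marker."""
    def swap(match):
        prefix, ssn = match.group(1), match.group(2)
        # Fall back to the original number when it is not in the map.
        return prefix + id_map.get(ssn, ssn)
    return MARKED_SSN.sub(swap, text)
```

Digits that do not follow a marker are never touched, which is the main
attraction of this case over free text.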
>
>
> AND however you do it, you need to be very aware of the possibility
> (especially with really free text) of changing some string of digits
> that's NOT an SSN.
That's possible, but I don't think it's probable.
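
One way to reduce that risk in free text is to match only runs of
exactly nine digits and to replace only those found in the map. A
minimal sketch, assuming the SSNs are stored as nine bare digits (not
123-45-6789) and that the files can be processed line by line; the names
replace_line and rewrite_file are made up for illustration:

```python
import re

# Exactly nine digits, not embedded in a longer run of digits.
NINE_DIGITS = re.compile(r"(?<!\d)\d{9}(?!\d)")

def replace_line(line, id_map):
    """Swap mapped 9-digit numbers; leave everything else unchanged."""
    return NINE_DIGITS.sub(lambda m: id_map.get(m.group(0), m.group(0)), line)

def rewrite_file(src_path, dst_path, id_map):
    """Stream src to dst one line at a time so memory use stays flat."""
    # latin-1 maps every byte to a character, so unknown encodings
    # round-trip unchanged; newline="" preserves the original line endings.
    with open(src_path, encoding="latin-1", newline="") as src, \
         open(dst_path, "w", encoding="latin-1", newline="") as dst:
        for line in src:
            dst.write(replace_line(line, id_map))
```

This does not rule out a nine-digit number that happens to equal a
mapped SSN but means something else; it only guards against matching
inside longer numbers. Writing to a separate output file rather than
editing in place keeps the originals available for comparison.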