Is Python Suitable for Large Find & Replace Operations?

John Machin sjmachin at lexicon.net
Thu Jun 16 22:33:20 EDT 2005


rbt wrote:
> On Tue, 2005-06-14 at 11:34 +1000, John Machin wrote:
> 
>>rbt wrote:
>>
>>>Here's the scenario:
>>>
>>>You have many hundred gigabytes of data... possibly even a terabyte or 
>>>two. Within this data, you have private, sensitive information (US 
>>>social security numbers) about your company's clients. Your company has 
>>>generated its own unique ID numbers to replace the social security numbers.
>>>
>>>Now, management would like the IT guys to go thru the old data and 
>>>replace as many SSNs with the new ID numbers as possible.
>>
>>This question is grossly OT; it's nothing at all to do with Python. 
>>However ....
>>
>>(0) Is this homework?
> 
> 
> No, it is not.
> 
> 
>>(1) What to do with an SSN that's not in the map?
> 
> 
> Leave it be.
> 
> 
>>(2) How will a user of the system tell the difference between "new ID 
>>numbers" and SSNs?
> 
> 
> We have documentation. The two sets of numbers (SSNs and new_ids) are
> exclusive of each other.
> 
> 
>>(3) Has the company really been using the SSN as a customer ID instead 
>>of an account number, or have they been merely recording the SSN as a 
>>data item? Will the "new ID numbers" be used in communication with the 
>>clients? Will they be advised of the new numbers? How will you handle 
>>the inevitable cases where the advice doesn't get through?
> 
> 
> My task is purely technical.

OK then, let's ignore the fact that the data is in a collection of Word 
& Excel files, and let's ignore the scale for the moment. Let's assume 
there are only 100 very plain text files to process, and only 1000 SSNs 
in your map, so it doesn't have to be very efficient.

Can you please write a few lines of Python that would define your task 
-- assume you have a line of text from an input file, show how you would 
determine that it needed to be changed, and how you would change it.
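
For illustration only, here is roughly the kind of thing being asked for. It is a sketch under assumptions the thread does not confirm: that an SSN shows up as exactly nine consecutive digits with no separators, and that the map file has one "SSN<TAB>new ID" pair per line. The names load_map and replace_ssns and the sample IDs are made up.

```python
import re

# Assumption: an SSN is nine consecutive digits that are not part of a
# longer run of digits. Real data may use dashes, spaces, etc.
SSN_PATTERN = re.compile(r"(?<!\d)\d{9}(?!\d)")

def load_map(lines):
    """Build {ssn: new_id} from tab-delimited 'ssn<TAB>new_id' lines."""
    ssn_to_id = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ssn, new_id = line.rstrip("\n").split("\t")
        ssn_to_id[ssn] = new_id
    return ssn_to_id

def replace_ssns(line, ssn_to_id):
    """Return the line with every mapped SSN replaced by its new ID."""
    def substitute(match):
        candidate = match.group(0)
        # An SSN that is not in the map is left as it is.
        return ssn_to_id.get(candidate, candidate)
    return SSN_PATTERN.sub(substitute, line)

# Hypothetical sample data, not taken from the real map file.
ssn_to_id = load_map(["123456789\tA000001\n"])
line = "SSN:123456789 spouse 987654321 phone 5551234567"
print(replace_ssns(line, ssn_to_id))
```

A line needs changing exactly when replace_ssns returns something different from its input. Whether nine bare digits is the right test is the real question, and only the data can answer it.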

> 
> 
>>(4) Under what circumstances will it not be possible to replace *ALL* 
>>the SSNs?
> 
> 
> I do not understand this question.

Can you determine from the data, without reference to the map, that a 
particular string of characters is an SSN?

If so, and it is not in the map, why can it not be *added* to the map 
with a new generated ID?
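
If the answer to the first question is yes, extending the map on the fly might look something like the sketch below. The nine-digit pattern, the "N" plus eight digits ID format, and the function names are all assumptions made for the example; the real ID scheme would have to stay exclusive of both the SSNs and the IDs already issued.

```python
import itertools
import re

# Assumption: an SSN is nine consecutive digits, not part of a longer run.
SSN_PATTERN = re.compile(r"(?<!\d)\d{9}(?!\d)")

def make_id_generator(start=1):
    """Return a function that yields hypothetical new IDs: N00000001, ..."""
    counter = itertools.count(start)
    return lambda: "N%08d" % next(counter)

def replace_and_extend(line, ssn_to_id, new_id):
    """Replace every candidate SSN, adding unmapped ones to the map."""
    def substitute(match):
        ssn = match.group(0)
        if ssn not in ssn_to_id:
            # Not in the map yet: generate an ID and remember it, so the
            # same SSN gets the same ID everywhere else in the data.
            ssn_to_id[ssn] = new_id()
        return ssn_to_id[ssn]
    return SSN_PATTERN.sub(substitute, line)

ssn_to_id = {"123456789": "A000001"}  # made-up sample pair
new_id = make_id_generator()
print(replace_and_extend("123456789 not 987654321", ssn_to_id, new_id))
print(ssn_to_id)
```

The extended map would need to be written back out at the end, otherwise the newly generated IDs are lost.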

> 
> 
>>(5) For how long can the data be off-line while it's being transformed?
> 
> 
> The data is on file servers that are unused on weekends and nights.
> 
> 
>>
>>>You have a tab 
>>>delimited txt file that maps the SSNs to the new ID numbers. There are 
>>>500,000 of these number pairs.
>>
>>And what is the source of the SSNs in this file??? Have they been 
>>extracted from the data? How?
> 
> 
> That is irrelevant.

Quite the contrary. If they had been extracted from the data, it would 
mean that someone actually might have a vague clue about how those SSNs 
were represented in the data and might be able to tell you and you might 
be able to tell us so that we could help you.

> 
> 
>>>What is the most efficient way to 
>>>approach this? I have done small-scale find and replace programs before, 
>>>but the scale of this is larger than what I'm accustomed to.
>>>
>>>Any suggestions on how to approach this are much appreciated.
>>
>>A sensible answer will depend on how the data is structured:
>>
>>1. If it's in a database with tables some of which have a column for 
>>SSN, then there's a solution involving SQL.
>>
>>2. If it's in semi-free-text files where the SSNs are marked somehow:
>>
>>---client header---
>>surname: Doe first: John initial: Q SSN:123456789 blah blah
>>or
>><ssn>123456789</ssn>
>>
>>then there's another solution which involves finding the markers ...
>>
>>3. If it's really free text, like
>>"""
>>File note: Today John Q. Doe telephoned to advise that his Social 
>>Security # is 123456789 not 987654321 (which is his wife's) and the soc 
>>sec numbers of his kids Bob & Carol are ....
>>"""
>>then you might be in some difficulty ... google("TREC")
>>
>>
>>AND however you do it, you need to be very aware of the possibility 
>>(especially with really free text) of changing some string of digits 
>>that's NOT an SSN.
> 
> 
> That's possible, but I think not probably.

You "think not probably" -- based on what? pious hope? what the customer 
has told you? inspection of the data?
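
Inspection of the data need not be elaborate. As a rough sketch, assuming again that a candidate SSN is nine consecutive digits (an assumption, as is the name tally_candidates), one could count how many candidates in a sample of the files are in the map and how many are not:

```python
import re
from collections import Counter

# Assumption: a candidate SSN is nine consecutive digits, not part of a
# longer run of digits.
SSN_PATTERN = re.compile(r"(?<!\d)\d{9}(?!\d)")

def tally_candidates(lines, known_ssns):
    """Count nine-digit strings that are / are not in the SSN map."""
    counts = Counter()
    for line in lines:
        for candidate in SSN_PATTERN.findall(line):
            key = "mapped" if candidate in known_ssns else "unmapped"
            counts[key] += 1
    return counts

# Hypothetical sample lines, not real data.
sample = [
    "SSN:123456789 invoice 555000111",
    "phone 5551234567",
]
print(tally_candidates(sample, {"123456789"}))
```

A large "unmapped" count suggests there are plenty of nine-digit strings in the data that are not SSNs in the map. Even a "mapped" hit is not proof: an invoice or part number can coincide with somebody's SSN, so a sample of the hits still wants reading by eye.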


