Is Python Suitable for Large Find & Replace Operations?

rbt rbt at athop1.ath.vt.edu
Fri Jun 17 08:50:14 EDT 2005


On Fri, 2005-06-17 at 12:33 +1000, John Machin wrote:

> OK then, let's ignore the fact that the data is in a collection of Word 
> & Excel files, and let's ignore the scale for the moment. Let's assume 
> there are only 100 very plain text files to process, and only 1000 SSNs 
> in your map, so it doesn't have to be very efficient.
> 
> Can you please write a few lines of Python that would define your task 
> -- assume you have a line of text from an input file, show how you would 
> determine that it needed to be changed, and how you would change it.

The script is too long to post in its entirety. In short, I open each
file, read it in binary mode (in 1MB chunks to keep memory usage down),
and append each chunk to a list. I then apply the following re to each
chunk in that list:

ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

like this:

for chunk in whole_file:
    search = ss.findall(chunk)
    if search:
        validate(search)
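
For anyone who wants something runnable, here is a minimal sketch of the
chunked read and search described above. It assumes Python 3, where a
binary read returns bytes, so the pattern is written as a bytes pattern.
The names read_chunks, find_candidates and CHUNK_SIZE are made up for this
illustration and are not from the actual script.

```python
import re

# Bytes pattern, because the files are opened in binary mode (Python 3).
ss = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')

CHUNK_SIZE = 1024 * 1024  # 1MB per read, as described above


def read_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield successive binary chunks of the file at path (hypothetical helper)."""
    with open(path, 'rb') as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def find_candidates(path, chunk_size=CHUNK_SIZE):
    """Return every SSN-shaped string found in the file, as text."""
    found = []
    for chunk in read_chunks(path, chunk_size):
        # The pattern only matches ASCII digits and hyphens, so decoding is safe.
        found.extend(hit.decode('ascii') for hit in ss.findall(chunk))
    return found
```

One caveat with this approach: a string that straddles the boundary
between two chunks would not be found by this loop. Carrying the last few
bytes of each chunk over into the next read is one way around that.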

The validate function makes sure the string found is indeed in the range
of a legitimate SSN. You may read about this range here:

http://www.ssa.gov/history/ssn/geocard.html
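
A rough sketch of what such a range check might look like follows. The
real validate function is not shown here, so the rules below are an
assumption: they are the commonly cited structural checks (area number
not 000, 666 or 900-999; group not 00; serial not 0000), and the helper
name looks_like_valid_ssn is made up for the example.

```python
import re

# Splits a candidate into area, group and serial numbers.
SSN_PARTS = re.compile(r'(\d{3})-(\d{2})-(\d{4})')


def looks_like_valid_ssn(candidate):
    """Return True if candidate passes the assumed structural checks."""
    match = SSN_PARTS.fullmatch(candidate)
    if match is None:
        return False
    area, group, serial = (int(part) for part in match.groups())
    # Assumed rules: these area numbers are never issued.
    if area == 0 or area == 666 or area >= 900:
        return False
    # Assumed rules: an all-zero group or serial number is never issued.
    if group == 0 or serial == 0:
        return False
    return True


def validate(candidates):
    """Keep only the candidates that pass the checks above."""
    return [c for c in candidates if looks_like_valid_ssn(c)]
```

A stricter check could also compare the area number against the ranges
listed on the page linked above.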

That is as far as I have gotten, and I hope you can tell that I have
put some thought into the matter. I have tested the find step a lot and
it is fairly accurate at finding SSNs in files. I have not yet started
replacing anything; I posted here for advice before beginning.
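
As a starting point for the replace step, here is a minimal sketch that
uses re.sub with a callback. It assumes the map can be loaded into a
plain dict keyed by SSN; the map contents and the name replace_ssns are
hypothetical. SSNs that are not in the map are left untouched.

```python
import re

ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

# Hypothetical map contents, for illustration only.
ssn_map = {'123-45-6789': 'ID-000001'}


def replace_ssns(text, mapping):
    """Replace each mapped SSN in text with its ID; leave the rest as-is."""
    def swap(match):
        found = match.group(0)
        # Fall back to the original string when the SSN is not in the map.
        return mapping.get(found, found)
    return ss.sub(swap, text)
```

For example, replace_ssns('a 123-45-6789 b', ssn_map) would substitute the
mapped ID and leave everything else in the line unchanged.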

> 
> > 
> > 
> >>(4) Under what circumstances will it not be possible to replace *ALL* 
> >>the SSNs?
> > 
> > 
> > I do not understand this question.
> 
> Can you determine from the data, without reference to the map, that a 
> particular string of characters is an SSN?

See above.

> 
> If so, and it is not in the map, why can it not be *added* to the map 
> with a new generated ID?

It is not my responsibility to do this. I do not have that authority
within the organization. Have you never worked for a real-world business
and dealt with office politics and territory? ;)

> >>And what is the source of the SSNs in this file??? Have they been 
> >>extracted from the data? How?
> > 
> > 
> > That is irrelevant.
> 
> Quite the contrary. If they had been extracted from the data,

They have not. They are generated by a higher authority and then used by
lower authorities such as me. 

Again though, I think this is irrelevant to the task at hand... I have a
map, I have access to the data and that is all I need to have, no? I do
appreciate your input though. If you would like to have a further
exchange of ideas, perhaps we should do so off list?



