Is Python Suitable for Large Find & Replace Operations?
rbt
rbt at athop1.ath.vt.edu
Fri Jun 17 08:50:14 EDT 2005
On Fri, 2005-06-17 at 12:33 +1000, John Machin wrote:
> OK then, let's ignore the fact that the data is in a collection of Word
> & Excel files, and let's ignore the scale for the moment. Let's assume
> there are only 100 very plain text files to process, and only 1000 SSNs
> in your map, so it doesn't have to be very efficient.
>
> Can you please write a few lines of Python that would define your task
> -- assume you have a line of text from an input file, show how you would
> determine that it needed to be changed, and how you would change it.
The script is too long to post in its entirety. In short, I open each
file, read it in binary mode in 1MB chunks (to keep memory usage down),
collect the chunks in a list, and then apply the following re to each
chunk:
    ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

like this:

    for chunk in whole_file:
        search = ss.findall(chunk)
        if search:
            validate(search)
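One thing chunked reading has to watch for is an SSN that straddles a chunk boundary. Below is a minimal sketch of one way to handle that, not the actual script: the function name find_ssns and the carry-over logic are assumptions made for illustration. It uses a bytes pattern because the files are read in binary mode.

```python
import io
import re

# Bytes pattern, since the data comes from a binary read.
ss = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')

def find_ssns(fileobj, chunk_size=1024 * 1024):
    """Yield SSN-like byte strings from a binary file object.

    Hypothetical helper: the last 12 bytes of each buffer are held back
    and rescanned with the next chunk, so a match split across two reads
    is still found, and found only once.
    """
    carry = b''
    skip = 0  # leading bytes of buf kept only as left context for \b
    while True:
        chunk = fileobj.read(chunk_size)
        at_eof = not chunk
        buf = carry + chunk
        # Only trust matches that start before `cut`; they end at least
        # one byte before the end of buf, so the trailing \b is reliable.
        cut = len(buf) if at_eof else max(len(buf) - 12, skip)
        for m in ss.finditer(buf):
            if skip <= m.start() < cut:
                yield m.group()
        if at_eof:
            break
        # Keep one byte before `cut` as context for the leading \b.
        carry = buf[max(cut - 1, 0):]
        skip = 1 if cut >= 1 else 0

# Tiny chunk size to force matches across chunk boundaries.
demo = io.BytesIO(b"a 123-45-6789 b 987-65-4321x 111-22-3333")
found = list(find_ssns(demo, chunk_size=5))
```

The same idea works with 1MB chunks; the small chunk size here is only to exercise the boundary case.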
The validate function makes sure the string found is indeed in the range
of a legitimate SSN. You may read about this range here:
http://www.ssa.gov/history/ssn/geocard.html
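To give an idea of the kind of check involved, here is a rough sketch. It is not the validate function from the script, and it does not reproduce the assigned area ranges from the page above. It only applies the structural rules that an area of 000, 666, or 900-999, a group of 00, or a serial of 0000 is never issued. The name filter_plausible is made up for this example.

```python
def filter_plausible(candidates):
    """Return the candidates that pass basic structural SSN rules.

    Hypothetical helper: expects strings shaped like 'AAA-GG-SSSS',
    as produced by the regular expression shown earlier.
    """
    plausible = []
    for ssn in candidates:
        area, group, serial = (int(part) for part in ssn.split('-'))
        if area == 0 or area == 666 or area >= 900:
            continue  # area numbers that are never assigned
        if group == 0 or serial == 0:
            continue  # group 00 and serial 0000 are never assigned
        plausible.append(ssn)
    return plausible

checked = filter_plausible([
    "123-45-6789", "000-12-3456", "666-12-3456",
    "900-12-3456", "123-00-4567", "123-45-0000",
])
```

A stricter check would also compare the area number against the ranges that have actually been issued.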
That is as far as I have gotten. And I hope you can tell that I have
placed some small amount of thought into the matter. I've tested the
find a lot and it is rather accurate in finding SSNs in files. I have
not yet started replacing anything. I've only posted here for advice
before beginning.
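For the replacement step, one possible approach is sketched below. It is only an illustration under assumptions: ssn_map stands in for the map as a plain dict from SSN to new ID, and replace_ssns is a hypothetical name. SSNs that are not in the map are left untouched and collected, so they can be reported rather than silently skipped.

```python
import re

ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def replace_ssns(text, ssn_map):
    """Replace mapped SSNs in text; return (new_text, unmapped_ssns)."""
    unmapped = []

    def swap(match):
        ssn = match.group()
        if ssn in ssn_map:
            return ssn_map[ssn]
        unmapped.append(ssn)  # found in the data but absent from the map
        return ssn            # leave it unchanged

    return ss.sub(swap, text), unmapped

new_text, missing = replace_ssns(
    "id 123-45-6789 and 111-22-3333",
    {"123-45-6789": "ID0001"},
)
```

Passing a function to sub means each file is scanned once regardless of how many entries the map holds.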
>
> >
> >
> >>(4) Under what circumstances will it not be possible to replace *ALL*
> >>the SSNs?
> >
> >
> > I do not understand this question.
>
> Can you determine from the data, without reference to the map, that a
> particular string of characters is an SSN?
See above.
>
> If so, and it is not in the map, why can it not be *added* to the map
> with a new generated ID?
It is not my responsibility to do this. I do not have that authority
within the organization. Have you never worked for a real-world business
and dealt with office politics and territory? ;)
> >>And what is the source of the SSNs in this file??? Have they been
> >>extracted from the data? How?
> >
> >
> > That is irrelevant.
>
> Quite the contrary. If they had been extracted from the data,
They have not. They are generated by a higher authority and then used by
lower authorities such as me.
Again though, I think this is irrelevant to the task at hand... I have a
map, I have access to the data and that is all I need to have, no? I do
appreciate your input though. If you would like to have a further
exchange of ideas, perhaps we should do so off list?