Is Python Suitable for Large Find & Replace Operations?

Fri Jun 17 17:49:27 EDT 2005

rbt wrote:
> On Fri, 2005-06-17 at 09:18 -0400, Peter Hansen wrote:
> 
>>rbt wrote:
>>
>>>The script is too long to post in its entirety. In short, I open the
>>>files, do a binary read (in 1MB chunks for ease of memory usage) on them

*ONE* MB? What are the higher-ups constraining you to run this on? My 
reference points: (1) An off-the-shelf office desktop PC from 
HP/Dell/whoever with 512MB of memory is standard fare. Motherboards with 
4 memory slots (supporting up to 4GB) are common. (2) Reading files in 
1MB chunks seemed appropriate when I was running an 8MB 486 using Win 3.11.

Are you able to run your script on the network server itself? And what 
is the server setup? Windows/Novell/Samba/??? What version of Python are 
you running? Have you considered mmap?

>>>before placing that read into a variable and that in turn into a list
>>>that I then apply the following re to

In the following code, which is the "variable" (chunk?) and which is the 
"list" to which you apply the re? The only list I see is the confusingly 
named "search", which is the *result* of applying re.findall.

>>>
>>>ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

Do users never type spaces or other punctuation (or no punctuation at 
all) instead of dashes?

False positives? E.g. """
Dear X, Pls rush us qty 20 heart pacemaker alarm trigger circuits, part 
number 123-456-78-9012, as discussed.
"""

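To make the false positive concrete, here is a small sketch (modern Python 3 
syntax) that runs the quoted pattern over the part-number text. The "strict" 
pattern is only my assumption of one possible tightening, using lookarounds 
to reject a neighbouring digit or dash; it is not a complete SSN validator.

```python
import re

# The pattern as posted: \b only requires a word boundary on each side.
ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

# Hypothetical stricter variant: refuse a digit or dash on either side.
strict = re.compile(r'(?<![\d-])\d{3}-\d{2}-\d{4}(?![\d-])')

text = "part number 123-456-78-9012, as discussed."

# The dash before "456" counts as a word boundary, so the tail of the
# part number is reported as if it were an SSN.
print(ss.findall(text))

# The stricter pattern rejects it but still accepts a plain SSN.
print(strict.findall(text))
print(strict.findall("SSN 123-45-6789."))
```
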
>>>
>>>like this:
>>>
>>>for chunk in whole_file:
>>>    search = ss.findall(chunk)

You will need the offset to do the update, so you'd better get used to using 
something like finditer now. You may well find that "validate" wants 
to check some context on either side, even if it's just one character -- 
\b may turn out to be inappropriate.
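
A rough sketch of what I mean, in modern Python 3 syntax. The function name 
find_candidates and its base_offset and context parameters are made up for 
illustration; your "validate" would receive these tuples instead of the bare 
strings that findall returns.

```python
import re

# Bytes pattern, since the files are read in binary mode.
ss = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')

def find_candidates(chunk, base_offset=0, context=1):
    """Yield (file_offset, matched_bytes, before, after) for each candidate.

    base_offset is the position of the chunk within the file (hypothetical
    parameter); before/after hold up to `context` bytes on either side.
    """
    for m in ss.finditer(chunk):
        before = chunk[max(0, m.start() - context):m.start()]
        after = chunk[m.end():m.end() + context]
        yield base_offset + m.start(), m.group(), before, after

# Example: a chunk that starts at byte 100 of some file.
for hit in find_candidates(b"id 123-45-6789;", base_offset=100):
    print(hit)
```

Note that the context is limited to the current chunk, so a match at the very 
start or end of a chunk gets an empty "before" or "after".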

>>>    if search:
>>>        validate(search)
>>
>>This seems so obvious that I hesitate to ask, but is the above really a 
>>simplification of the real code, which actually handles the case of SSNs 
>>that lie over the boundary between chunks? In other words, what happens 
>>if the first chunk has only the first four digits of the SSN, and the 
>>rest lies in the second chunk?
>>
>>-Peter
> 
> 
> No, that's a good question. As of now, there is nothing to handle the
> scenario that you bring up. I have considered this possibility (rare but
> possible). I have not written a solution for it. It's a very good point
> though.

How rare? What is the average file size? What is the average 
number of SSNs per file? What is the maximum file size (you did 
mention mostly MS Word and Excel, so they can't be very big)?

N.B. Using mmap just blows this problem away. Using bigger memory chunks 
diminishes it.
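
For what it's worth, here is a minimal mmap sketch in modern Python 3 syntax. 
The function name find_in_file is hypothetical and the pattern is the one 
quoted above, compiled as bytes. Because the regex sees the whole file as one 
buffer, there are no chunk boundaries for an SSN to straddle.

```python
import mmap
import os
import re

ss = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')

def find_in_file(path):
    """Return a list of (offset, matched_bytes) for every candidate in the file."""
    # mmap cannot map a zero-length file, so deal with that case first.
    if os.path.getsize(path) == 0:
        return []
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The re module accepts the mmap object as a bytes-like buffer.
            return [(m.start(), m.group()) for m in ss.finditer(mm)]
```

The operating system pages the file in as needed, so this does not require 
the whole file to fit in physical memory at once.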

And what percentage of files contain no SSNs at all?

Are you updating in-situ, or must your script copy the files, changing 
SSNs as it goes? Do you have a choice between the two methods?

Are you allowed to constrain the list of input files so that you don't 
update binaries (e.g. exe files)?

> 
> Is that not why proper software is engineered? Anyone can build a go
> cart (write a program), but it takes a team of engineers and much
> testing to build a car, no? Which would you rather be riding in during a
> crash? I wish upper mgt had a better understanding of this ;)
> 

Don't we all so wish? Meanwhile, back in the real world, ...