[Tutor] find and replace multi-line strings in a directory tree

Fulvio Casali fulviocasali at gmail.com
Wed Jul 11 07:08:21 CEST 2012


Hello Tutor,
First-time poster here.

I wrote a little script to solve a find and replace problem, and I was
wondering if it could be improved, made more pythonic, and such.
The problem was that I had a directory tree with several thousand files,
about 2000 of which were static HTML (I blame my predecessor for this),
with the typical Google Analytics tracking code:

<script type="text/javascript">

  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-11111111-1']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();

</script>

Now my client started worrying about the privacy of its users, and asked me
to remove all these snippets.  I wrote a regex that would capture this
multi-line snippet.
In addition, several HTML files also had event handlers calling the GA
tracking script, e.g.:

<a href="crime_and_punishment_jp.pdf" onMouseDown="javascript:
_gaq.push(['_trackEvent', 'Translations', 'Downloaded PDF',
'Crime and Punishment vol 2 (Japanese)']); ">
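
To illustrate why no special regex flags are needed for the multi-line case,
here is a minimal sketch that runs the two patterns from the script in this
post against a shortened, made-up page. The sample string and the file name
a.pdf are assumptions for illustration only, and the dot in "_gaq.push" is
escaped here so that it matches only a literal dot.

```python
import re

# Patterns taken from the script in this post (dot in "_gaq.push" escaped).
script_pattern = re.compile(r'<script[^<]*_gaq[^<]*</script>')
event_pattern = re.compile(r'onMouseDown\s*=\s*"\s*javascript:\s*_gaq\.push[^"]+"')

# Hypothetical sample page, shortened from the snippets shown above.
sample = """<html><head>
<script type="text/javascript">

  var _gaq = _gaq || [];
  _gaq.push(['_trackPageview']);

</script>
</head><body>
<a href="a.pdf" onMouseDown="javascript:
_gaq.push(['_trackEvent', 'Translations', 'Downloaded PDF']); ">PDF</a>
</body></html>
"""

# [^<]* and [^"]+ are negated character classes, so they match newlines too;
# neither re.DOTALL nor re.MULTILINE is required to span several lines.
cleaned = script_pattern.sub('', sample)
cleaned = event_pattern.sub('', cleaned)
print(cleaned)
```

The script pattern relies on the snippet body containing no "<" character,
which holds for the tracking code shown above but would not hold for
arbitrary JavaScript.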

The one improvement I would have hoped to make was to use
fileinput.FileInput, but I couldn't figure out how to make it work beyond
the one-line-at-a-time scenario.  It sure would be nice to hide all the
open(), read(), seek(), write(), truncate(), close() nonsense.
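
Since fileinput works one line at a time, one possible way to hide the file
handling for whole-file edits is a small helper that reads the file, applies
a function to its contents, and writes the result back only if something
changed. This is only a sketch: rewrite_file and transform are hypothetical
names, not part of fileinput or any other library, and the demo runs on a
throwaway temporary file.

```python
import os
import tempfile


def rewrite_file(path, transform):
    """Apply transform (bytes -> bytes) to a whole file; return True if it changed."""
    with open(path, 'rb') as f:
        original = f.read()
    updated = transform(original)
    if updated == original:
        return False  # leave untouched files alone
    with open(path, 'wb') as f:  # 'wb' truncates, so no seek()/truncate() needed
        f.write(updated)
    return True


# Usage on a temporary file: the replaced text spans a line break.
fd, demo_path = tempfile.mkstemp(suffix='.html')
os.close(fd)
with open(demo_path, 'wb') as f:
    f.write(b'<p>line one\nline two</p>\n')

changed = rewrite_file(demo_path, lambda data: data.replace(b'one\nline', b'1 /'))
with open(demo_path, 'rb') as f:
    result = f.read()
os.remove(demo_path)
print(changed, result)
```

The trade-off is that the file is opened twice and the write is not atomic,
so an interruption between truncating and writing could lose data.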

That said, my script works fine, and takes only a few seconds to traverse
the several thousand files.
Here is my script:

from os import walk
from os.path import join
from re import compile

def handler(error):
  # Report directories that walk() could not read instead of skipping them silently.
  print(error)

# Bytes patterns, because the files are opened in binary mode below.
# [^<]* also matches newlines, so the multi-line snippet is matched as a whole.
script_pattern = compile(br'<script[^<]*_gaq[^<]*</script>')

event_pattern = compile(br'onMouseDown\s*=\s*"\s*javascript:\s*_gaq\.push[^"]+"')

modified = []  # files that contained the tracking snippet
event = []     # files that also contained a tracking event handler

for root, dirs, files in walk('rc', onerror=handler):
  for name in files:
    path = join(root, name)
    f = open(path, 'r+b')
    content = f.read()
    f.seek(0)
    if script_pattern.search(content):
      modified.append(path)
      fixed = script_pattern.sub(b'', content)
      # Event handlers are only removed from files that also had the snippet.
      if event_pattern.search(fixed):
        event.append(path)
        fixed = event_pattern.sub(b'', fixed)
      f.write(fixed)
      f.truncate()
    f.close()

for i in modified: print(i)
print(len(modified))

for i in event: print(i)
print(len(event))
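
One possible refinement, offered as a sketch rather than as part of the
script above: re.subn returns the number of replacements along with the new
text, so the separate search() calls can be dropped. The function name
strip_tracking and the sample bytes are assumptions for illustration. Note
that, unlike the script above, this version removes event handlers even from
files that have no tracking snippet.

```python
import re

script_pattern = re.compile(br'<script[^<]*_gaq[^<]*</script>')
event_pattern = re.compile(br'onMouseDown\s*=\s*"\s*javascript:\s*_gaq\.push[^"]+"')


def strip_tracking(data):
    """Return (cleaned_bytes, script_count, event_count) for one file's contents."""
    data, script_count = script_pattern.subn(b'', data)
    data, event_count = event_pattern.subn(b'', data)
    return data, script_count, event_count


# Hypothetical file contents, kept short for the demo.
page = (b'<script type="text/javascript">\n  var _gaq = _gaq || [];\n</script>\n'
        b'<a href="x.pdf" onMouseDown="javascript: _gaq.push([\'_trackEvent\']); ">x</a>\n')

cleaned, n_scripts, n_events = strip_tracking(page)
print(n_scripts, n_events)
print(cleaned)
```

A file then needs rewriting only when one of the two counts is non-zero.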


Thanks for any feedback!

-- 
Fulvio