[Tutor] find and replace multi-line strings in a directory tree
Fulvio Casali
fulviocasali at gmail.com
Wed Jul 11 07:08:21 CEST 2012
Hello Tutor,
First time poster here.
I wrote a little script to solve a find and replace problem, and I was
wondering if it could be improved, made more pythonic, and such.
The problem was that I had a directory tree with several thousand files,
about 2000 of which were static HTML (I blame my predecessor for this),
with the typical Google Analytics tracking code:

    <script type="text/javascript">

      var _gaq = _gaq || [];
      _gaq.push(['_setAccount', 'UA-11111111-1']);
      _gaq.push(['_trackPageview']);

      (function() {
        var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
        ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
        var s = document.getElementsByTagName('script')[0];
        s.parentNode.insertBefore(ga, s);
      })();

    </script>

Now my client started worrying about the privacy of its users, and asked me
to remove all these snippets. I wrote a regex that would capture this
multi-line snippet.
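
A minimal sketch of why the script pattern from the script further down can span several lines without re.DOTALL: a negated character class such as [^<] also matches newlines. The sample HTML and the variable names other than script_pattern are made up for illustration, and the code is written for Python 3.

```python
import re

# Same pattern as in the script: everything from "<script" to "</script>",
# provided "_gaq" appears in between and no "<" does.
script_pattern = re.compile(r'<script[^<]*_gaq[^<]*</script>')

# Illustrative page: one tracking snippet and one unrelated script.
sample = """<html><body>
<p>Hello</p>
<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-11111111-1']);
  _gaq.push(['_trackPageview']);
</script>
<script type="text/javascript">var keep = 1;</script>
</body></html>
"""

cleaned = script_pattern.sub('', sample)

# Limitation: any "<" inside the script body stops [^<]* and prevents a match.
blocked = script_pattern.search(
    "<script>if (a < b) { _gaq.push([]); }</script>")
```

Here cleaned no longer contains the tracking snippet but still contains the unrelated script, and blocked is None because of the "<" in the comparison.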
In addition, several HTML files also had event handlers calling the GA
tracking script, e.g.:

    <a href="crime_and_punishment_jp.pdf" onMouseDown="javascript:
    _gaq.push(['_trackEvent', 'Translations', 'Downloaded PDF', 'Crime and
    Punishment vol 2 (Japanese)']); ">

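
As a quick check of the event-handler case, here is a sketch (Python 3) that runs a pattern like the one in the script further down against the link above. The one deliberate difference is that the dot in _gaq.push is escaped, since an unescaped dot matches any character; the names link and stripped are illustrative only.

```python
import re

# Event-handler pattern with the dot escaped (an assumption, not the
# original spelling, which uses a bare ".").
event_pattern = re.compile(
    r'onMouseDown\s*=\s*"\s*javascript:\s*_gaq\.push[^"]+"')

# The example link, with its line breaks kept as they appear above.
link = ('<a href="crime_and_punishment_jp.pdf" onMouseDown="javascript:\n'
        "_gaq.push(['_trackEvent', 'Translations', 'Downloaded PDF', 'Crime and\n"
        "Punishment vol 2 (Japanese)']); \">")

# Removing the attribute leaves the tag intact, plus one stray space.
stripped = event_pattern.sub('', link)
print(stripped)
```

The pattern is case-sensitive, so an attribute spelled onmousedown would be missed unless re.IGNORECASE is passed to re.compile.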
The one improvement I would have hoped to make was to use
fileinput.FileInput, but I couldn't figure out how to make it work beyond
the one-line-at-a-time scenario. It sure would be nice to hide all the
open(), read(), seek(), write(), truncate(), close() nonsense.
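
Since fileinput is built around line-by-line iteration, a pattern spanning several lines still needs the whole file in memory. One possible way to hide the file handling is a small helper, sketched here for Python 3. The name rewrite_in_place and its transform argument are hypothetical, and the pattern is compiled from bytes to mirror the binary 'r+b' mode used in the script.

```python
import os
import re
import tempfile

# Bytes pattern, because the file is opened in binary mode.
script_pattern = re.compile(rb'<script[^<]*_gaq[^<]*</script>')


def rewrite_in_place(path, transform):
    """Apply transform(data) to a file's full contents.

    Hypothetical helper: writes back only when the contents changed
    and returns True in that case, False otherwise.
    """
    with open(path, 'r+b') as handle:
        original = handle.read()
        updated = transform(original)
        if updated == original:
            return False
        handle.seek(0)
        handle.write(updated)
        handle.truncate()  # drop leftover bytes from the longer original
        return True


def strip_snippet(data):
    """Remove every tracking snippet from the given bytes."""
    return script_pattern.sub(b'', data)


# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'page.html')
    with open(demo_path, 'wb') as out:
        out.write(b'<p>hi</p>\n<script>\nvar _gaq = [];\n</script>\n')

    changed = rewrite_in_place(demo_path, strip_snippet)
    changed_again = rewrite_in_place(demo_path, strip_snippet)
    with open(demo_path, 'rb') as check:
        result = check.read()
```

The with statement closes the file even if the transform raises, and skipping the write for unchanged files leaves their modification times alone.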
That said, my script works fine, and takes only a few seconds to traverse
the several thousand files.
Here is my script:

    from os import walk
    from os.path import join
    from re import compile

    def handler(error):
        print error

    script_pattern=compile(r'<script[^<]*_gaq[^<]*</script>')

    event_pattern=compile(r'onMouseDown\s*=\s*"\s*javascript:\s*_gaq.push[^"]+"')

    modified = []
    event = []

    for root, dirs, files in walk('rc', onerror=handler):
        for name in files:
            path = join(root, name)
            file = open(path, 'r+b')
            all = file.read()
            file.seek(0)
            if script_pattern.search(all):
                modified.append(path)
                fixed = script_pattern.sub('', all)
                if event_pattern.search(fixed):
                    event.append(path)
                    fixed = event_pattern.sub('', fixed)
                file.write(fixed)
                file.truncate()
            file.close()

    for i in modified: print i
    print len(modified)

    for i in event: print i
    print len(event)

Thanks for any feedback!
--
Fulvio