[Tutor] Logfile Manipulation

Stephen Nelson-Smith sanelson at gmail.com
Mon Nov 9 06:41:12 CET 2009


I've got a large amount of data in the form of 3 apache and 3 varnish
logfiles from 3 different machines.  They are rotated at 0400.  The
logfiles are pretty big: maybe 6 GB per server, uncompressed.

I've got to produce a combined logfile for 0000-2359 for a given day,
with a bit of filtering (removing lines based on text match, bit of
substitution).

I've inherited a nasty shell script that does this, but it is very slow
and hard to read or understand.

I'd like to reimplement this in Python.

Initial questions:

* How does Python compare in performance to shell, awk, etc. in a big
pipeline?  The shell script kills the CPU.
* What's the best way to extract the data for a given time range, e.g.
0000-2359 yesterday?
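
To make the second question concrete, here is a rough sketch of the kind
of thing being asked about, not a tested solution. It assumes Apache
common/combined-style timestamps such as [08/Nov/2009:23:59:59 +0100],
that every input is already in time order, and that all servers log in
the same timezone (the offset is ignored). The names parse_timestamp,
lines_for_day and merge_logs are hypothetical, and the substitution step
is left out.

```python
import heapq
import re
from datetime import date, datetime

# Assumed timestamp layout: [08/Nov/2009:23:59:59 +0100]
# (Apache common/combined style; the real logs may differ).
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")


def parse_timestamp(line):
    """Return the line's timestamp as a naive datetime, or None if absent.

    The UTC offset is ignored, and %b relies on an English/C locale.
    """
    match = TIMESTAMP_RE.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S")


def lines_for_day(lines, day, drop_substrings=()):
    """Yield (timestamp, line) for lines on `day`, minus unwanted matches."""
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None or ts.date() != day:
            continue
        if any(text in line for text in drop_substrings):
            continue
        yield ts, line


def merge_logs(streams, day, drop_substrings=()):
    """Lazily merge several time-ordered line streams into one for `day`."""
    filtered = [lines_for_day(s, day, drop_substrings) for s in streams]
    # heapq.merge holds only one pending item per stream in memory,
    # but it requires each input to be sorted already.
    for _, line in heapq.merge(*filtered, key=lambda pair: pair[0]):
        yield line
```

Each stream can be an open file object, so nothing has to be read into
memory at once. Because the logs rotate at 0400, one day's lines for a
server sit in two files; those could be joined with itertools.chain
before being passed in.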

Any advice or experiences?

S.
-- 
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com

