Parsing a potentially corrupted file

alister alister.ware at ntlworld.com
Wed Dec 14 08:38:38 EST 2016


On Wed, 14 Dec 2016 03:43:44 -0800, Paul Moore wrote:

> I'm looking for a reasonably "clean" way to parse a log file that
> potentially has incomplete records in it.
> 
> The basic structure of the file is a set of multi-line records. Each
> record starts with a series of fields delimited by [...] (the first of
> which is always a date), optionally separated by whitespace. Then
> there's a trailing "free text" field, optionally followed by a
> multi-line field delimited by [[...]]
> 
> So, example records might be
> 
> [2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id]
> Description of the issue goes here
> 
> (a record delimited by the end of the line)
> 
> or
> 
> [2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id]
> Description of the issue goes here [[Additional data, potentially
> multiple lines
> 
> including blank lines goes here ]]
> 
> The terminating ]] is on a line of its own.
> 
> This is a messy format to parse, but it's manageable. However, there's a
> catch. Because the logging software involved is broken, I can
> occasionally get a log record prematurely terminated with a new record
> starting mid-stream. So something like the following:
> 
> [2016-11-30T20:04:08.000+00:00] [Component]
> [le[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id]
> Description of the issue goes here
> 
> I'm struggling to find a "clean" way to parse this. I've managed a
> clumsy approach, by splitting the file contents on the pattern
> [dddd-dd-ddTdd:dd:dd.ddd+dd:dd] (the timestamp - I've never seen a case
> where this gets truncated) and then treating each entry as a record and
> parsing it individually. But the resulting code isn't exactly
> maintainable, and I'm looking for something cleaner.
> 
> Does anyone have any suggestions for a good way to parse this data?
> 
> Thanks,
> Paul

First question: do you (or anyone you can contact) have any control over the
logging application?

If so, the best approach would be to get the log file output fixed.

If not, then you will probably be stuck with a messy solution :-(
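
If you are stuck with it, the split-on-timestamp approach you describe can
at least be kept fairly tidy by separating "find the record boundaries" from
"parse one record". What follows is only a rough sketch under some
assumptions: the names split_records, parse_record and EXPECTED_FIELDS are
made up for illustration, the count of six header fields is guessed from
your examples, and it assumes a timestamp never appears inside the free text
or the [[...]] block.

```python
import re

# Timestamp that opens every record, e.g. [2016-11-30T20:04:08.000+00:00]
TIMESTAMP = re.compile(
    r"\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}\]"
)
# One [...] header field, optionally preceded by whitespace
FIELD = re.compile(r"\s*\[([^\[\]]*)\]")
# Assumption: a complete record has six header fields, as in the examples
EXPECTED_FIELDS = 6


def split_records(text):
    """Cut the text into chunks, each starting at a timestamp."""
    starts = [m.start() for m in TIMESTAMP.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


def parse_record(chunk):
    """Parse one chunk; 'complete' is False if it looks truncated."""
    fields = []
    pos = 0
    while len(fields) < EXPECTED_FIELDS:
        m = FIELD.match(chunk, pos)
        if m is None:
            break  # ran out of well-formed [...] fields
        fields.append(m.group(1))
        pos = m.end()
    complete = len(fields) == EXPECTED_FIELDS

    # Whatever follows the header is free text, optionally with [[...]]
    rest = chunk[pos:].strip()
    text, opener, extra = rest.partition("[[")
    if opener:
        extra = extra.rstrip()
        if extra.endswith("]]"):
            extra = extra[:-2].strip()
        else:
            complete = False  # multi-line block was never closed
    else:
        extra = None

    return {
        "fields": fields,
        "text": text.strip(),
        "extra": extra,
        "complete": complete,
    }
```

Something like [parse_record(c) for c in split_records(data)] then gives one
dict per record, and the truncated ones can be filtered out or logged by
checking the "complete" flag rather than being special-cased inside the
parser.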



-- 
Sin has many tools, but a lie is the handle which fits them all.


