Parsing a potentially corrupted file

Paul Moore p.f.moore at gmail.com
Wed Dec 14 06:43:44 EST 2016


I'm looking for a reasonably "clean" way to parse a log file that potentially has incomplete records in it.

The basic structure of the file is a set of multi-line records. Each record starts with a series of fields delimited by [...] (the first of which is always a date), optionally separated by whitespace. Then there's a trailing "free text" field, optionally followed by a multi-line field delimited by [[...]].

So, example records might be

[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here

(a record delimited by the end of the line)

or

[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here [[Additional
data, potentially multiple lines

including blank lines
goes here
]]

The terminating ]] is on a line of its own.
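To make the format concrete, here is a minimal sketch of parsing one complete record. The names FIELD and parse_record are hypothetical, and the sketch assumes that the free text neither starts with a [...] group nor contains "[[" itself; it is an illustration of the layout described above, not the code I actually use.

```python
import re

# One leading [..] field. Brackets are excluded from the field body, so the
# opening "[[" of the multi-line block is never mistaken for a field.
FIELD = re.compile(r"\s*\[([^\[\]]*)\]")


def parse_record(text):
    """Return (fields, description, extra) for one complete record."""
    fields = []
    pos = 0
    # Consume bracketed fields from the front until something else appears.
    while True:
        match = FIELD.match(text, pos)
        if match is None:
            break
        fields.append(match.group(1))
        pos = match.end()

    # Whatever follows is free text, optionally followed by a [[...]] block.
    description, marker, extra = text[pos:].partition("[[")
    if marker:
        extra = extra.rstrip()
        if extra.endswith("]]"):
            extra = extra[:-2]  # drop the terminating ]]
    else:
        extra = None  # assumption: None signals "no multi-line field"
    return fields, description.strip(), extra
```

Empty fields such as [] come back as empty strings, so the position of each field within the record is preserved.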

This is a messy format to parse, but it's manageable. However, there's a catch. Because the logging software involved is broken, I can occasionally get a log record prematurely terminated with a new record starting mid-stream. So something like the following:

[2016-11-30T20:04:08.000+00:00] [Component] [le[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here

I'm struggling to find a "clean" way to parse this. I've managed a clumsy approach, by splitting the file contents on the pattern [dddd-dd-ddTdd:dd:dd.ddd+dd:dd] (the timestamp - I've never seen a case where this gets truncated) and then treating each entry as a record and parsing it individually. But the resulting code isn't exactly maintainable, and I'm looking for something cleaner.
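For reference, the splitting step looks roughly like the sketch below. The names TIMESTAMP and split_records are hypothetical. The "truncated" flag is a heuristic I'm assuming for illustration: a complete record ends at the end of a line, so a chunk that is followed by another timestamp without an intervening newline was presumably cut off.

```python
import re

# Assumed timestamp shape: [YYYY-MM-DDTHH:MM:SS.mmm+HH:MM]
TIMESTAMP = re.compile(
    r"\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}\]"
)


def split_records(contents):
    """Return a list of (chunk, truncated) pairs, one per timestamp found.

    Each chunk runs from one timestamp up to the next one (or to the end of
    the input). Any text before the first timestamp is discarded.
    """
    starts = [m.start() for m in TIMESTAMP.finditer(contents)]
    ends = starts[1:] + [len(contents)]
    records = []
    for start, end in zip(starts, ends):
        chunk = contents[start:end]
        # Heuristic: a chunk that stops mid-line before another record
        # starts is treated as prematurely terminated.
        truncated = end < len(contents) and not chunk.endswith("\n")
        records.append((chunk, truncated))
    return records
```

Each chunk that is not flagged as truncated can then be parsed as an ordinary record, and the truncated ones can be logged or skipped.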

Does anyone have any suggestions for a good way to parse this data?

Thanks,
Paul


