Parsing a potentially corrupted file

Paul Moore p.f.moore at gmail.com
Wed Dec 14 09:07:27 EST 2016


On Wednesday, 14 December 2016 12:57:23 UTC, Chris Angelico wrote:
> Is the "[Component]" section something you could verify? (That is - is
> there a known list of components?) If so, I would include that as a
> secondary check. Ditto anything else you can check (I'm guessing the
> [level] is one of a small set of values too.)

Possibly, although the point of this exercise is to analyze the structure of a basically undocumented log format. So if I validate too tightly, I end up just checking my assumptions rather than checking the data :-(

> The logic would be
> something like this:
> 
> Read line from file.
> Verify line as a potential record:
>     Assert that line begins with timestamp.
>     Verify as many fields as possible (component, level, etc)
>     Search line for additional timestamp.
>     If additional timestamp found:
>         Recurse. If verification fails, assume we didn't really have a corrupted line.
>         (Process partial line? Or discard?)
>     If "[[" in line:
>         Until line is "]]":
>             Read line from file, append to description
>             If timestamp found:
>                 Recurse. If verification succeeds, break out of loop.
> 
> Unfortunately it's still not really clean, but that's the nature of
> working with messy data. Coping with ambiguity is *hard*.

Yeah, that's essentially what I have now (roughly the shape sketched below). As I say, it's working, but nobody could really love it. But you're right, it's more the fault of the data than of the code.
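For illustration, here's an untested sketch of that loop. The timestamp pattern and the plausibility check are pure inventions standing in for the real, undocumented format:

    import re

    # Invented timestamp shape -- the real format is undocumented,
    # so this is purely a placeholder.
    TIMESTAMP = re.compile(r'^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\]')

    def plausible(line):
        # Cheap checks only: timestamp first, plus whatever field
        # checks (component, level, ...) we're willing to trust.
        return bool(TIMESTAMP.match(line))

    def records(lines):
        # Yield one logical record per timestamped line, folding any
        # "[[" ... "]]" block into the record that opened it.
        lines = iter(lines)
        carried = None
        while True:
            line = carried if carried is not None else next(lines, None)
            carried = None
            if line is None:
                return
            if not plausible(line):
                continue  # corrupt fragment: skip it (or log it)
            record = [line.rstrip('\n')]
            if '[[' in line:
                for cont in lines:
                    cont = cont.rstrip('\n')
                    if cont == ']]':
                        record.append(cont)
                        break
                    if TIMESTAMP.match(cont):
                        # Block probably truncated: treat this line as
                        # the start of the next record, not as part of
                        # the description.
                        carried = cont
                        break
                    record.append(cont)
            yield '\n'.join(record)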

One thought I had, which I might try, is to go with the timestamp as the one assumption I make about the data, and read the file in as, in effect, a text stream, spitting out a record every time I see something matching the [timestamp] pattern. Then parse record by record. Truncated records should either be obvious (because the delimited fields have start and end markers, so unmatched markers = truncated record) or acceptable (because undelimited fields are free text). I'm OK with ignoring the possibility that the free text contains something that looks like a timestamp.
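With an invented timestamp pattern again, the in-memory version of that split is just a scan for match positions:

    import re

    # Invented timestamp shape -- substitute whatever the real one is.
    TIMESTAMP = re.compile(r'\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\]')

    def split_records(text):
        # Each record runs from its own [timestamp] to the start of
        # the next one (or the end of the text); anything before the
        # first timestamp is discarded as noise.
        starts = [m.start() for m in TIMESTAMP.finditer(text)]
        for begin, end in zip(starts, starts[1:] + [len(text)]):
            yield text[begin:end]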

The only problem with this approach is that I have more data than I'd really like to read into memory all at once, so I'd need to do some sort of streamed match/split processing. But thinking about it, that sounds like the sort of job a series of chained generators could manage. Maybe I'll look at that approach...
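Untested, but the generator pipeline might look something like this (same invented timestamp shape, matched only at the start of a line, which also sidesteps timestamps buried in free text):

    import re

    # Invented timestamp shape, anchored to the start of the line.
    TIMESTAMP = re.compile(r'^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\]')

    def read_lines(path):
        # Stage 1: stream the file line by line; never slurp it all.
        with open(path, encoding='utf-8', errors='replace') as f:
            yield from f

    def group_records(lines):
        # Stage 2: buffer lines until one opens the next record, then
        # emit what we've collected. Any junk before the first
        # timestamp comes out as a leading chunk of its own.
        buffer = []
        for line in lines:
            if TIMESTAMP.match(line) and buffer:
                yield ''.join(buffer)
                buffer = []
            buffer.append(line)
        if buffer:
            yield ''.join(buffer)

    # Stage 3 would parse each record and flag unmatched delimiters:
    # for record in group_records(read_lines('app.log')):
    #     parse(record)   # hypothetical per-record parser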

Paul


