pyparsing question

Wed Jan 2 03:50:15 EST 2008

On Jan 1, 5:32 pm, hubritic <colinland... at gmail.com> wrote:
> I am trying to parse data that looks like this:
>
> IDENTIFIER    TIMESTAMP   T  C   RESOURCE_NAME   DESCRIPTION
> 2BFA76F6     1208230607   T   S   SYSPROC                    SYSTEM
> SHUTDOWN BY USER
> A6D1BD62   1215230807     I
> H                                            Firmware Event
>
<snip>

> The data I have has a fixed number of characters per field, so I could
> split it up that way, but wouldn't that defeat the purpose of using a
> parser?  

I think you have this backwards.  I use pyparsing for a lot of text
processing, but if it is not a good fit, or if str.split is all that
is required, there is no real rationale for using anything more
complicated.

> I am determined to become proficient with pyparsing so I am
> using it even when it could be considered overkill; thus, it has gone
> past mere utility now, this is a matter of principle!
>

Well, I'm glad you are driven to learn pyparsing if it kills you, but
John Machin has a good point.  This data is really so amenable to
something as simple as:

for line in logfile:
    id,timestamp,t,c,resource_and_description = line.split(None,4)

that it is difficult to recommend pyparsing for this case.  The sample
you posted was space-delimited, but if it is tab-delimited, and there
is a pair of tabs between the "H" and "Firmware Event" on the second
line, then just use split("\t") for your data and be done.
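
To make the split approach concrete, here is a minimal sketch using only the standard library. The two space-delimited lines are the sample rows from the original post with the mail wrapping undone. The tab-delimited line is purely an assumption about what the real file might look like.

```python
# Sample rows from the post, rejoined after the mail client wrapped them.
space_delimited = [
    "2BFA76F6 1208230607 T S SYSPROC SYSTEM SHUTDOWN BY USER",
    "A6D1BD62 1215230807 I H Firmware Event",
]

for line in space_delimited:
    # maxsplit=4 yields at most five pieces; the fifth keeps its inner spaces,
    # so resource name and description still have to be separated afterwards.
    ident, timestamp, t, c, resource_and_description = line.strip().split(None, 4)
    print(ident, timestamp, t, c, repr(resource_and_description))

# Hypothetical tab-delimited form of the second row (an assumption, the real
# file may not use tabs). The empty RESOURCE_NAME column appears as two
# consecutive tabs, which split("\t") preserves as an empty string.
tab_line = "A6D1BD62\t1215230807\tI\tH\t\tFirmware Event"
fields = tab_line.split("\t")
print(fields)
```

Note that with whitespace splitting an empty RESOURCE_NAME is indistinguishable from a description, which is the ambiguity discussed next. With tabs, the empty column survives as its own field.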

Still, pyparsing may be helpful in disambiguating that RESOURCE_NAME
and DESCRIPTION text.  One approach would be to enumerate (if
possible) the different values of RESOURCE_NAME.  Something like this:

from pyparsing import Word, alphanums, nums, oneOf, Optional, restOfLine

ident = Word(alphanums)
timestamp = Word(nums,exact=10)

# I don't know what these are, I'm just getting the values
# from the sample text you posted
t_field = oneOf("T I")
c_field = oneOf("S H")

# I'm just guessing here, you'll need to provide the actual
# values from your log file
resource_name = oneOf("SYSPROC USERPROC IOSUBSYS whatever")

logline = ident("identifier") + timestamp("time") + \
    t_field("T") + c_field("C") + \
    Optional(resource_name, default="")("resource") + \
    Optional(restOfLine, default="")("description")
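
As a rough illustration of how such a grammar might be exercised, the sketch below runs it over the two sample rows from the original post. The resource names passed to oneOf are placeholders, not values known to occur in the real log, and the dictionary layout is just one convenient way to look at the results.

```python
from pyparsing import Word, alphanums, nums, oneOf, Optional, restOfLine

ident = Word(alphanums)
timestamp = Word(nums, exact=10)
t_field = oneOf("T I")
c_field = oneOf("S H")
# Placeholder values: substitute the resource names that really occur.
resource_name = oneOf("SYSPROC USERPROC IOSUBSYS")

logline = (ident("identifier") + timestamp("time")
           + t_field("T") + c_field("C")
           + Optional(resource_name, default="")("resource")
           + Optional(restOfLine, default="")("description"))

samples = [
    "2BFA76F6 1208230607 T S SYSPROC SYSTEM SHUTDOWN BY USER",
    "A6D1BD62 1215230807 I H Firmware Event",
]

parsed = []
for text in samples:
    result = logline.parseString(text)
    parsed.append({
        "identifier": result["identifier"],
        "time": result["time"],
        # .get() with a fallback, in case no resource name was recorded
        "resource": result.get("resource", ""),
        # strip any whitespace that restOfLine may carry along
        "description": result.get("description", "").strip(),
    })

for record in parsed:
    print(record)
```

The second row has no recognised resource name, so the whole remainder ends up in the description field.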

Another tack to take might be to use a parse action on the resource
name, to verify the column position of the found token by using the
pyparsing method col:

def matchOnlyAtCol(n):
    def verifyCol(strg,locn,toks):
        if col(locn,strg) != n:
            raise ParseException(strg,locn,"matched token not at column %d" % n)
    return verifyCol

resource_name = Word(alphas).setParseAction(matchOnlyAtCol(35))

This will only work if your data really is columnar - the example text
that you posted isn't.  (Hmm, I like that matchOnlyAtCol method, I
think I'll add that to the next release of pyparsing...)
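
The column check can be tried out on synthetic fixed-width data. In the sketch below, the helper names only_at_col and fixed_width, the field widths, and the choice of column 35 are all assumptions made for illustration. The real layout would have to be measured from the actual log file.

```python
from pyparsing import (Word, alphas, alphanums, nums, oneOf, Optional,
                       restOfLine, col, ParseException)

def only_at_col(n):
    # Parse action factory: reject a match that does not start at column n.
    def verify_col(strg, locn, toks):
        if col(locn, strg) != n:
            raise ParseException(strg, locn,
                                 "matched token not at column %d" % n)
    return verify_col

# Assumed layout: RESOURCE_NAME starts at column 35 (1-based).
resource_name = Word(alphas).setParseAction(only_at_col(35))

logline = (Word(alphanums)("identifier") + Word(nums, exact=10)("time")
           + oneOf("T I")("T") + oneOf("S H")("C")
           + Optional(resource_name, default="")("resource")
           + Optional(restOfLine, default="")("description"))

def fixed_width(ident, time, t, c, resource, description):
    # The first four fields occupy 34 characters, so the fifth starts at
    # column 35. These widths are invented for the example.
    return "%-10s%-12s%-4s%-8s%-27s%s" % (ident, time, t, c,
                                          resource, description)

with_resource = logline.parseString(
    fixed_width("2BFA76F6", "1208230607", "T", "S",
                "SYSPROC", "SYSTEM SHUTDOWN BY USER"))
without_resource = logline.parseString(
    fixed_width("A6D1BD62", "1215230807", "I", "H",
                "", "Firmware Event"))

print(with_resource["resource"], "|", with_resource["description"].strip())
print(without_resource.get("resource", ""), "|",
      without_resource["description"].strip())
```

In the second row the word "Firmware" is found well past column 35, so the parse action rejects it, the Optional falls back to its default, and the text is left for the description.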

Here are some similar parsers that might give you some other ideas:
http://pyparsing.wikispaces.com/space/showimage/httpServerLogParser.py
http://mail.python.org/pipermail/python-list/2005-January/thread.html#301450

In the second link, I made a similar remark, that pyparsing may not be
the first tool to try, but the variability of the input file made the
non-pyparsing options pretty hairy-looking with special case code, so
in the end, pyparsing was no more complex to use.

Good luck!
-- Paul