pyparsing question
Paul McGuire
ptmcg at austin.rr.com
Wed Jan 2 03:50:15 EST 2008
On Jan 1, 5:32 pm, hubritic <colinland... at gmail.com> wrote:
> I am trying to parse data that looks like this:
>
> IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
> 2BFA76F6 1208230607 T S SYSPROC SYSTEM SHUTDOWN BY USER
> A6D1BD62 1215230807 I H Firmware Event
>
<snip>
> The data I have has a fixed number of characters per field, so I could
> split it up that way, but wouldn't that defeat the purpose of using a
> parser?
I think you have this backwards. I use pyparsing for a lot of text
processing, but if it is not a good fit, or if str.split is all that
is required, there is no real rationale for using anything more
complicated.
> I am determined to become proficient with pyparsing so I am
> using it even when it could be considered overkill; thus, it has gone
> past mere utility now, this is a matter of principle!
>
Well, I'm glad you are driven to learn pyparsing if it kills you, but
John Machin has a good point. This data is really so amenable to
something as simple as:
    for line in logfile:
        id,timestamp,t,c,resource_and_description = line.split(None,4)
that it is difficult to recommend pyparsing for this case. The sample
you posted was space-delimited, but if it is tab-delimited, and there
is a pair of tabs between the "H" and "Firmware Event" on the second
line, then just use split("\t") for your data and be done.
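As a rough sketch of the split-based approach, assuming each record sits on a single line with whitespace-separated fields (the two sample lines are the quoted records joined onto one line each, and split_log_line is just an illustrative helper name):

```python
# Sample records from the quoted post, each assumed to be one line.
sample_lines = [
    "2BFA76F6 1208230607 T S SYSPROC SYSTEM SHUTDOWN BY USER\n",
    "A6D1BD62 1215230807 I H Firmware Event\n",
]

def split_log_line(line):
    """Split off the four leading fields; keep the remainder as one string."""
    ident, timestamp, t, c, rest = line.split(None, 4)
    # With a maxsplit, the last piece keeps its trailing whitespace.
    return ident, timestamp, t, c, rest.strip()

records = [split_log_line(line) for line in sample_lines]
for record in records:
    print(record)
```

A line with fewer than five fields would raise ValueError on the unpacking, and the resource name and description are still lumped together in the last field.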
Still, pyparsing may be helpful in disambiguating that RESOURCE_NAME
and DESCRIPTION text. One approach would be to enumerate (if
possible) the different values of RESOURCE_NAME. Something like this:
    from pyparsing import Word, alphanums, nums, oneOf, Optional, restOfLine

    ident = Word(alphanums)
    timestamp = Word(nums,exact=10)
    # I don't know what these are, I'm just getting the values
    # from the sample text you posted
    t_field = oneOf("T I")
    c_field = oneOf("S H")
    # I'm just guessing here, you'll need to provide the actual
    # values from your log file
    resource_name = oneOf("SYSPROC USERPROC IOSUBSYS whatever")
    logline = ident("identifier") + timestamp("time") + \
              t_field("T") + c_field("C") + \
              Optional(resource_name, default="")("resource") + \
              Optional(restOfLine, default="")("description")
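To try this grammar on the sample records, something like the following should work. It is only a sketch: the two input strings are the quoted records joined onto single lines, the RESOURCE_NAME values are still guesses, and the "whatever" placeholder is left out. The description may come back with leading whitespace, so it is stripped before use.

```python
from pyparsing import Word, alphanums, nums, oneOf, Optional, restOfLine

ident = Word(alphanums)
timestamp = Word(nums, exact=10)
t_field = oneOf("T I")
c_field = oneOf("S H")
# Guessed values; replace with the real ones from the log file.
resource_name = oneOf("SYSPROC USERPROC IOSUBSYS")

logline = (ident("identifier") + timestamp("time")
           + t_field("T") + c_field("C")
           + Optional(resource_name, default="")("resource")
           + Optional(restOfLine, default="")("description"))

samples = [
    "2BFA76F6 1208230607 T S SYSPROC SYSTEM SHUTDOWN BY USER",
    "A6D1BD62 1215230807 I H Firmware Event",
]

parsed = [logline.parseString(s) for s in samples]
for result in parsed:
    # Fields are available by the results names given in the grammar.
    print(result["identifier"], result["time"],
          repr(result.get("resource", "")),
          result.get("description", "").strip())
```

In the second record "Firmware" is not one of the enumerated resource names, so the whole remainder ends up in the description.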
Another tack to take might be to use a parse action on the resource
name, to verify the column position of the found token by using the
pyparsing method col:
    from pyparsing import Word, alphas, col, ParseException

    def matchOnlyAtCol(n):
        def verifyCol(strg,locn,toks):
            if col(locn,strg) != n:
                raise ParseException(strg,locn,"matched token not at column %d" % n)
        return verifyCol

    resource_name = Word(alphas).setParseAction(matchOnlyAtCol(35))
This will only work if your data really is columnar - the example text
that you posted isn't. (Hmm, I like that matchOnlyAtCol method, I
think I'll add that to the next release of pyparsing...)
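A small sketch of the column check in action. The layout here is hypothetical (an 8-character identifier followed by a name that must start at column 10), chosen only so that the example is short; it is not the layout of the actual log.

```python
from pyparsing import Word, alphas, alphanums, col, ParseException

def matchOnlyAtCol(n):
    def verifyCol(strg, locn, toks):
        # col() reports the 1-based column of locn within its line
        if col(locn, strg) != n:
            raise ParseException(strg, locn,
                                 "matched token not at column %d" % n)
    return verifyCol

# Hypothetical layout: identifier first, then a name starting at column 10.
ident = Word(alphanums)
name_at_col_10 = Word(alphas).setParseAction(matchOnlyAtCol(10))
record = ident("identifier") + name_at_col_10("resource")

aligned = "2BFA76F6 SYSPROC"     # "SYSPROC" starts at column 10
shifted = "2BFA76F6   SYSPROC"   # "SYSPROC" starts at column 12

print(record.parseString(aligned)["resource"])

try:
    record.parseString(shifted)
    shifted_rejected = False
except ParseException as err:
    shifted_rejected = True
    print("rejected:", err)
```

The parse action fails the match when the token is out of position, so a misaligned name is rejected instead of being accepted silently.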
Here are some similar parsers that might give you some other ideas:
http://pyparsing.wikispaces.com/space/showimage/httpServerLogParser.py
http://mail.python.org/pipermail/python-list/2005-January/thread.html#301450
In the second link, I made a similar remark, that pyparsing may not be
the first tool to try, but the variability of the input file made the
non-pyparsing options pretty hairy-looking with special case code, so
in the end, pyparsing was no more complex to use.
Good luck!
-- Paul