[Tutor] Parse Text File

spir denis.spir at free.fr
Thu Jun 11 12:40:49 CEST 2009


[Hope you don't mind that I copy this to the list. Not only can it help others, but pyparsing users read tutor, including Paul McGuire (the author).]

On Thu, 11 Jun 2009 11:53:31 +0200,
Stefan Lesicnik <stefan at lsd.co.za> wrote:

[...]

I cannot really answer precisely, since I haven't used pyparsing for a while (*).

So, below are only some hints.

> Hi Denis,
> 
> Thanks for your input. So i decided i should use a pyparser and try it (im a
> relative python noob though!)
> 
> This is what i have so far...
> 
> import sys
> from pyparsing import alphas, nums, ZeroOrMore, Word, Group, Suppress,
> Combine, Literal, alphanums, Optional, OneOrMore, SkipTo, printables
> 
> text='''
> [04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
>         {CVE-2009-0023 CVE-2009-1955}
>         [etch] - apr-util 1.2.7+dfsg-2+etch2
>         [lenny] - apr-util 1.2.12+dfsg-8+lenny2
> '''
> 
> date = Combine(Literal('[') + Word(nums, exact=2) + Word(alphas) +
> Word(nums, exact=4) + Literal(']'),adjacent=False)
> dsa = Combine(Word(alphanums) + Literal('-') + Word(nums, exact=4) +
> Literal('-') + Word(nums, exact=1),adjacent=False)
> app = Combine(OneOrMore(Word(printables)) + SkipTo(Literal('-')))
> desc = Combine(Literal('-') + ZeroOrMore(Word(alphas)) +
> SkipTo(Literal('\n')))
> cve = Combine(Literal('{') + OneOrMore(Literal('CVE') + Literal('-') +
> Word(nums, exact=4) + Literal('-') + Word(nums, exact=4)) )
> 
> record = date + dsa + app + desc + cve
> 
> fields = record.parseString(text)
> #fields = dsa.parseString(text)
> print fields
> 
> 
> What i get out of this is
> 
> ['[04Jun2009]', 'DSA-1812-1', 'apr-util ', '- several vulnerabilities',
> '{CVE-2009-0023']
> 
> Which i guess it heading towards the right track...

For sure! Rather impressive that you could write this so fast. Hope my little PEG grammar helped.
There seem to be some issues of detail; for instance, in the app pattern I would write
   ...+ SkipTo(Literal(' - '))
Also, you could directly Suppress() delimiters that are probably useless in the result, such as the [...] in date.
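
A minimal sketch of these two hints, assuming a pyparsing version that accepts the camelCase names used in your code. The names date_fields and app_name are only illustrative:

```python
from pyparsing import Literal, SkipTo, Suppress, Word, alphas, nums

# The brackets are matched but dropped from the results.
date = (Suppress('[') + Word(nums, exact=2) + Word(alphas)
        + Word(nums, exact=4) + Suppress(']'))

# The package name is everything up to the ' - ' separator, so the
# hyphen inside "apr-util" does not end the match early.
app = SkipTo(Literal(' - '))

date_fields = date.parseString('[04 Jun 2009] DSA-1812-1').asList()
app_name = app.parseString('apr-util - several vulnerabilities')[0]

print(date_fields)
print(app_name)
```

With the brackets suppressed, the date comes back as three plain tokens that are easy to post-process.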

Think of post-parse functions to transform and/or reformat nodes: search for setParseAction() and addParseAction() in the docs.
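
As a rough illustration of a parse action, here is a sketch that turns the matched date tokens into a datetime.date. The helper to_date and the MONTHS list are my own names, and I assume the month is always a three-letter English abbreviation:

```python
import datetime
from pyparsing import Suppress, Word, alphas, nums

# Assumption: months always appear as English three-letter abbreviations.
MONTHS = 'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec'.split()

date = (Suppress('[') + Word(nums, exact=2) + Word(alphas)
        + Word(nums, exact=4) + Suppress(']'))

def to_date(tokens):
    # tokens holds the three matched strings, e.g. ['04', 'Jun', '2009'];
    # the returned object replaces them in the parse results.
    day, month, year = tokens
    return datetime.date(int(year), MONTHS.index(month) + 1, int(day))

date.setParseAction(to_date)

parsed = date.parseString('[04 Jun 2009]')[0]
print(parsed)
```

The value returned by the action replaces the original tokens, so the rest of the program can work with a real date instead of three strings.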

> I am unsure why I am not getting more than 1 CVE... I have the OneOrMore
> match for the CVE stuff...

This is due to Combine(), which glues the matched string bits (back) together. To work safely, it disables pyparsing's default separator-skipping behaviour inside the combined expression, so that
   real = Combine(integral+fractional)
correctly does not match "1 .2". Right? In your cve pattern the whole "{...}" list sits inside a single Combine(), so the space before the second CVE stops the match.
See a recent reply by Paul McGuire about this topic on the pyparsing list http://sourceforge.net/mailarchive/forum.php?thread_name=FE0E2B47198D4F73B01E263034BDCE3C%40AWA2&forum_name=pyparsing-users and the pointer he gives there.
There are several ways to cope with that correctly.
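
One possible way, sketched here as an assumption rather than as the fix Paul describes: wrap each single CVE id in its own Combine() and put the repetition outside of it, so that whitespace between ids is skipped again. The definitions of integral and fractional are my guess at what such a grammar would contain:

```python
from pyparsing import (Combine, Literal, OneOrMore, ParseException,
                       Suppress, Word, nums)

# Inside Combine(), no whitespace is skipped between the parts.
integral = Word(nums)
fractional = Literal('.') + Word(nums)
real = Combine(integral + fractional)

glued = real.parseString('1.2')[0]
try:
    real.parseString('1 .2')
    matched_with_gap = True
except ParseException:
    matched_with_gap = False

# Each id is combined on its own; OneOrMore sits outside Combine(),
# so the space between two ids is skipped as usual.
cve = Combine(Literal('CVE-') + Word(nums, exact=4)
              + Literal('-') + Word(nums, exact=4))
cve_list = Suppress('{') + OneOrMore(cve) + Suppress('}')

cves = cve_list.parseString('{CVE-2009-0023 CVE-2009-1955}').asList()
print(glued, matched_with_gap, cves)
```

Passing adjacent=False to Combine(), as you already do for date and dsa, would be another option, but then the ids get glued together without any separator.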

> That being said, how does the parser scale across multiple lines and how
> will it know that its finished?

Basically, you should probably express line breaks explicitly, especially because they seem to be part of the source format.
Otherwise, there is a function or method to define which characters are skipped as separators (by default pyparsing skips spaces, tabs and line breaks, if I remember well).
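
A small sketch of the difference, assuming the per-element method setWhitespaceChars() and the LineEnd() element; the sample text and the variable names are made up for illustration:

```python
from pyparsing import Group, LineEnd, OneOrMore, Suppress, Word, alphas

text = 'etch lenny\nsid'

# With the default settings the newline is skipped like a space,
# so the repetition runs across the line break.
word = Word(alphas)
across_lines = OneOrMore(word).parseString(text).asList()

# Skipping only space and tab makes the newline significant; each line
# must then be closed explicitly with LineEnd().
line_word = Word(alphas).setWhitespaceChars(' \t')
line = Group(OneOrMore(line_word)) + Suppress(LineEnd())
per_line = OneOrMore(line).parseString(text).asList()

print(across_lines)
print(per_line)
```

ParserElement.setDefaultWhitespaceChars() does the same thing globally for elements created after the call, which is convenient but easy to forget about later.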

> Should i maybe look at getting the list first into one entry per line? (must
> be easier to parse then?)

What makes sense, I guess, is Group()-ing items that *conceptually* build a list. In your case, I see:
* CVE items inside {...}
* version entry lines ("[etch]...", "[lenny]...", ...)
* whole records at a higher level
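
The three levels above could be sketched as follows. This is untested against the real advisory list and only assumes that every record looks like your sample; all pattern names besides yours (version, versions, records) are mine, and grouping the date is an extra choice of my own:

```python
from pyparsing import (Combine, Group, LineEnd, Literal, OneOrMore, SkipTo,
                       Suppress, Word, alphas, nums, printables)

text = '''
[04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
        {CVE-2009-0023 CVE-2009-1955}
        [etch] - apr-util 1.2.7+dfsg-2+etch2
        [lenny] - apr-util 1.2.12+dfsg-8+lenny2
'''

date = Group(Suppress('[') + Word(nums, exact=2) + Word(alphas)
             + Word(nums, exact=4) + Suppress(']'))
dsa = Combine(Literal('DSA-') + Word(nums) + Literal('-') + Word(nums))
app = SkipTo(Literal(' - '))
desc = Suppress('-') + SkipTo(LineEnd())

# Level 1: the CVE items inside {...}
cve = Combine(Literal('CVE-') + Word(nums, exact=4)
              + Literal('-') + Word(nums, exact=4))
cves = Group(Suppress('{') + OneOrMore(cve) + Suppress('}'))

# Level 2: the version entry lines, one group per line
version = Group(Suppress('[') + Word(alphas) + Suppress(']')
                + Suppress('-') + Word(printables) + Word(printables))
versions = Group(OneOrMore(version))

# Level 3: whole records
record = Group(date + dsa + app + desc + cves + versions)
records = OneOrMore(record)

result = records.parseString(text).asList()
print(result)
```

Each Group() shows up as a nested list in the result, so the output structure mirrors the structure of the source.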

> This parsing is a mini language in itself!

Sure! A rather big and complex parsing language. It is hard to know it all well (and I'm not even speaking of all the builtin helpers, let alone everything you can do by mixing ordinary Python code into the grammar/parser: a whole new field in parsing/processing).

> Thanks for your input :)

My pleasure...

> Stefan

Denis

(*) The reason is that I'm developing my own parsing tool; see http://spir.wikidot.com/pijnu.
The guide is also intended as a parsing tutorial; it may help, but it is not exactly up to date.
------
la vita e estrany

