pyparsing: match empty line

Wed Sep 3 01:14:51 EDT 2008

On Sep 2, 11:38 am, Marek Kubica <ma... at xivilization.net> wrote:
> Hi,
>
> I am trying to get this stuff working, but I still fail.
>
> I have a format which consists of three elements:
> \d{4}M?-\d (4 numbers, optional M, dash, another number)
> EMPTY (the <EMPTY> token)
> [Empty line] (the <PAGEBREAK> token. The line may contain whitespaces,
> but nothing else)
>

<snip>

Marek -

Here are some refinements to your program that will get you closer to
your posted results.

1) Well done in resetting the default whitespace characters, since you
are doing some parsing that is dependent on the presence of line
ends.  When you do this, it is useful to define an expression for end
of line so that you can reference it where you explicitly expect to
find line ends:

    EOL = LineEnd().suppress()

2) Your second test fails because there is an EOL between the two
watchnames.  Since you have removed EOL from the set of default
whitespace characters (that is, whitespace that pyparsing will
automatically skip over), pyparsing stops after reading the first
watchname.  I think you want EOLs to be consumed when nothing else
matches, so add EOL as the last alternative in your grammar definition:

    parser = OneOrMore(watchname ^ pagebreak ^ leaveempty ^ EOL)

This will now permit the second test to pass.
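
To see the effect of points 1 and 2 in isolation, here is a minimal
sketch.  Since your watchname definition is not shown here, it uses a
hypothetical stand-in expression named number; the input text is made
up as well.

```python
from pyparsing import LineEnd, OneOrMore, ParserElement, Word, nums

# Skip only spaces and tabs, so that newlines become significant.
# This must be called before any expressions are created.
ParserElement.setDefaultWhitespaceChars(" \t")

EOL = LineEnd().suppress()
number = Word(nums)  # hypothetical stand-in for watchname

text = "12\n34\n"

# Without EOL in the grammar, parsing stops at the first newline.
without_eol = OneOrMore(number).parseString(text).asList()

# With EOL as an alternative, the newline is consumed (and suppressed),
# so the second item is reached.
with_eol = OneOrMore(number | EOL).parseString(text).asList()

print(without_eol)  # ['12']
print(with_eol)     # ['12', '34']
```

Note that parseString does not complain about the unparsed remainder
unless you pass parseAll=True, which is why the first call quietly
returns a single item.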

3) Your definition of pagebreak looks okay now, but I don't understand
why your test containing 2 blank lines is only supposed to generate a
single <PAGEBREAK>.

    pagebreak = LineStart() + LineEnd().setParseAction(replaceWith('<PAGEBREAK>'))

If you really want to get only a single <PAGEBREAK> from your test
case, then change pagebreak to:

    pagebreak = OneOrMore(LineStart() + LineEnd()).setParseAction(replaceWith('<PAGEBREAK>'))

4) leaveempty probably needs this parse action to be attached to it:

    leaveempty = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))
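
A quick way to check what this parse action does, as a small
standalone sketch (the input string is just an illustration):

```python
from pyparsing import Literal, replaceWith

# replaceWith returns a parse action that discards the matched tokens
# and substitutes the given value.
leaveempty = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))

result = leaveempty.parseString('EMPTY').asList()
print(result)  # ['<EMPTY>']
```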

5) (optional) Your definition of parser uses '^' operators, which
translate into Or expressions.  Or expressions evaluate all the
alternatives, and then choose the longest match.  The expressions you
have don't really have any ambiguity to them, and could be evaluated
using:

    parser = OneOrMore(watchname | pagebreak | leaveempty | EOL)

'|' operators generate MatchFirst expressions.  MatchFirst will do
short-circuit evaluation - the first expression that matches will be
the one chosen as the matching alternative.
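
The difference only shows up when two alternatives can match at the
same position.  Here is a small sketch using two hypothetical literals
(not part of your grammar) where one is a prefix of the other:

```python
from pyparsing import Literal

# Hypothetical overlapping alternatives, for illustration only.
short = Literal("EMPTY")
longer = Literal("EMPTYLINE")

# '^' builds an Or: every alternative is tried, the longest match wins.
or_result = (short ^ longer).parseString("EMPTYLINE").asList()

# '|' builds a MatchFirst: the first alternative that matches wins,
# even though a later one would have matched more of the input.
first_result = (short | longer).parseString("EMPTYLINE").asList()

print(or_result)     # ['EMPTYLINE']
print(first_result)  # ['EMPTY']
```

With '|', the order of the alternatives matters, so list the more
specific expressions first if any of them overlap.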

If you have more pyparsing questions, you can also post them on the
pyparsing wiki - the Discussion tab on the wiki Home page has become a
running support forum - and there is also a Help/Discussion mailing
list.

Cheers,
-- Paul