pyparsing wrong output

Sat Feb 13 00:28:03 EST 2010

On Feb 12, 6:41 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
wrote:
> On Fri, 12 Feb 2010 10:41:40 -0300, Eknath Venkataramani
> <eknath.i... at gmail.com> wrote:
>
> > I am trying to write a parser in pyparsing.
> > Help Me. http://paste.pocoo.org/show/177078/ is the code and this is
> > input
> > file: http://paste.pocoo.org/show/177076/.
> > I get output as:
> > <generator object at 0xb723b80c>
>
> There is nothing wrong with pyparsing here. scanString() returns a  
> generator, like this:
>
> py> g = (x for x in range(20) if x % 3 == 1)
> py> g
> <generator object <genexpr> at 0x00E50D78>
>
Unfortunately, your grammar doesn't match the input text, so your
generator doesn't return anything.
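
To see what a generator like this actually holds, you have to iterate
over it (or wrap it in list()).  Here is a minimal sketch using a toy
integer expression of my own, not your grammar.  Each item that
scanString yields is a (tokens, start, end) tuple:

```python
from pyparsing import Word, nums

# Toy expression for illustration only, not the grammar from the paste.
integer = Word(nums)

# scanString() returns a generator; nothing is scanned until we iterate.
matches = integer.scanString("a 12 b 345")

found = []
for tokens, start, end in matches:
    # tokens is a ParseResults; start/end are offsets into the input string
    found.append((tokens[0], start, end))

print(found)
```

If the loop body never runs, the expression never matched anywhere in
the input, which is what is happening with your grammar.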

I think you are taking a sort of brute-force approach to this problem,
and you need to think a little more abstractly.  You can't just pick a
fragment, write an expression for it, then do the same for the next
fragment, and stitch them all together - well, you *can*, but it helps
to think both abstractly and concretely at the same time.

With the exception of your one key of "\'", this is a pretty basic
recursive grammar.  Recursive grammars are a little complicated to
start with, so I'll start with a non-recursive part.  And I'll work
more bottom-up or inside-out.

Let's start by looking at these items:

    count => 8,
    baajaar => 0.87628353,
    kiraae => 0.02341598,
    lii => 0.02178813,
    adr => 0.01978462,
    gyiimn => 0.01765590,

Each item has a name (which you called "eng", so I'll keep that
expression), a '=>' and *something*.  The '=>' strings aren't part of
the keys or the associated values; they are just delimiters.  They are
important during parsing, but afterwards we don't care about them, so
we suppress them.  We'll start with a pyparsing expression for this:

    keyval = eng + Suppress('=>') + something

Sometimes, the something is an integer, sometimes it's a floating
point number.  I'll define some more generic forms for these than your
original number, and a separate expression for a real number:

    integer = Combine(Optional('-') + Word(nums))
    realnum = Combine(Optional('-') + Word(nums) + '.' + Word(nums))

When we parse for these two, we need to be careful to check for a
realnum before an integer, so that we don't accidentally parse the
leading digit of "3.1415" as the integer "3".

    something = realnum | integer
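
A quick way to convince yourself that the order matters is to try both
orderings on the same input.  This is just a small sketch; the names
right_order and wrong_order are made up for the comparison:

```python
from pyparsing import Combine, Optional, Word, nums

integer = Combine(Optional('-') + Word(nums))
realnum = Combine(Optional('-') + Word(nums) + '.' + Word(nums))

# '|' tries the alternatives left to right and takes the first that matches.
right_order = realnum | integer
wrong_order = integer | realnum

good = right_order.parseString("3.1415")[0]
bad = wrong_order.parseString("3.1415")[0]   # integer matches first and stops at '.'

print(good)
print(bad)
```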

So now we can parse this fragment using a delimitedList expression
(which takes care of the intervening commas, and also suppresses them
from the results):

    filedata = """
        count => 8,
        baajaar => 0.87628353,
        kiraae => 0.02341598,
        lii => 0.02178813,
        adr => 0.01978462,
        gyiimn => 0.01765590,"""
    print(delimitedList(keyval).parseString(filedata))

Gives:
    ['count', '8', 'baajaar', '0.87628353', 'kiraae', '0.02341598',
     'lii', '0.02178813', 'adr', '0.01978462', 'gyiimn', '0.01765590']

Right off the bat, we see that we want a little more structure to
these results, so that the keys and values are grouped naturally by
the parser.  The easy way to do this is with Group, as in:

    keyval = Group(eng + Suppress('=>') + something)

With this one change, we now get:

    [['count', '8'], ['baajaar', '0.87628353'],
     ['kiraae', '0.02341598'], ['lii', '0.02178813'],
     ['adr', '0.01978462'], ['gyiimn', '0.01765590']]
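
Here is the non-recursive part as one self-contained sketch, so you can
compare the flat and grouped results side by side.  I am assuming eng
is simply Word(alphas); your original definition may differ:

```python
from pyparsing import (Combine, Group, Optional, Suppress, Word,
                       alphas, nums, delimitedList)

eng = Word(alphas)   # assumption: keys are plain alphabetic words
integer = Combine(Optional('-') + Word(nums))
realnum = Combine(Optional('-') + Word(nums) + '.' + Word(nums))
something = realnum | integer

flat_keyval = eng + Suppress('=>') + something
grouped_keyval = Group(eng + Suppress('=>') + something)

data = "count => 8, baajaar => 0.87628353, kiraae => 0.02341598,"

# Without Group, keys and values run together in one flat list.
flat = delimitedList(flat_keyval).parseString(data).asList()
# With Group, each key-value pair becomes its own sublist.
grouped = delimitedList(grouped_keyval).parseString(data).asList()

print(flat)
print(grouped)
```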

Now we need to add the recursive part of your grammar.  A nested input
looks like:

    confident => {
      count => 4,
      trans => {
        ashhvsht => 0.75100505,
        phraarmnbh => 0.08341708,
        },
    },

So in addition to integers and reals, our "something" could also be a
nested list of keyvals:

    something = (realnum | integer |
        (lparen + delimitedList(keyval) + rparen))

This is *almost* right, but it needs a few tweaks:
- the list of keyvals may have a comma after the last item before the
closing '}'
- we really want to suppress the opening and closing braces (lparen
and rparen)
- for similar structure reasons, we'll enclose the list of keyvals in
a Group to retain the data hierarchy

    lparen,rparen = map(Suppress, "{}")
    something = (realnum | integer |
        Group(lparen + delimitedList(keyval) + Optional(',') + rparen))

The recursive problem is that we have defined keyval using something,
and something using keyval.  You can't do that in Python.  So we use
the pyparsing class Forward to "forward" declare something:

    something = Forward()
    keyval = Group(eng + Suppress('=>') + something)

To define something as a Forward, we use the '<<' shift operator:

    something << (realnum | integer |
        Group(lparen + delimitedList(keyval) + Optional(',') +
              rparen))

Our grammar now looks like:

    lparen,rparen = map(Suppress, "{}")

    something = Forward()
    keyval = Group(eng + Suppress('=>') + something)
    something << (realnum | integer |
        Group(lparen + delimitedList(keyval) + Optional(',') +
              rparen))
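
Put together as a runnable sketch, the recursive grammar looks roughly
like this.  Two things here are my own choices rather than part of the
grammar above: eng is assumed to be Word(alphas), and the optional
trailing comma is suppressed so it stays out of the results.  I have
also written the Forward assignment with '<<=', which I believe newer
pyparsing releases prefer; on an older release use '<<' as shown above.

```python
from pyparsing import (Combine, Forward, Group, Optional, Suppress, Word,
                       alphas, nums, delimitedList)

eng = Word(alphas)   # assumption: keys are plain alphabetic words
integer = Combine(Optional('-') + Word(nums))
realnum = Combine(Optional('-') + Word(nums) + '.' + Word(nums))
lparen, rparen = map(Suppress, "{}")

# Forward lets keyval refer to something before something is defined.
something = Forward()
keyval = Group(eng + Suppress('=>') + something)
something <<= (realnum | integer |
               Group(lparen + delimitedList(keyval) +
                     Optional(Suppress(',')) + rparen))

data = """
    confident => {
      count => 4,
      trans => {
        ashhvsht => 0.75100505,
        phraarmnbh => 0.08341708,
        },
    },
"""

nested = delimitedList(keyval).parseString(data).asList()
print(nested)
```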

To parse your entire input file, use a delimitedList(keyval):

    results = delimitedList(keyval).parseString(filedata)

(There is one problem - one of your keynames is "\'".  I don't know if
this is a typo or intentional.  If you need to accommodate even this
as a keyname, just change your definition of eng to
Word(alphas + r"\'").)
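
As a small check of that widened eng, here is a sketch that parses the
odd key on its own.  The raw string r"\'" is two characters, a
backslash and an apostrophe, and both get added to the set of
characters allowed in a key:

```python
from pyparsing import Suppress, Word, alphas, nums

# Allow backslash and apostrophe in keys, in addition to letters.
eng = Word(alphas + r"\'")
value = Word(nums + '.')   # simplified value expression, just for this check

pair = eng + Suppress('=>') + value
odd = pair.parseString(r"\' => 0.01947301").asList()
plain = pair.parseString("jnmt => 0.01527414").asList()

print(odd)
print(plain)
```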

Now if I parse your original string, I get (using the pprint module to
format the results):

    [['markets',
      [['count', '8'],
       ['trans',
        [['baajaar', '0.87628353'],
         ['kiraae', '0.02341598'],
         ['lii', '0.02178813'],
         ['adr', '0.01978462'],
         ['gyiimn', '0.01765590'],
         ['baaaaromn', '0.01765590'],
         ['sdk', '0.01728024'],
         ['kaanuun', '0.00613574'],
         ',']],
       ',']],
     ['confident',
      [['count', '4'],
       ['trans',
        [['ashhvsht', '0.75100505'],
         ['phraarmnbh', '0.08341708'],
         ['athmvishhvaas', '0.08090452'],
         ['milte', '0.03768845'],
         ['utnii', '0.02110553'],
         ['anaa', '0.01432161'],
         ['jitne', '0.01155779'],
         ',']],
       ',']],
     ['consumers',
      [['count', '34'],
       ['trans',
        [['upbhokhtaaomn', '0.48493883'],
         ['upbhokhtaa', '0.27374792'],
         ['zrurtomn', '0.02753605'],
         ['suuchnaa', '0.02707965'],
         ['ghraahkomn', '0.02580174'],
         ['ne', '0.02574089'],
         ["\\'", '0.01947301'],
         ['jnmt', '0.01527414'],
         ',']],
       ',']]]

But there is one more card up pyparsing's sleeve.  Just as your
original parser used "english" to apply a results name to your keys,
it would be nice if our parser would return not a list of key-value
pairs, but an actual dict-like object.  Pyparsing's Dict class
enhances the results in just this way.  Use Dict to wrap our
repetitive structures, and it will automatically define results names
for us, reading the first element of each group as the key, and the
remaining items in the group as the value:

    something << (realnum | integer |
        Dict(lparen + delimitedList(keyval) +
                Optional(',').suppress() + rparen))

    results = Dict(delimitedList(keyval)).parseString(filedata)
    print(results.dump())

Gives this hierarchical structure:

- confident:
  - count: 4
  - trans:
    - anaa: 0.01432161
    - ashhvsht: 0.75100505
    - athmvishhvaas: 0.08090452
    - jitne: 0.01155779
    - milte: 0.03768845
    - phraarmnbh: 0.08341708
    - utnii: 0.02110553
- consumers:
  - count: 34
  - trans:
    - \': 0.01947301
    - ghraahkomn: 0.02580174
    - jnmt: 0.01527414
    - ne: 0.02574089
    - suuchnaa: 0.02707965
    - upbhokhtaa: 0.27374792
    - upbhokhtaaomn: 0.48493883
    - zrurtomn: 0.02753605
- markets:
  - count: 8
  - trans:
    - adr: 0.01978462
    - baaaaromn: 0.01765590
    - baajaar: 0.87628353
    - gyiimn: 0.01765590
    - kaanuun: 0.00613574
    - kiraae: 0.02341598
    - lii: 0.02178813
    - sdk: 0.01728024

You can access these fields by name like dict elements:

    print(list(results.keys()))
    print(list(results["confident"].keys()))
    print(results["confident"]["trans"]["jitne"])

If the names are valid Python identifiers (which "\'" is *not*), you
can access their fields like attributes of an object:

    print(results.confident.trans.jitne)
    for k in results.keys():
        print("%s %s" % (k, results[k].count))

Prints:

    ['confident', 'markets', 'consumers']
    ['count', 'trans']
    0.01155779
    0.01155779
    confident 4
    markets 8
    consumers 34
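
If you want to try the Dict version on a small input first, here is a
self-contained sketch.  As before, eng = Word(alphas) is my assumption,
'<<=' is used for the Forward assignment (use '<<' on an older
release), and the sample data is a cut-down piece of your file:

```python
from pyparsing import (Combine, Dict, Forward, Group, Optional, Suppress,
                       Word, alphas, nums, delimitedList)

eng = Word(alphas)   # assumption: keys are plain alphabetic words
integer = Combine(Optional('-') + Word(nums))
realnum = Combine(Optional('-') + Word(nums) + '.' + Word(nums))
lparen, rparen = map(Suppress, "{}")

something = Forward()
keyval = Group(eng + Suppress('=>') + something)
# Dict reads the first element of each group as a key for the rest.
something <<= (realnum | integer |
               Dict(lparen + delimitedList(keyval) +
                    Optional(',').suppress() + rparen))

data = """
    markets => {
      count => 8,
      trans => { baajaar => 0.87628353, kiraae => 0.02341598, },
    },
    confident => {
      count => 4,
      trans => { ashhvsht => 0.75100505, phraarmnbh => 0.08341708, },
    },
"""

results = Dict(delimitedList(keyval)).parseString(data)

top_keys = sorted(results.keys())
by_item = results["confident"]["trans"]["ashhvsht"]   # dict-style access
by_attr = results.confident.trans.phraarmnbh          # attribute-style access

print(top_keys)
print(by_item)
print(by_attr)
```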

I've posted the full program at http://pyparsing.pastebin.com/f1d0e2182.

Welcome to pyparsing!

-- Paul