Pyparsing help

Sat Mar 22 19:59:47 EDT 2008

On Mar 22, 4:11 pm, rh0dium <steven.kl... at gmail.com> wrote:
> Hi all,
>
> I am struggling with parsing the following data:
>
<snip>
> As a side note:  Is this the right approach to using pyparsing.  Do we
> start from the inside and work our way out or should I have started
> with looking at the bigger picture ( keyword + "{" + OneOrMore key /
> vals + "}" + )  I started there but could figure out how to look
> multiline - I'm assuming I'd just join them all up?
>
> Thanks

I think your "inside-out" approach is just fine.  Start by composing
expressions for the different "pieces" of your input text, then
steadily build up more and more complex forms.

I think the main complication you have is that of using
commaSeparatedList for your list of real numbers.  commaSeparatedList
is a very generic helper expression.  From the online example
(http://pyparsing.wikispaces.com/space/showimage/commasep.py), here is a
sample of the data that commaSeparatedList will handle:

    "a,b,c,100.2,,3",
    "d, e, j k , m  ",
    "'Hello, World', f, g , , 5.1,x",
    "John Doe, 123 Main St., Cleveland, Ohio",
    "Jane Doe, 456 St. James St., Los Angeles , California ",

In other words, the content of the items between commas is pretty much
anything that is *not* a comma.  If you change your definition of
atflist to:

    atflist = Suppress("(") + commaSeparatedList # + Suppress(")")

(that is, comment out the trailing right paren), you'll get this
successful parse result:

    ['0.21', '0.24', '0.6', '0.24', '0.24', '0.6)']

In your example, you are parsing a list of floating point numbers, in
a list delimited by commas, surrounded by parens.  This definition of
atflist should give you more control over the parsing process, and
give you real floats to boot:

    floatnum = Combine(Word(nums) + "." + Word(nums) +
        Optional('e'+oneOf("+ -")+Word(nums)))
    floatnum.setParseAction(lambda t:float(t[0]))
    atflist = Suppress("(") + delimitedList(floatnum) + Suppress(")")

Now I get this output for your parse test:

    [0.20999999999999999, 0.23999999999999999, 0.59999999999999998,
     0.23999999999999999, 0.23999999999999999, 0.59999999999999998]

So you can see that this has actually parsed the numbers and converted
them to floats.
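
For anyone who wants to run this in isolation, here is a small sketch
under the same assumed sample string (it is inferred from the output
above, not copied from the original data).  The long digit strings shown
above are how older Python versions displayed floats; Python 2.7 and 3.1
onward print the shortest string that round-trips, so the same values
appear as 0.21, 0.24 and so on.  The "%.17g" formatting at the end is
only there to reproduce the older display.

```python
from pyparsing import (Combine, Optional, Suppress, Word,
                       delimitedList, nums, oneOf)

# floatnum and atflist as defined above.
floatnum = Combine(Word(nums) + "." + Word(nums) +
    Optional("e" + oneOf("+ -") + Word(nums)))
floatnum.setParseAction(lambda t: float(t[0]))
atflist = Suppress("(") + delimitedList(floatnum) + Suppress(")")

# Assumed sample, inferred from the output shown above.
sample = "(0.21,0.24,0.6,0.24,0.24,0.6)"
values = atflist.parseString(sample).asList()
print(values)

# Every token is now a real float, not a string.
print(all(isinstance(v, float) for v in values))

# Formatting with 17 significant digits reproduces the older display.
old_style = ["%.17g" % v for v in values]
print(old_style)
```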

I went ahead and added support for scientific notation in floatnum,
since I see that you have several atfvalues that are standalone
floats, some using scientific notation.  To add these, just expand
atfvalues to:

    atfvalues = ( floatnum | Word(nums) | atfstr | atflist )
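
The order of the alternatives matters here.  The | operator builds a
MatchFirst, which takes the first alternative that matches, so floatnum
has to come before Word(nums).  The sketch below illustrates this with
two hypothetical variable names, good and bad, which are not part of the
original code.

```python
from pyparsing import Combine, Optional, Word, nums, oneOf

floatnum = Combine(Word(nums) + "." + Word(nums) +
    Optional("e" + oneOf("+ -") + Word(nums)))
floatnum.setParseAction(lambda t: float(t[0]))

# floatnum first: the whole number is matched and converted.
good = floatnum | Word(nums)
good_result = good.parseString("2.75e-05")[0]
print(good_result)

# Word(nums) first: it matches just the leading "2" and stops there,
# so floatnum never gets a chance at this input.
bad = Word(nums) | floatnum
bad_result = bad.parseString("2.75e-05")[0]
print(bad_result)

# A plain integer still falls through to Word(nums) and stays a string.
int_result = good.parseString("1000")[0]
print(int_result)
```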

(At this point, I'll go on to show how to parse the rest of the data
structure - if you want to take a stab at it yourself, stop reading
here, and then come back to compare your results with my approach.)

To parse the overall structure, now that you have expressions for the
different component pieces, look into using Dict (or more simply using
the helper function dictOf) to define results names automagically for
you based on the attribute names in the input.  Dict does *not* change
any of the parsing or matching logic, it just adds named fields in the
parsed results corresponding to the key names found in the input.

Dict is a complex pyparsing class, but dictOf simplifies things.
dictOf takes two arguments:

    dictOf(keyExpression, valueExpression)

This translates to:

    Dict( OneOrMore( Group(keyExpression + valueExpression) ) )

For example, to parse the lists of entries that look like:

    name                            = "gtc"
    dielectric                      = 2.75e-05
    unitTimeName                    = "ns"
    timePrecision                   = 1000
    unitLengthName                  = "micron"
    etc.

just define that this is "a dict of entries each composed of a key
consisting of a Word(alphas), followed by a suppressed '=' sign and an
atfvalues", that is:

    attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)
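
As a rough, self-contained sketch of attrDict at work on the entries
listed above: the definition of atfstr is not shown in this thread, so
the one below (a quoted string with its quotes stripped) is an
assumption, chosen because the parsed output prints string values
without quotes.

```python
from pyparsing import (Combine, Optional, Suppress, Word, alphas,
                       delimitedList, dictOf, nums, oneOf,
                       quotedString, removeQuotes)

floatnum = Combine(Word(nums) + "." + Word(nums) +
    Optional("e" + oneOf("+ -") + Word(nums)))
floatnum.setParseAction(lambda t: float(t[0]))
atflist = Suppress("(") + delimitedList(floatnum) + Suppress(")")

# Assumption: atfstr is a quoted string with the quotes removed.
atfstr = quotedString.copy().setParseAction(removeQuotes)

atfvalues = floatnum | Word(nums) | atfstr | atflist
attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)

sample = '''
    name                            = "gtc"
    dielectric                      = 2.75e-05
    unitTimeName                    = "ns"
    timePrecision                   = 1000
'''
attrs = attrDict.parseString(sample)
print(attrs.dump())

# Values are reachable by key or as attributes.
print(attrs["dielectric"])
print(attrs.unitTimeName)
```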

dictOf takes care of all of the repetition and grouping necessary for
Dict to do its work.  These attribute dicts are nested within an outer
main dict, which is "a dict of entries, each with a key of
Word(alphas), and a value of an optional quotedString (an alias,
perhaps?), a left brace, an attrDict, and a right brace," or:

    mainDict = dictOf(
        Word(alphas),
        Optional(quotedString)("alias") +
            Suppress("{") + attrDict + Suppress("}")
        )

Add this code to what you already have:

    attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)
    mainDict = dictOf(
        Word(alphas),
        Optional(quotedString)("alias") +
            Suppress("{") + attrDict + Suppress("}")
        )

You can now write:

    md = mainDict.parseString(test1)
    print(md.dump())
    print(md.Layer.lineStyle)

and get this output:

[['Technology', ['name', 'gtc'], ['dielectric',
2.7500000000000001e-005], ['unitTimeName', 'ns'], ['timePrecision',
'1000'], ['unitLengthName', 'micron'], ['lengthPrecision', '1000'],
['gridResolution', '5'], ['unitVoltageName', 'v'],
['voltagePrecision', '1000000'], ['unitCurrentName', 'ma'],
['currentPrecision', '1000'], ['unitPowerName', 'pw'],
['powerPrecision', '1000'], ['unitResistanceName', 'kohm'],
['resistancePrecision', '10000000'], ['unitCapacitanceName', 'pf'],
['capacitancePrecision', '10000000'], ['unitInductanceName', 'nh'],
['inductancePrecision', '100']], ['Tile', 'unit', ['width', 0.22],
['height', 1.6899999999999999]], ['Layer', 'PRBOUNDARY',
['layerNumber', '0'], ['maskName', ''], ['visible', '1'],
['selectable', '1'], ['blink', '0'], ['color', 'cyan'], ['lineStyle',
'solid'], ['pattern', 'blank'], ['pitch', '0'], ['defaultWidth', '0'],
['minWidth', '0'], ['minSpacing', '0']]]
- Layer: ['PRBOUNDARY', ['layerNumber', '0'], ['maskName', ''],
['visible', '1'], ['selectable', '1'], ['blink', '0'], ['color',
'cyan'], ['lineStyle', 'solid'], ['pattern', 'blank'], ['pitch', '0'],
['defaultWidth', '0'], ['minWidth', '0'], ['minSpacing', '0']]
  - alias: PRBOUNDARY
  - blink: 0
  - color: cyan
  - defaultWidth: 0
  - layerNumber: 0
  - lineStyle: solid
  - maskName:
  - minSpacing: 0
  - minWidth: 0
  - pattern: blank
  - pitch: 0
  - selectable: 1
  - visible: 1
- Technology: [['name', 'gtc'], ['dielectric',
2.7500000000000001e-005], ['unitTimeName', 'ns'], ['timePrecision',
'1000'], ['unitLengthName', 'micron'], ['lengthPrecision', '1000'],
['gridResolution', '5'], ['unitVoltageName', 'v'],
['voltagePrecision', '1000000'], ['unitCurrentName', 'ma'],
['currentPrecision', '1000'], ['unitPowerName', 'pw'],
['powerPrecision', '1000'], ['unitResistanceName', 'kohm'],
['resistancePrecision', '10000000'], ['unitCapacitanceName', 'pf'],
['capacitancePrecision', '10000000'], ['unitInductanceName', 'nh'],
['inductancePrecision', '100']]
  - capacitancePrecision: 10000000
  - currentPrecision: 1000
  - dielectric: 2.75e-005
  - gridResolution: 5
  - inductancePrecision: 100
  - lengthPrecision: 1000
  - name: gtc
  - powerPrecision: 1000
  - resistancePrecision: 10000000
  - timePrecision: 1000
  - unitCapacitanceName: pf
  - unitCurrentName: ma
  - unitInductanceName: nh
  - unitLengthName: micron
  - unitPowerName: pw
  - unitResistanceName: kohm
  - unitTimeName: ns
  - unitVoltageName: v
  - voltagePrecision: 1000000
- Tile: ['unit', ['width', 0.22], ['height', 1.6899999999999999]]
  - alias: unit
  - height: 1.69
  - width: 0.22
solid
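
Since the original input was snipped from this thread, here is a
self-contained sketch of the whole grammar run against a much shorter,
made-up sample.  The sample text and the atfstr definition are
assumptions pieced together from the output above; everything else
follows the expressions already shown.

```python
from pyparsing import (Combine, Optional, Suppress, Word, alphas,
                       delimitedList, dictOf, nums, oneOf,
                       quotedString, removeQuotes)

floatnum = Combine(Word(nums) + "." + Word(nums) +
    Optional("e" + oneOf("+ -") + Word(nums)))
floatnum.setParseAction(lambda t: float(t[0]))
atflist = Suppress("(") + delimitedList(floatnum) + Suppress(")")

# Assumption: quoted strings have their quotes stripped.
atfstr = quotedString.copy().setParseAction(removeQuotes)
atfvalues = floatnum | Word(nums) | atfstr | atflist

attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)
mainDict = dictOf(
    Word(alphas),
    Optional(atfstr)("alias") +
        Suppress("{") + attrDict + Suppress("}")
    )

# Made-up sample in the layout implied by the output above.
sample = '''
Technology {
    name       = "gtc"
    dielectric = 2.75e-05
}
Tile "unit" {
    width  = 0.22
    height = 1.69
}
Layer "PRBOUNDARY" {
    layerNumber = 0
    color       = "cyan"
    lineStyle   = "solid"
}
'''
md = mainDict.parseString(sample, parseAll=True)
print(md.dump())
print(md.Layer.lineStyle)
print(md["Tile"]["alias"], md["Tile"]["width"])
```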

Cheers!
-- Paul