regular expression extracting groups

Paul McGuire ptmcg at austin.rr.com
Sun Aug 10 12:04:49 EDT 2008


On Aug 10, 7:56 am, Paul Hankin <paul.han... at gmail.com> wrote:
> On Aug 10, 2:30 pm, clawsi... at gmail.com wrote:
>
> > I'm trying to use regular expressions to help me quickly extract the
> > contents of messages that my application will receive.
>
> Don't use regexps for parsing complex data; they're limited,
> completely unreadable, and hugely difficult to debug. Your code is
> well written, and you've already reached the limits of the power of
> regexps, and it's difficult to read.
>
> Have a look at pyparsing for a simple solution to your problem. http://pyparsing.wikispaces.com/
>
> --
> Paul Hankin

Well, predictably, the pyparsing solution is simple UNTIL we get to
the "multidict" options field.  Pyparsing has a Dict construct that
has the same limitation as Python's dict - for a repeated key, only the
last key-value pair would be retained.  So I had to write a parse action to manually
stitch the key-value groups into the parsed tokens' internal key-value
dict.
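
To make the limitation concrete, here is a small plain-Python sketch (no pyparsing involved). The sample pairs are modelled on the options of the second message in the output at the end of this post, and the helper name to_multidict is made up for illustration; it is not part of pyparsing.

```python
from collections import defaultdict

# Key-value pairs as they might come out of an options block
# that repeats the "option" key (illustrative data).
pairs = [("reconf", "newconf"),
         ("option", "interval"),
         ("option", "group[16]"),
         ("option", "filter[16]")]

# A plain dict keeps only the last value seen for each key.
last_wins = dict(pairs)

def to_multidict(pairs):
    """Collect every value for a key into a list (hypothetical helper)."""
    result = defaultdict(list)
    for key, value in pairs:
        result[key].append(value)
    return dict(result)

multi = to_multidict(pairs)
```

The parse action in the code below does the same kind of accumulation, but stores the values in the ParseResults object rather than in a separate dict.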

With the basic grammar implemented in pyparsing, it would now be very
easy to make some of these internal expressions optional (using
Optional wrappers), or parseable in any order (using '&' operator
instead of '+' - '&' enforces presence of all values, but in any
order).
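
As a rough sketch of those two variations, assuming pyparsing is installed: the field definitions and sample strings here are simplified stand-ins invented for illustration, not the grammar from the code below.

```python
from pyparsing import Literal, Optional, Suppress, Word, alphanums, nums

EQ = Suppress("=")

# Simplified, made-up field expressions; each value gets a results name.
hop = Literal("hop") + EQ + Word(nums)("hop")
source = Literal("source") + EQ + Word(alphanums + "-.")("source")
target = Literal("target") + EQ + Word(alphanums + "-.*")("target")

# '+' requires the fields in exactly this order; target may be left out.
ordered = hop + source + Optional(target)

# '&' requires every field to be present, but not in a fixed order.
any_order = hop & source & target

r1 = ordered.parseString("hop=1 source=acme-lamp.hall")
r2 = any_order.parseString("target=* source=acme-lamp.hall hop=1")
```

In both cases the values are still reachable by name, for example r2["target"], regardless of where the field appeared in the input.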

-- Paul


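# Note: the GROUP_* results names and msgdata are not defined in this
# post; they are assumed to come from the original poster's code.
from pyparsing import Suppress, Literal, Combine, oneOf, Word, alphanums, \
                        restOfLine, ZeroOrMore, Group, ParseResults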

LBRACE,RBRACE,EQ = map(Suppress,"{}=")
keylabel = lambda s : Literal(s) + EQ
grp_msg_type = Combine("xpl-" + oneOf("cmnd stat trig"))(GROUP_MESSAGE_TYPE)
grp_hop = keylabel("hop") + Word("123456789",exact=1)(GROUP_HOP)
grp_source = keylabel("source") + Combine(
                Word(alphanums,max=8)(GROUP_SRC_VENDOR_ID) + '-' +
                Word(alphanums,max=8)(GROUP_SRC_DEVICE_ID) + '.' +
                Word(alphanums,max=16)(GROUP_SRC_INSTANCE_ID)
                )(GROUP_SOURCE)
grp_target = keylabel("target") + Combine(
                '*' | Word(alphanums,max=8)(GROUP_TGT_VENDOR_ID) + '-' +
                Word(alphanums,max=8)(GROUP_TGT_DEVICE_ID) + '.' +
                Word(alphanums,max=16)(GROUP_TGT_INSTANCE_ID)
                )(GROUP_TARGET)
grp_schema = Combine(Word(alphanums,max=8)(GROUP_SCHEMA_CLASS) + '.' +
                        Word(alphanums,max=8)(GROUP_SCHEMA_TYPE)
                        )(GROUP_SCHEMA)

option_key = Word(alphanums+'-',max=16)
#~ option_val = Word(printables+' ',max=64)
option_val = restOfLine
options = (LBRACE +
    ZeroOrMore(Group(option_key("key") + EQ + option_val("value"))) +
    RBRACE)("options")

# this parse action will take the raw key=value groups and add them to
# the current results' named tokens
def make_options_dict(tokens):
    for k,v in tokens.asList():
        if k not in tokens:
            tokens[k] = ParseResults([])
        tokens[k] += ParseResults(v)
    # delete redundant key-value created by pyparsing
    del tokens["options"]
    return tokens
options.setParseAction(make_options_dict)

msgFormat = (grp_msg_type +
                LBRACE + grp_hop + grp_source + grp_target + RBRACE +
                grp_schema +
                options)

# parse each message
for msgstr in msgdata:
    msg = msgFormat.parseString(msgstr)
    #~ print msg.dump()
    print "Message type:", msg.message_type
    print "Hop:", msg.hop
    print "Options:"
    print msg.options.dump()
    print

Prints:

Message type: xpl-stat
Hop: 1
Options:
[['interval', '10']]
- interval: ['10']

Message type: xpl-stat
Hop: 1
Options:
[['reconf', 'newconf'], ['option', 'interval '],
  ['option', 'group[16]'], ['option', 'filter[16]']]
- option: ['interval ', 'group[16]', 'filter[16]']
- reconf: ['newconf']


