regular expression extracting groups

clawsicus at gmail.com
Sun Aug 10 08:30:23 EDT 2008


Hi list,

I'm trying to use regular expressions to help me quickly extract the
contents of messages that my application will receive. I have worked
out most of the regex but the last section of the message has me
stumped. This is mostly because I want to pull the content out into
regex groups that I can easily access later. I have a regex to extract
the key/value pairs but it ends up with only the contents of the last
key/value pair encountered.

An example of the section of the message that is troubling me appears
like this:

{
option=value
foo=bar
another=42
option=7
}

So it's basically a bunch of lines. Every line is terminated with a
'\n' character. The number of key/value fields changes depending on
the particular message. Also notice that there are two 'option' keys.
This is allowable and I need to cater for it.
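
To make the duplicate-key requirement concrete, here is a minimal sketch (not taken from my application; the names `section` and `parse_pairs` are made up for illustration). It reads such a section into a list of (key, value) tuples, on the assumption that a plain dict is not suitable because it would keep only one 'option' entry:

```python
# Illustrative only: `section` and `parse_pairs` are hypothetical names.
section = "{\noption=value\nfoo=bar\nanother=42\noption=7\n}\n"

def parse_pairs(text):
    """Return the key/value lines of a section as a list of tuples."""
    pairs = []
    for line in text.split("\n"):
        if "=" not in line:
            continue  # skip the braces and the trailing empty string
        key, _, value = line.partition("=")
        pairs.append((key, value))
    return pairs

pairs = parse_pairs(section)
print(pairs)

# A dict keeps one entry per key, so the first 'option' would be lost.
print(dict(pairs))
```

The list keeps both 'option' lines in message order, which is the behaviour I need from whatever the regex ends up producing.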


A couple of example messages are:
xpl-stat\n{\nhop=1\nsource=vendor-device.instance\ntarget=*\n}\nhbeat.basic\n{\ninterval=10\n}\n

xpl-stat\n{\nhop=1\nsource=vendor-device.instance\ntarget=vendor-device.instance\n}\nconfig.list\n{\nreconf=newconf\noption=interval\noption=group[16]\noption=filter[16]\n}\n


As all messages follow the same pattern, I'm hoping to develop one
generic regex that can pull a message out of a received packet,
instead of a separate regex for each message kind, of which there
are many.



The regex I came up with looks like this:
import re

# This should match any xPL message

GROUP_MESSAGE_TYPE = 'message_type'
GROUP_HOP = 'hop'
GROUP_SOURCE = 'source'
GROUP_TARGET = 'target'
GROUP_SRC_VENDOR_ID = 'source_vendor_id'
GROUP_SRC_DEVICE_ID = 'source_device_id'
GROUP_SRC_INSTANCE_ID = 'source_instance_id'
GROUP_TGT_VENDOR_ID = 'target_vendor_id'
GROUP_TGT_DEVICE_ID = 'target_device_id'
GROUP_TGT_INSTANCE_ID = 'target_instance_id'
GROUP_IDENTIFIER_TYPE = 'identifier_type'
GROUP_SCHEMA = 'schema'
GROUP_SCHEMA_CLASS = 'schema_class'
GROUP_SCHEMA_TYPE = 'schema_type'
GROUP_OPTION_KEY = 'key'
GROUP_OPTION_VALUE = 'value'


XplMessageGroupsRe = r'''(?P<%s>xpl-(cmnd|stat|trig))\n     # message type
   \{\n
   hop=(?P<%s>[1-9]{1})\n                                   # hop count
   source=(?P<%s>(?P<%s>[a-z0-9]{1,8})-(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,16}))\n  # source identifier
   target=(?P<%s>(\*|(?P<%s>[a-z0-9]{1,8})-(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,16})))\n  # target identifier
   \}\n
   (?P<%s>(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,8}))\n   # schema
   \{\n
   (?:(?P<%s>[a-z0-9\-]{1,16})=(?P<%s>[\x20-\x7E]{0,128})\n){1,64}   # key/value pairs
   \}\n''' % (GROUP_MESSAGE_TYPE,
              GROUP_HOP,
              GROUP_SOURCE,
              GROUP_SRC_VENDOR_ID,
              GROUP_SRC_DEVICE_ID,
              GROUP_SRC_INSTANCE_ID,
              GROUP_TARGET,
              GROUP_TGT_VENDOR_ID,
              GROUP_TGT_DEVICE_ID,
              GROUP_TGT_INSTANCE_ID,
              GROUP_SCHEMA,
              GROUP_SCHEMA_CLASS,
              GROUP_SCHEMA_TYPE,
              GROUP_OPTION_KEY,
              GROUP_OPTION_VALUE)

XplMessageGroups = re.compile(XplMessageGroupsRe, re.VERBOSE | re.DOTALL)


If I pass the second example message through this regex, the 'key'
group ends up containing 'option' and the 'value' group ends up
containing 'filter[16]', which is the last key/value pair in that
message.
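
The behaviour can be reproduced with a cut-down sketch that uses only the key/value part of the pattern. The names `pairs_re` and `body` are introduced here for illustration only, and the group names are written out literally instead of going through the GROUP_* constants:

```python
import re

# Cut-down pattern: just the repeated key/value part of the full regex.
pairs_re = re.compile(
    r'(?:(?P<key>[a-z0-9\-]{1,16})=(?P<value>[\x20-\x7E]{0,128})\n){1,64}')

# The key/value lines of the second example message.
body = ("reconf=newconf\noption=interval\n"
        "option=group[16]\noption=filter[16]\n")
m = pairs_re.match(body)

# The whole body is consumed by the match...
print(repr(m.group(0)))

# ...but each repetition overwrites the named groups, so only the
# final pair is still available afterwards.
print(m.group('key'))
print(m.group('value'))
```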

So the problem lies in the key/value section of the regex. It matches
multiple occurrences of the pattern but writes each one into the same
single key/value group, hence I can't extract and access all the
fields.

Is there some other way to do this which allows me to store all the
key/value pairs into the regex match object for later retrieval?
Perhaps using the standard unnamed numbered groups?
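
One workaround I can think of, offered only as a sketch and quite possibly not the best way, is a two-pass approach: capture the whole key/value block as a single group, then run a second regex over that captured text with findall. The names `block_re`, `pair_re` and the group name 'body' are invented for this illustration, and the outer pattern is cut down to the schema section only:

```python
import re

# Hypothetical cut-down pattern: the schema line, then the complete
# key/value block captured as one group named 'body'.
block_re = re.compile(r'''
    (?P<schema>[a-z0-9]{1,8}\.[a-z0-9]{1,8})\n                 # schema
    \{\n
    (?P<body>(?:[a-z0-9\-]{1,16}=[\x20-\x7E]{0,128}\n){1,64})  # all pairs
    \}\n''', re.VERBOSE)

# Second pass: pull the individual pairs out of the captured body.
pair_re = re.compile(r'([a-z0-9\-]{1,16})=([\x20-\x7E]{0,128})\n')

text = ("config.list\n{\nreconf=newconf\noption=interval\n"
        "option=group[16]\noption=filter[16]\n}\n")

m = block_re.match(text)
pairs = pair_re.findall(m.group('body'))

# findall returns one (key, value) tuple per line, duplicates included.
print(m.group('schema'))
print(pairs)
```

This keeps the single generic regex for validating the overall message shape, at the cost of scanning the key/value block twice. The pairs end up in a separate list and not in the match object itself.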

Thanks,
Chris


