pattern match

Alex Martelli aleax at aleax.it
Wed Apr 16 11:38:50 EDT 2003


Gabor Nagy wrote:

> I have a string, like '(FOO(1) BAR("2"))'
> 
> I'd like to have a { 'FOO':1, 'BAR':'2' }
> 
> Is there an easy way to do this, or do I have to write parsing, maybe use
> regexps, or something like that?

Some parsing is obviously needed -- turning a string that respects
some syntax into "more meaningful objects" (in some appropriate
context for "meaningful";-) is the DEFINITION of parsing.

You may or may not have to "write parsing", depending on what syntax
you need to deal with in your input strings -- and regular expressions
may or may not be the best tools for the conceptually lower parts of
the parsing job ("tokenizing").  If the tokens in your strings happen
to respect exactly the same rules as Python's tokens, in particular,
you can use the excellent tokenize module in the standard library...:

>>> st = 'FOO(1) BAR("2")'
>>> import tokenize
>>> import cStringIO
>>> for x in tokenize.generate_tokens(cStringIO.StringIO(st).readline):
...   print x
...
(1, 'FOO', (1, 0), (1, 3), 'FOO(1) BAR("2")')
(50, '(', (1, 3), (1, 4), 'FOO(1) BAR("2")')
(2, '1', (1, 4), (1, 5), 'FOO(1) BAR("2")')
(50, ')', (1, 5), (1, 6), 'FOO(1) BAR("2")')
(1, 'BAR', (1, 7), (1, 10), 'FOO(1) BAR("2")')
(50, '(', (1, 10), (1, 11), 'FOO(1) BAR("2")')
(3, '"2"', (1, 11), (1, 14), 'FOO(1) BAR("2")')
(50, ')', (1, 14), (1, 15), 'FOO(1) BAR("2")')
(0, '', (2, 0), (2, 0), '')
>>>

as you see, you need to provide generate_tokens with a "readline-like"
function -- when what you have to start with is a string, wrapping
a cStringIO.StringIO around the string and passing its bound
.readline method is generally simplest.
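(For reference, the same trick spelled with io.StringIO -- the
replacement for cStringIO in later Python versions -- just to show
that any zero-argument callable returning successive input lines
will serve as the "readline-like" function; a sketch:)

```python
import io
import tokenize

st = 'FOO(1) BAR("2")'
# io.StringIO's bound .readline is a zero-argument callable that
# returns successive lines -- exactly what generate_tokens wants.
for tok in tokenize.generate_tokens(io.StringIO(st).readline):
    print(tok)
```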

Each item x returned by looping on generate_tokens is a 5-item
tuple.  x[0] is a numeric token-type code -- 1 means an identifier,
50 means punctuation, 2 a number, and so on -- you can learn a bit
more by checking the dictionary tokenize.tok_name.  x[1] is the
substring of the input string that corresponds to the token.
Then you get indications of WHERE the token is in the input
string (where it starts and ends in terms of lines and columns)
and the input line itself -- stuff we don't care about here.

So here's a perhaps more readable way to show what's going on...:

>>> for x in tokenize.generate_tokens(cStringIO.StringIO(st).readline):
...   print '%12s %r'%(tokenize.tok_name[x[0]], x[1])
...
        NAME 'FOO'
          OP '('
      NUMBER '1'
          OP ')'
        NAME 'BAR'
          OP '('
      STRING '"2"'
          OP ')'
   ENDMARKER ''
>>>

this still doesn't give you the dict you want, of course.
But then, we don't yet know what syntax your strings are in --
you haven't told us!  If the syntax is simple enough, and in
particular if you don't need to diagnose ERRORS in the input
strings, the task might perhaps be trivial...:

>>> def makedict(st):
...   thedict = {}
...   for x in tokenize.generate_tokens(cStringIO.StringIO(st).readline):
...     whattok = tokenize.tok_name[x[0]]
...     if whattok == 'NAME': name=x[1]
...     elif whattok in ('NUMBER','STRING'): thedict[name]=eval(x[1])
...   return thedict
...
>>> makedict(st)
{'FOO': 1, 'BAR': '2'}
>>>

here I'm ignoring punctuation, and just assuming that, apart from
punctuation and whitespace, the string is an alternating sequence
of identifiers and values, where all values are numbers or strings.
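(A sketch of the same idea with minimal checking added, under the
same alternating name/value assumption; it also swaps eval for
ast.literal_eval -- available in later Python versions -- so only
literal values are accepted, never arbitrary expressions:)

```python
import ast
import io
import tokenize

def makedict_checked(st):
    # Same alternating NAME/value assumption as makedict above,
    # but with minimal error diagnosis and no use of eval.
    thedict = {}
    name = None
    for tok in tokenize.generate_tokens(io.StringIO(st).readline):
        kind = tokenize.tok_name[tok[0]]
        if kind == 'NAME':
            if name is not None:
                raise ValueError('two names in a row: %r' % tok[1])
            name = tok[1]
        elif kind in ('NUMBER', 'STRING'):
            if name is None:
                raise ValueError('value %r with no preceding name' % tok[1])
            # literal_eval only accepts literals -- safer than eval
            thedict[name] = ast.literal_eval(tok[1])
            name = None
    return thedict
```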

Of course, it isn't very hard to perform more thorough checks, so
that you can identify mistakes in the input string -- or accept a
richer syntax for said input string -- but you'll really need to
tell us more about the latter's syntax before we can offer any
further help in these respects.
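(If your tokens did NOT happen to follow Python's rules, a small
regex-based tokenizer -- as you mentioned -- would be the usual
fallback.  A sketch, with made-up token patterns that merely happen
to fit your one example, under the same alternating name/value
assumption:)

```python
import re

# Hypothetical token shapes: identifiers, plain integers, and
# double-quoted strings; adjust these patterns to your real syntax.
TOKEN_RE = re.compile(r'''
    (?P<name>[A-Za-z_]\w*)      # identifier
  | (?P<number>\d+)             # integer literal
  | (?P<string>"[^"]*")         # double-quoted string
''', re.VERBOSE)

def makedict_re(st):
    thedict = {}
    name = None
    for m in TOKEN_RE.finditer(st):   # punctuation is simply skipped
        if m.lastgroup == 'name':
            name = m.group()
        elif m.lastgroup == 'number':
            thedict[name] = int(m.group())
        else:  # string
            thedict[name] = m.group()[1:-1]  # strip the quotes
    return thedict
```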


Alex





More information about the Python-list mailing list