Parser or regex?

Fuzzyman fuzzyman at gmail.com
Fri Dec 16 09:38:54 EST 2005


Hello all,

I'm writing a module that takes user input as strings and (effectively)
translates them to function calls with arguments and keyword
arguments. To pass a list I use a sort of 'list constructor' - so the
syntax looks a bit like:

   checkname(arg1, "arg 2", 'arg 3', keywarg="value",
             keywarg2='value2', default=list("val1", 'val2'))

Worst case anyway :-)

I can handle this with regular expressions but they are becoming truly
horrible. I wonder if anyone has any suggestions on optimising them. I
could hand write a parser - which would be more code, probably slower -
but less error prone. (Regular expressions are subject to obscure
errors - especially the ones I create).

The trouble is that I have to pull out the separate arguments, then
pull apart the keyword arguments and the list keyword arguments. This
makes it a 'multi-pass' task - and I wondered if there was a better way
to do it.

As I use ``findall`` to pull out all the arguments, I also have to
use a *very similar* regex to first check that there are no errors (as
``findall`` will just skip badly formed parts of the input).
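
As a small illustration of that ``findall`` behaviour, here is a toy
pattern (double-quoted strings only, not the one used in the module) run
over input with a bad part in the middle. The names ``quoted`` and
``text`` are just for this sketch:

```python
import re

# Toy pattern for illustration only: matches double-quoted strings.
quoted = re.compile(r'"[^"]*"')

text = '"one", @@bad@@, "three"'

# findall returns every non-overlapping match and says nothing about
# the text in between that did not match.
found = quoted.findall(text)
print(found)  # ['"one"', '"three"']
```

The malformed middle part simply disappears from the result, which is
why a separate well-formedness check is needed before trusting the list.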

My current approach is:

Pull out the checkname and *all* the arguments using:

    r'(.+?)\((.*)\)'
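
A minimal sketch of that first step. The helper name ``split_call`` and
the choice to return ``None`` on failure are assumptions for the example,
not part of the module:

```python
import re

# Outer pattern from above: a lazy name, then everything between the
# first '(' and the last ')'.
_outer = re.compile(r'(.+?)\((.*)\)')

def split_call(text):
    """Return (checkname, argument_string), or None if there is no match."""
    match = _outer.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

name, args = split_call('checkname(arg1, default=list("val1", \'val2\'))')
print(name)  # checkname
print(args)  # arg1, default=list("val1", 'val2')
```

Because ``(.*)`` is greedy, the closing parenthesis of a nested
``list(...)`` stays inside the argument string rather than ending it.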

I then have :


_paramstring = r'''
    (?:
        (
            (?:
                [a-zA-Z_][a-zA-Z0-9_]*\s*=\s*list\(
                    (?:
                        \s*
                        (?:
                            (?:".*?")|              # double quotes
                            (?:'.*?')|              # single quotes
                            (?:[^'",\s\)][^,\)]*?)       # unquoted
                        )
                        \s*,\s*
                    )*
                    (?:
                        (?:".*?")|              # double quotes
                        (?:'.*?')|              # single quotes
                        (?:[^'",\s\)][^,\)]*?)       # unquoted
                    )?                              # last one
                \)
            )|
            (?:
                (?:".*?")|              # double quotes
                (?:'.*?')|              # single quotes
                (?:[^'",\s=][^,=]*?)|       # unquoted
                (?:                         # keyword argument
                    [a-zA-Z_][a-zA-Z0-9_]*\s*=\s*
                    (?:
                        (?:".*?")|              # double quotes
                        (?:'.*?')|              # single quotes
                        (?:[^'",\s=][^,=]*?)       # unquoted
                    )
                )
            )
        )
        (?:
            (?:\s*,\s*)|(?:\s*$)            # comma
        )
    )
    '''

I can use ``_paramstring`` with findall to pull out all the arguments.
However - as I said, I first need to check that the entire input is
well formed. So I do a match against :

    _matchstring = '^%s*' % _paramstring

Having done a match I can use findall and ``_paramstring`` to pull out
*all* the parameters as a list - and go through each one checking if it
is a single argument, keyword argument or list constructor.
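
For concreteness, the check-then-extract step might be wired up roughly
like this. The pattern is copied from above; the ``re.VERBOSE`` flag, the
function name ``extract_params`` and the test that the match must reach
the end of the string are assumptions made for the sketch:

```python
import re

_paramstring = r'''
    (?:
        (
            (?:
                [a-zA-Z_][a-zA-Z0-9_]*\s*=\s*list\(
                    (?:
                        \s*
                        (?:
                            (?:".*?")|              # double quotes
                            (?:'.*?')|              # single quotes
                            (?:[^'",\s\)][^,\)]*?)       # unquoted
                        )
                        \s*,\s*
                    )*
                    (?:
                        (?:".*?")|              # double quotes
                        (?:'.*?')|              # single quotes
                        (?:[^'",\s\)][^,\)]*?)       # unquoted
                    )?                              # last one
                \)
            )|
            (?:
                (?:".*?")|              # double quotes
                (?:'.*?')|              # single quotes
                (?:[^'",\s=][^,=]*?)|       # unquoted
                (?:                         # keyword argument
                    [a-zA-Z_][a-zA-Z0-9_]*\s*=\s*
                    (?:
                        (?:".*?")|              # double quotes
                        (?:'.*?')|              # single quotes
                        (?:[^'",\s=][^,=]*?)       # unquoted
                    )
                )
            )
        )
        (?:
            (?:\s*,\s*)|(?:\s*$)            # comma
        )
    )
    '''

# Assumption: both patterns are compiled in verbose mode.
_param_re = re.compile(_paramstring, re.VERBOSE)
_match_re = re.compile('^%s*' % _paramstring, re.VERBOSE)

def extract_params(argstring):
    """Return the raw parameters, or raise ValueError if malformed."""
    match = _match_re.match(argstring)
    # Assumption: "well formed" means the match consumed everything.
    if match is None or match.end() != len(argstring):
        raise ValueError('badly formed arguments: %r' % argstring)
    return _param_re.findall(argstring)

params = extract_params(
    'arg1, "arg 2", kw="value", default=list("val1", \'val2\')')
```

Here ``params`` comes back as four strings: ``arg1``, ``"arg 2"``,
``kw="value"`` and the whole ``default=list(...)`` constructor, still
unparsed.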

For keyword arguments and list constructors I use another regular
expression (the appropriate part of ``_paramstring`` basically) to pull
out the values from that.
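
A rough sketch of that last pass. The two patterns here are simplified
stand-ins for "the appropriate part of ``_paramstring``", and
``classify`` and its return shape are hypothetical names chosen for the
example:

```python
import re

# Hypothetical, simplified patterns (not the exact pieces of _paramstring).
_list_re = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*list\((.*)\)$')
_keyword_re = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*(.*)$')

def classify(param):
    """Return a (kind, name, raw_value) tuple for one extracted parameter."""
    # A quoted value is a single argument even if it contains '='.
    if param[:1] in ('"', "'"):
        return ('single', None, param)
    match = _list_re.match(param)
    if match:
        return ('list', match.group(1), match.group(2))
    match = _keyword_re.match(param)
    if match:
        return ('keyword', match.group(1), match.group(2))
    return ('single', None, param)

print(classify('kw="value"'))  # ('keyword', 'kw', '"value"')
```

The list case is tried before the keyword case because a list
constructor also looks like ``name=...``. The raw list contents would
still need splitting into individual values.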

Now this approach works - but it's hardly "optimal" (for some value of
optimal). I wondered if anyone could suggest a better approach.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml