[Python-Dev] Re: pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching

Mon Aug 23 01:25:52 CEST 2004

"Fredrik Lundh" <fredrik at pythonware.com> writes:
> well, the examples in your PEP can be written as:
> 
>     data = [line[:-1].split(":") for line in open(filename)]

Yes, in practice I would write this, too.  The example was for pedagogical
purposes, but perhaps the fact that it's not particularly useful in practice
makes it a bad example.

> and
> 
>     import ConfigParser
> 
>     c = ConfigParser.ConfigParser()
>     c.read(filename)
> 
>     data = []
>     for section in c.sections():
>         data.append((section, c.items(section)))
> 
> both of which are shorter than your structparse examples.

Hmm.  In this case it doesn't seem fair to compare against a call to
ConfigParser; the fair comparison is with the code in the ConfigParser module
itself (or at least the subset of it that provides the equivalent
functionality).

I used this as an example because I thought most people would be familiar with
this file format, which saves them from having to figure out some new file
format in order to follow the PEP.  In practice, there are lots of other file
formats of similar complexity that are not handled by any such special-purpose
module, and structmatch would make it easy to parse them.  For example, this
"SQT" format is output by a certain mass spec analysis program:

        http://fields.scripps.edu/sequest/SQTFormat.html

There are a number of other bioinformatics programs whose output unfortunately
must be scraped at present.  The structmatch feature would also be useful for
these cases.  (This is what motivated the PEP.)

> and most of the one-liners in your pre-PEP can be handled with a
> combination of "match" and "finditer".

I think "findall" and "finditer" are almost useless for this kind of thing, as
they are essentially "searching" rather than "matching".  That is, they'll
happily, silently skip over garbage to get to something they like.  Since
finditer returns matches, you can always inspect the match to determine
whether anything was skipped, but this seems kind of lame compared to just
doing the right thing in the first place (i.e., matching).
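
To make the searching/matching distinction concrete, here is a minimal sketch.
The pattern, the sample line, and the helper names (TOKEN, tokens_by_search,
tokens_checked) are made up for illustration and do not come from the PEP.  It
shows findall dropping a bad token without complaint, and the extra bookkeeping
needed to notice the gap when iterating with finditer.

```python
import re

TOKEN = re.compile(r"\s*(\d+)")   # illustrative pattern, not from the PEP

def tokens_by_search(text):
    """Searching: findall quietly steps over anything that is not a token."""
    return TOKEN.findall(text)

def tokens_checked(text):
    """Same iteration, but reject any gap between consecutive matches."""
    pos = 0
    tokens = []
    for m in TOKEN.finditer(text):
        if m.start() != pos:
            raise ValueError("skipped %r at offset %d" % (text[pos:m.start()], pos))
        tokens.append(m.group(1))
        pos = m.end()
    if pos != len(text):
        raise ValueError("unparsed tail %r" % text[pos:])
    return tokens

print(tokens_by_search("10 20 oops 30"))   # the "oops" disappears without a trace
try:
    tokens_checked("10 20 oops 30")
except ValueError as exc:
    print("rejected:", exc)
```

The second function is the "inspect the match" approach: it works, but the
caller has to track offsets by hand for every pattern.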

> here's a 16-line helper that
> parses strings matching the "a(b)*c" pattern into a prefix/list/tail tuple.
> 
>     import re
> 
>     def parse(string, pat1, pat2):
>         """Parse a string having the form pat1(pat2)*"""
>         m = re.match(pat1, string)
>         i = m.end()
>         a = m.group(1)
>         b = []
>         for m in re.compile(pat2 + "|.").finditer(string, i):
>             try:
>                 token = m.group(m.lastindex)
>             except IndexError:
>                 break
>             b.append(token)
>             i = m.end()
>         return a, b, string[i:]
> 
> >>> parse("hello 1 2 3 4 # 5", r"(\w+)", r"\s*(\d+)")
> ('hello', ['1', '2', '3', '4'], ' # 5')

No offense, but this code makes me cringe.  The "|." trick seems like a
horrific hack, and I'd need to stare at this code for quite a while to
convince myself that it doesn't have some subtle flaw.  And even then it just
handles the "a(b)*c" case.  It seems like the code for more complex patterns
parsed this way would just explode in size, and would have to be written
custom for each pattern.
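
As one way of probing for such a flaw, the sketch below restates the quoted
helper (with an explicit lastindex check standing in for the try/except, which
is assumed to behave the same way) and feeds it two inputs chosen for this
illustration.  Without re.DOTALL, "." does not match a newline, so when the
character after a token is a newline and pat2 does not consume it, neither
branch matches at that position and finditer searches ahead to the next token.

```python
import re

def parse(string, pat1, pat2):
    """Restatement of the quoted helper: parse pat1(pat2)* plus a tail."""
    m = re.match(pat1, string)
    i = m.end()
    a = m.group(1)
    b = []
    for m in re.compile(pat2 + "|.").finditer(string, i):
        if m.lastindex is None:      # the "." branch matched: stop here
            break
        b.append(m.group(m.lastindex))
        i = m.end()
    return a, b, string[i:]

# A space after the first digit: "." matches it and the loop stops.
print(parse("a1 2", r"(\w)", r"(\d)"))

# A newline after the first digit: "." cannot match it, so finditer
# searches forward and the newline is silently dropped from the result.
print(parse("a1\n2", r"(\w)", r"(\d)"))
```

The first call leaves " 2" in the tail, while the second returns both digits
and an empty tail, so the two inputs are parsed differently even though
neither matches pat1(pat2)* in full.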

Mike