suggestion for module re

Jose' Sebrosa sebrosa at artenumerica.com
Mon Oct 22 05:34:40 EDT 2001


Hi

My first goal is to get all the *named* fields of a regex, for all of the
matches in a source string.  The re.RegexObject.findall method is "almost"
good, but it returns a list of tuples of matched *groups*, which leads to a
dumb parenthesis-counting task and to code that is hard to maintain.  Just
imagine that you have a regex with several groups and wish to insert a new
group in the middle.  This changes the group numbering, and that's no good.
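
A minimal sketch of that numbering problem, written for current Python.  The
pattern and the input text are made up purely for illustration:

```python
import re

text = '10:ab 20:cd'

# Original pattern: group 1 is the number, group 2 is the word.
before = re.compile(r'(\d+):(\w+)')
# Same pattern after inserting a new group around the separator.
after = re.compile(r'(\d+)(:)(\w+)')

# Code that picks "the second group" silently changes meaning.
words_before = [groups[1] for groups in before.findall(text)]
words_after = [groups[1] for groups in after.findall(text)]

print(words_before)  # ['ab', 'cd']
print(words_after)   # [':', ':']
```

Looking the fields up by name instead of by position would avoid this kind of
breakage.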

At first I thought about making a variant of the findall method that returns
a dictionary of named groups instead of a tuple of groups.  But we already
have the re.MatchObject.groupdict method (along with a bunch of other useful
match object methods), so I found it better to make a list of all the match
objects found in a source string.

Here is the method I would like to add to re.RegexObject (sure it could be
better named...)


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    def find_all_match_obj(self, source, method = 'match'):
        """Return a list of all non-overlapping match objects in the string.
        """
        pos = 0
        end = len(source)
        results = []
        if method == 'match':
            worker = self.match
        elif method == 'search':
            worker = self.search
        else:
            if type(method) != type(''):
                raise TypeError, ('Invalid type for method: %s'
                                  % `type(method)`)
            raise ValueError, 'Invalid value for method: %s' % `method`
        append = results.append
        while pos <= end:
            m = worker(source, pos, end)
            if m is None:
                break
            append(m)
            pos = max(m.end(), pos+1)
        return results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
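
For readers on current Python, here is a rough sketch of the same idea as a
standalone function.  Taking the compiled pattern as the first argument
(instead of self) is an assumption made here so the example runs without
patching the re module; the logic otherwise follows the method above.

```python
import re

def find_all_match_obj(pattern, source, method='match'):
    """Return a list of all non-overlapping match objects in source."""
    if not isinstance(method, str):
        raise TypeError('Invalid type for method: %r' % type(method))
    if method == 'match':
        worker = pattern.match
    elif method == 'search':
        worker = pattern.search
    else:
        raise ValueError('Invalid value for method: %r' % method)
    results = []
    pos, end = 0, len(source)
    while pos <= end:
        m = worker(source, pos, end)
        if m is None:
            break
        results.append(m)
        # Step past empty matches so the loop always advances.
        pos = max(m.end(), pos + 1)
    return results

rx = re.compile(r'(?P<a>a|A)(?P<b>b|B)')
# 'match' mode stops at the first position where the pattern fails...
print(len(find_all_match_obj(rx, 'ab-AB')))            # 1
# ...while 'search' mode skips over the non-matching '-'.
print(len(find_all_match_obj(rx, 'ab-AB', 'search')))  # 2
```

The two calls at the end show the practical difference between the two
modes: with 'match' the list ends at the first gap between matches, with
'search' the gaps are skipped.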


Such a method would let users use the full functionality of match objects in
very simple Python constructs.  For example, my first problem (getting all the
named groups of all matches) is solved like this (silly example follows):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import re
rx = re.compile(r'((?P<a>a|A)(?P<b>b|B))')
match_list = rx.find_all_match_obj('abaBAbAB')
named_group_list = map(lambda m: m.groupdict(),  match_list)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
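
A note for readers on later versions: Python 2.2 and newer provide
finditer on compiled patterns, which yields match objects one at a time
using search semantics.  Assuming such a version is available, the same
result can be sketched without the proposed method:

```python
import re

rx = re.compile(r'((?P<a>a|A)(?P<b>b|B))')

# finditer yields one match object per non-overlapping match.
named_group_list = [m.groupdict() for m in rx.finditer('abaBAbAB')]

print(named_group_list[0])    # {'a': 'a', 'b': 'b'}
print(len(named_group_list))  # 4
```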



Notes:

 - I'm using Python 1.5.2 (under Linux), but I inspected the docs for Python
2.1 and it seems to me that it too lacks a method like this.

 - I don't know which is better to use as the matching function, "match" or
"search", so I supported both...  But the question remains of which default we
should define, or whether we should define *two* different methods, one using
match and the other using search.  What's the usual policy?

 - I like to use the clean syntax obj.name instead of the hairy obj['name']
whenever possible.  Sure, it is easy to go from one to the other, but it is a
pity to have a groupdict method and not an equivalent groupobject method that
returns an object usable with the cleanest syntax.  With it, we could look at a
regex with named groups as a way to define a set of standard names in Python
(and to give them values).  Furthermore, named groups in regexes are already
required to have valid Python names, so it is *really* a pity...!
Anyway, that's just an aesthetic modification.
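
The groupobject idea from the last note can be approximated on current
Python.  This is only a sketch: groupobject is a hypothetical helper, not part
of the re module, and it relies on types.SimpleNamespace from the standard
library (Python 3.3 and later).  The pattern and input are invented.

```python
import re
from types import SimpleNamespace

def groupobject(match):
    """Hypothetical helper: expose a match's named groups as attributes."""
    return SimpleNamespace(**match.groupdict())

m = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})', '2001-10')
fields = groupobject(m)

# Attribute access instead of fields['year'] / fields['month'].
print(fields.year, fields.month)  # 2001 10
```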


Thanks,
Sebrosa


