[Tutor] Avoiding repetitive pattern match in re module

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Thu Jan 5 20:27:30 CET 2006


> Instead of writing a new 'for,if' loop to filter the repetitive tags
> from the list, is there something that I can add in the re itself to
> match the pattern only once?

Hi Intercodes,

As far as I know, no: regular expressions don't have the capacity to
remember what tags they've matched in previous matching runs.

You may find "sets" useful in filtering out duplicates:

    http://www.python.org/doc/lib/module-sets.html

For example:

######
>>> from sets import Set as set
>>> set([3, 1, 4, 1, 5, 9, 2, 6])
Set([1, 2, 3, 4, 5, 6, 9])
######

So you shouldn't need to write much extra logic to filter out the
duplicates: sets do this for you already.
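
For instance, if you're pulling the tags out with findall(), you can
feed the result straight into a set.  Here's a sketch with a made-up
pattern and input, since I haven't seen your exact regex:

######
>>> import re
>>> from sets import Set as set
>>> html = "<b>one</b><i>two</i><b>three</b>"
>>> re.findall(r'<(\w+)>', html)
['b', 'i', 'b']
>>> set(re.findall(r'<(\w+)>', html))
Set(['b', 'i'])
######

(The order of the elements in the Set may come out differently on your
machine.)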



As an aside, I want to second Alan's recommendation to avoid doing so much
string concatenation: it's actually very expensive to do the following:

#########################
ans = ''
while 1:
    data = file1.readline()
    if data == "":
        break
    ans = ans + data      # copies all of 'ans' on every pass
#########################

From a technical standpoint, it has "quadratic" complexity in terms of
the work the computer is doing.  It's related to the mathematical fact
that 1 + 2 + 3 + 4 + ... + n = n(n+1)/2, which grows proportionally
to n^2.

    http://en.wikipedia.org/wiki/Triangle_number

The string concatenation that the code does above is analogous: each
"ans = ans + data" copies the entire string built so far, so the
copying costs add up just like that triangle sum.  Alan's approach, to
use file1.read(), sucks up the string in one shot, so it's much less
computationally expensive.
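
In other words, the whole loop above collapses into a single call:

######
ans = file1.read()    # slurp the entire file in at once
######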

If you do end up having to concatenate a lot of strings together,
collect the pieces in a list and use the join() method of strings
instead (e.g. ''.join(pieces)).  We can talk about this in more detail
if you'd like.
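
To make that concrete, here's the usual accumulate-then-join idiom as
a minimal sketch (assuming file1 is an open file object, as above):

######
chunks = []
for line in file1:            # gather all the pieces first...
    chunks.append(line)
ans = ''.join(chunks)         # ...then glue them together in one pass
######

Since join() builds the final string only once, the total work is
linear in the size of the output.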


