a RegEx puzzle

Jeremy Bowers jerf at jerf.org
Fri Mar 11 07:40:34 EST 2005


On Fri, 11 Mar 2005 08:38:36 -0500, Charles Hartman wrote:

> I'm still shaky on some of sre's syntax. Here's the task: I've got 
> strings (never longer than about a dozen characters) that are 
> guaranteed to be made only of characters 'x' and '/'. In each string I 
> want to find the longest continuous stretch of pairs whose first 
> character is 'x' and the second is either mark. So if marks = 
> '/xx/xxx///', the "winning" stretch begins at position 2 and is 6 
> characters long ('x/xxx/')

I don't think regexes are a good match for this task. They just aren't
optimized for this very well.

Here's what I have, though I don't know if it's *exactly* what you want:

def xCounter(s, initialOffset):
    offset = initialOffset
    count = 0
    slen = len(s)
    while s[offset] == 'x':
        count = count + 1
        offset += 2
        if offset > slen:
            break
    return count, s[initialOffset:offset]

def longestRun(s):
    return max([xCounter(s, i) for i in range(len(s))])[1]

In other words, start at all the positions and find the max, sort of "by
hand".

(In particular, I don't know how this will handle two runs of equal size,
if it will prefer the first or the last. I don't know how *you* handle
that, either, so I'm not sweating it for now. If you clarify this I can
help, the easiest way is to add another number into the tuple that
xCounter generates for max to work on.) 

This is not optimized, and if you're processing millions of these, this
may be too slow. What I'd suggest is that if this isn't fast enough,
memoize the function, i.e., with this:

longestRunResults = {}

def memoizedLongestRun(s):
    try:
        return longestRunResults[s]
    except KeyError: pass

    result = longestRun(s)
    longestRunResults[s] = result
    return result

You'll probably want to change the names of things to better fit.

This will get you the benefits of pre-computation without paying the
up-front costs (you only compute what you need, not all possible
combinations), and this ought to be plenty fast to process a whole lotta
data really quickly.

(This is a great example of why readable code can be more important than
fast code; the speedup can be added later, it's more important that you
can satisfy yourself that the code is correct, and a "clever" algorithm
might make that hard.)



More information about the Python-list mailing list