regexp search question

Francis Avila francisgavila at yahoo.com
Wed Oct 22 20:14:36 EDT 2003


"Paul Rubin" <http://phr.cx@NOSPAM.invalid> wrote in message
news:7xsmlk97pt.fsf_-_ at ruckus.brouhaha.com...
> I have a string s, possibly megabytes in size, and two regexps, p and q.
>
> I want to find the first occurrence of q that occurs after the first
> occurrence of p.
>
> Is there a reasonable way to do it?
>
>     g1 = re.search(p, s)
>     g2 = re.search(q, s[g1.end():])
>     q_offset = g1.end() + g2.start()
>
>
> is not a reasonable way, since it copies a ton of data around
> (slicing an arbitrary sized chunk off s into a new temporary string).
>
> Most regexps libs I know of have a way to start the search at a
> specified offset.  Python's string.find and string.index methods
> have a similar optional arg.  But I don't see it described in the
> re module docs.
>
> Am I missing something?

Yes: you can specify an offset, but only in the search METHOD of compiled re
objects, not in the module-level re.search function.  With the function, your
only option is to slice the string.
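
To illustrate the method form, here is a small sketch; the sample string and
patterns are made up for the example.  It also shows one subtlety: passing a
start offset is not quite the same as slicing, because '^' still anchors to
the real beginning of the string rather than to the offset.

```python
import re

s = "spam eggs spam"          # made-up sample data
pat = re.compile(r"spam")

# The compiled object's search method takes an optional start offset (pos).
m = pat.search(s, 1)          # skips the "spam" at index 0
found_at = m.start()

# pos is not identical to slicing: '^' anchors to the real start of s.
anchored = re.compile(r"^eggs")
with_pos = anchored.search(s, 5)      # no match: index 5 is not the start of s
with_slice = anchored.search(s[5:])   # matches: the slice begins with "eggs"
```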


Alternative 1:
Instead of slicing the string, make a buffer object that references a
slice of the string (using the buffer() builtin).
NOTE: Don't do this!

Alternative 2:
Compile regular expression objects for p and q, instead of calling re.search
directly.  Since I don't know the implementation details of re, I don't know
if the start/end args to REOBJECT.search will copy the string or use a buffer,
so that may not be different from what you're doing.  However, compiling the
re should also be faster if you do this search more than once.
(NOTE: untested code!)

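    import re

    # ppattern, qpattern: your two pattern strings
    p = re.compile(ppattern)
    q = re.compile(qpattern)
    matchp = p.search(somestring)
    pend = matchp.end()                   # index just past the first match of p
    matchq = q.search(somestring, pend)   # pass pend as pos instead of slicing
    qstart = matchq.start()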

matchq.start() returns an index into the whole string, not an offset relative
to the pos argument, so no adjustment is needed:

    offset = matchq.start()   # already counted from the start of somestring
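
Putting Alternative 2 together as a function, here is a sketch that also
handles the case where either pattern fails to match.  The name
find_q_after_p and the sample strings are invented for illustration and are
not part of the re module.  A match object's start() reports a position in
the full string even when a start offset was given, so the result needs no
adjustment.

```python
import re

def find_q_after_p(ppattern, qpattern, s):
    """Return the index in s of the first match of qpattern that starts at or
    after the end of the first match of ppattern, or None if there is none."""
    matchp = re.compile(ppattern).search(s)
    if matchp is None:
        return None
    # Start the second search at matchp.end() via the pos argument.
    matchq = re.compile(qpattern).search(s, matchp.end())
    if matchq is None:
        return None
    return matchq.start()   # position within s, not relative to pos

# The "bar" at index 0 comes before "foo", so it is skipped.
offset = find_q_after_p(r"foo", r"bar", "bar foo bar")
```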

Alternative 3:
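You could probably combine p and q into a single regexp specifying that you
match p, then q, with anything in between (matched non-greedily, so you stop
at the first q).  Using groups (p is grp 1, q is grp 2, provided p has no
groups of its own), get your offset with matchpq.start(2), which is already
an index into the whole string.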
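
A sketch of the combined form follows.  The function name and the group name
qmatch are hypothetical; a named group is used so that any groups inside p do
not shift the numbering.  It assumes neither pattern already defines a group
called qmatch and that q contains no numbered backreferences.  The filler
[\s\S]*? is used rather than a dot so that newlines between p and q are
allowed without changing how '.' behaves inside p and q.  On a very large
string the lazy filler is likely slower than two separate searches.

```python
import re

def find_q_after_p_combined(ppattern, qpattern, s):
    """Return the index in s where q first matches after the first match of p,
    using one combined pattern, or None if the combined pattern fails."""
    combined = re.compile(
        r"(?:%s)[\s\S]*?(?P<qmatch>%s)" % (ppattern, qpattern)
    )
    m = combined.search(s)
    if m is None:
        return None
    # start() of a group is an index into s itself.
    return m.start("qmatch")

offset = find_q_after_p_combined(r"foo", r"bar", "bar foo\nbar")
```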

There are probably many other ways.


> Thanks.

No problem.
--
Francis Avila




