searching backwards in a string

Paul Rubin phr-n2002a at nightsong.com
Thu Feb 14 01:21:11 EST 2002


sjmachin at lexicon.net (John Machin) writes:
> > Suppose I'm parsing the file and I see a </table> tag and I want to
> > find the matching <table> tag.  It could be pretty far back in the file.
> > That's what I was doing when I encountered this question.
> 
> Paul, I really admire your energy, writing your own HTML parser. I'm a
> lazy old so-and-so, if I had the slightest interest in HTML I'd be
> looking at the HTMLParser and htmllib modules. What was there about
> them that didn't suit your purpose?

I'm not trying to write a general purpose HTML parser; I'm trying to
pull some specific data out of a specific bunch of HTML files that are
several megabytes long and would probably break the htmllib modules.

> ... and if I had the energy to write an HTML parser I probably would
> have used some dumb old technique like reading the file forwards and
> if I met <foo> I'd put it in a stack (or some other data structure) of
> "unclosed" tags (together with the position in the file where I found
> it) and when I met the corresponding </foo> I'd take the appropriate
> "foo" action (which would probably involve the use of the data that
> I'd found between <foo> and </foo>) and then rip "foo" out of the
> pending bag ... using a regex backwards to find the opening tag is so
> innovative that I'm totally gobsmacked.
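
For comparison, here is a minimal sketch of the forward-scanning scheme
described in the quoted text. The names TAG_RE and match_tags are
hypothetical, invented for illustration; the regex is a rough
approximation of an HTML tag, not a real parser, and it ignores comments,
scripts and attribute values containing ">".

```python
import re

# Hypothetical helper, not part of any library: scan forwards and pair each
# closing tag with the most recent unclosed opening tag of the same name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def match_tags(text):
    """Return a list of (name, open_pos, close_pos) tuples, in closing order."""
    pending = []   # stack of (name, position) for tags not yet closed
    pairs = []
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1), m.group(2).lower()
        if not closing:
            pending.append((name, m.start()))
            continue
        # Walk back down the stack to the nearest opener with this name.
        # Anything above it (e.g. an unclosed <br>) is simply discarded.
        for i in range(len(pending) - 1, -1, -1):
            if pending[i][0] == name:
                pairs.append((name, pending[i][1], m.start()))
                del pending[i:]
                break
    return pairs

sample = "<table><tr><td>a<br></td></tr></table>"
for name, start, end in match_tags(sample):
    print(name, start, end)
```

Each tuple gives the tag name, the offset of the opening tag and the
offset of the closing tag, so the data between them is text[start:end].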

Basically I was trying to rewrite an Emacs macro as a Python script.
There was no need for such fancy structures.  Anyway, searching
backwards is a perfectly natural thing to want to do, so if there's
no easy way to do it, there's a deficiency in the re module.  (The
underlying regex library takes a direction flag, as mentioned before,
so the re module doesn't fully implement the regex library's API.)
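
Absent a direction flag, one workaround is to scan forwards and keep the
last match that ends before a given position. The sketch below assumes
that is acceptable; rsearch is a hypothetical name, not an re function.
It is not a true backward search: it walks the buffer from the start, so
it costs time proportional to the text before the position, and it only
sees non-overlapping matches. For a fixed string, str.rfind already does
the job.

```python
import re

def rsearch(pattern, text, endpos=None):
    """Return the last non-overlapping match ending at or before endpos, or None."""
    if endpos is None:
        endpos = len(text)
    last = None
    # finditer(string, pos, endpos) treats the string as if it ended at endpos.
    for last in re.compile(pattern).finditer(text, 0, endpos):
        pass
    return last

html = "<table>one</table> <table border=1>two</table>"
close = html.rfind("</table>")            # fixed string: rfind is enough
m = rsearch(r"<table\b[^>]*>", html, close)
print(m.start(), m.group())
```

This finds the nearest preceding <table> tag only; it makes no attempt
to handle nested tables.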

> > But searching
> > backwards is a normal thing to want to do in general--for example it's
> > a standard command in any decent text editor.
> 
> ... and in many indecent text editors. However AFAIK the
> implementation is to go back a line at a time and do a forward regex
> search in each line.

That depends on how the editor is implemented, of course.  And that
scheme breaks if the regex is supposed to match something that's
split across multiple lines.
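
A small illustration of that failure, assuming the editor-style scheme
is implemented as "walk the lines from last to first and run a forward
search in each one". The function name rsearch_by_line and the sample
text are made up for the example.

```python
import re

def rsearch_by_line(pattern, text):
    """Search lines last-to-first; return (line_index, match) or None."""
    lines = text.split("\n")
    for i in range(len(lines) - 1, -1, -1):
        m = re.search(pattern, lines[i])
        if m:
            return i, m
    return None

# The opening tag is split across two lines.
text = "<p>intro</p>\n<table\n  border=1>\n<tr><td>x</td></tr>"
pattern = r"<table\s[^>]*>"

print(rsearch_by_line(pattern, text))        # None: no single line holds the tag
print(re.search(pattern, text) is not None)  # True: the whole buffer matches
```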

> > Anyway, I just entered a sourceforge bug about it being missing from
> > Python's re module.
> 
> Looks like the effbot gets to be gobsmacked too.

I'm not sure what you mean by that.

> > Thanks
> 
> No, Paul, thank *you* -- you've made my day.

Or that either.


