RegEx with multiple occurrences

Tim Chase python.list at tim.thechases.com
Thu May 4 07:54:17 EDT 2006


> p = re.compile("(\<script.*>*\</script>)",re.IGNORECASE | re.DOTALL)
> m = p.search(data)

First, I presume you didn't copy & paste your expression, as 
it looks like you're missing a period before the second 
asterisk.  As written, ">*" only matches any number of 
greater-than signs ahead of the closing "</script>" tag, 
rather than arbitrary text.

Second, you may be getting some odd results because you're 
not using a "raw" string of the form

	r'(\<script...script>)'
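
To illustrate why the raw string matters, here's a quick sketch. 
The "\b" pattern and the sample text are my own, not from your 
code: in a normal string Python turns "\b" into a backspace 
character before the re module ever sees it, while the raw 
string hands the backslash through untouched.

```python
import re

# Illustrative patterns only; "\b" is not part of the original expression.
plain = '\bscript\b'   # Python converts each \b into a backspace character
raw = r'\bscript\b'    # backslashes reach the regex engine as word boundaries

print(len(plain), len(raw))

sample = '<script>'    # hypothetical sample text
found_with_plain = re.search(plain, sample)
found_with_raw = re.search(raw, sample)

print(found_with_plain)
print(found_with_raw.group(0))
```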

> The problem is that I'm getting everything from the 1st
> script's start tag to the last script's end tag in one
> group - so it seems like it parses the string from both
> ends therefore removing far more from that data than I
> want. What am I doing wrong?

Looks like you want the non-greedy modifier to the "*" 
described at

http://docs.python.org/lib/re-syntax.html

(searching the page for "greedy" should turn up the 
paragraph on the modifiers)

You likely want something more like:

	r'<script[^>]*>.*?</script>'

In the first atom, you're looking for the remainder of the 
script tag (as much stuff that isn't a ">" as possible). 
Then you close the tag with the ">", and then take as little 
as possible (".*?") of anything until you find the closing 
"</script>" tag.
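
A rough sketch of the difference, assuming you keep the 
re.IGNORECASE | re.DOTALL flags from your compile() call.  The 
sample "data" string is made up for the example, and the 
variable names are mine:

```python
import re

# Made-up sample input with two script blocks (not from the original post).
data = (
    '<p>intro</p>'
    '<script type="text/javascript">var a = 1;</script>'
    '<p>middle</p>'
    '<SCRIPT>\nvar b = 2;\n</SCRIPT>'
    '<p>end</p>'
)

# Same flags as the original compile() call: DOTALL lets "." cross newlines.
flags = re.IGNORECASE | re.DOTALL

greedy = re.compile(r'<script[^>]*>.*</script>', flags)
lazy = re.compile(r'<script[^>]*>.*?</script>', flags)

# Greedy ".*" runs from the first start tag to the last end tag.
greedy_matches = greedy.findall(data)

# Non-greedy ".*?" stops at the first closing tag, so each block matches alone.
lazy_matches = lazy.findall(data)

# Removing the script blocks leaves the surrounding markup intact.
stripped = lazy.sub('', data)

print(len(greedy_matches), len(lazy_matches))
print(stripped)
```

If you want every script block rather than just the first, 
findall() or finditer() on the compiled pattern will walk all 
of them, where search() stops after one.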

HTH,

-tkc






