RegEx with multiple occurrences
Tim Chase
python.list at tim.thechases.com
Thu May 4 07:54:17 EDT 2006
> p = re.compile("(\<script.*>*\</script>)",re.IGNORECASE | re.DOTALL)
> m = p.search(data)
First, I presume you didn't copy & paste your expression, as
it looks like you're missing a period before the second
asterisk. As written, the ">*" only matches a run of zero or
more greater-than signs just before the closing "</script>"
tag.
Second, you may be getting some odd results because
you're not using a "raw" string of the form
r'(\<script...script>)'
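To show why the r'' prefix matters, here's a small sketch
(the sample strings are made up for illustration). In a
normal string literal, Python processes backslash escapes
before the regex engine ever sees them; with the r prefix
the backslashes are passed through untouched. A sequence
like "\<" happens to survive either way because it isn't a
recognised escape, but "\b" does not:

```python
import re

# In a normal string literal, "\b" is a backspace character (ASCII 8).
plain = "\bscript\b"
# With the r prefix, the backslash and the "b" both survive, so the
# regex engine sees its word-boundary escape.
raw = r"\bscript\b"

print(len(plain))  # 8: two backspace chars plus "script"
print(len(raw))    # 10: backslash + "b" on each side of "script"

text = "a script tag"  # made-up sample text
print(re.search(plain, text))         # None: text has no backspace chars
print(re.search(raw, text).group(0))  # script
```

Using the r prefix for every pattern saves having to
remember which escapes Python interprets on its own.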
> The problem is that I'm getting everything from the 1st
> script's start tag to the last script's end tag in one
> group - so it seems like it parses the string from both
> ends therefore removing far more from that data than I
> want. What am I doing wrong?
Looks like you want the non-greedy modifier to the "*"
described at
http://docs.python.org/lib/re-syntax.html
(searching the page for "greedy" should turn up the
paragraph on the modifiers)
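A quick sketch of the difference, using a made-up string
rather than your data: the plain "*" grabs as much as it
can and only backs off far enough to let the rest of the
pattern match, while "*?" stops at the first place the rest
of the pattern can match.

```python
import re

# Hypothetical sample data, just for illustration.
data = "<b>one</b> and <b>two</b>"

# Greedy: ".*" runs to the end and backs up to the LAST "</b>".
greedy = re.search(r"<b>.*</b>", data).group(0)

# Non-greedy: ".*?" expands only until the FIRST "</b>".
lazy = re.search(r"<b>.*?</b>", data).group(0)

print(greedy)  # <b>one</b> and <b>two</b>
print(lazy)    # <b>one</b>
```

That greedy result is the "first start tag to last end tag"
behavior you're describing.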
You likely want something more like:
r'<script[^>]*>.*?</script>'
In the first atom, you're looking for the remainder of the
script tag (as much stuff that isn't a ">" as possible).
Then you close the tag with the ">", and then take as little
as possible (".*?") of anything until you find the closing
"</script>" tag.
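Putting it together, something along these lines should
pick up each script block separately. This is only a sketch:
the sample HTML is invented, and I'm assuming you want the
same IGNORECASE and DOTALL flags you were already using.

```python
import re

# Made-up sample input containing two separate script blocks.
data = """<html>
<SCRIPT type="text/javascript">
var a = 1;
</SCRIPT>
<p>keep me</p>
<script>var b = 2;</script>
</html>"""

# Raw string, "[^>]*" for the rest of the opening tag, and a
# non-greedy ".*?" so each match stops at the nearest closing tag.
p = re.compile(r"<script[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)

# findall() returns every non-overlapping match, not just the first.
blocks = p.findall(data)
print(len(blocks))  # 2

# sub() removes each block on its own, leaving the text in between.
stripped = p.sub("", data)
print("keep me" in stripped)  # True
print("var" in stripped)      # False
```

search() only ever returns the first match, so findall(),
finditer() or sub() is what you'd want for multiple
occurrences.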
HTH,
-tkc