[Tutor] a re question
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Thu, 4 Oct 2001 22:31:52 -0700 (PDT)
On Fri, 5 Oct 2001, Newbie Python wrote:
> How can I use re to match something like:
> js="""
> <script language="JavaScript">
> blahblahblahblahblahblahblah
> blahblahblahblah
> blahblahblahblah
> </script>
> """
>
> I use this:
> re.match(r"<script.+?</script>",js,re.S)
> but it will not match..
>
> Can you please tell me why and how to write the regex?
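Ah. There's a difference between Python's "match()" and "search()"
regular expression functions: match() only succeeds if the pattern matches
right at the beginning of our text, while search() scans forward through
the text for the first place where the pattern fits. Your string begins
with a newline, not with "<script", so match() gives up immediately.
Switching to re.search() with the same pattern and the re.S flag should do
the trick. Take a look at:
http://www.python.org/doc/lib/matching-searching.html
for more details about this. Don't worry: everyone who starts off with
Python regular expressions gets caught by this at least once. *grin*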
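Here's a small sketch of the difference, using your own string and pattern. The variable names (pattern, matched, found) are just ones picked for illustration:

```python
import re

js = """
<script language="JavaScript">
blahblahblahblahblahblahblah
blahblahblahblah
blahblahblahblah
</script>
"""

pattern = r"<script.+?</script>"

# match() only tries at index 0, where js holds a newline, so it fails.
matched = re.match(pattern, js, re.S)

# search() scans forward until the pattern fits somewhere in the string.
found = re.search(pattern, js, re.S)

print(matched)
print(found.group(0))
```

match() comes back with None here, while search() hands back a match object whose group(0) is the whole <script>...</script> block.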
By the way, you can also use the SGMLParser class that's in the 'sgmllib'
module: it knows how to read HTML-like pages, and it's pretty reliable.
Using sgmllib does require that you feel a little comfortable with
classes, so if you're not familiar with them, hmmm... think of this as a
motivating example. *grin*
###
import sgmllib

class MyJavascriptExtractor(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.in_javascript = 0
        self.jscontent = []

    def getContent(self):
        return '\n'.join(self.jscontent)

    ##########################################################
    ## Below are handlers that will be called when we feed() a
    ## document to the parser.

    def start_script(self, attributes):
        self.in_javascript = 1

    def handle_data(self, data):
        if self.in_javascript:
            self.jscontent.append(data)

    def end_script(self):
        self.in_javascript = 0
###
Here's an example run:
###
>>> mystr = """
... <html><body> hello world, this is a test.
... <script language="JavaScript">
... document.print("And this is another!");
... function factorial(x) {
...     if (x == 0) return 1;
...     else return x * factorial(x-1);
... }
... </script>"""
>>> extractor = MyJavascriptExtractor()
>>> extractor.feed(mystr)
>>> extractor.getContent()
'\ndocument.print("And this is another!");\nfunction factorial(x) {\n    if (x == 0) return 1;\n    else return x * factorial(x-1);\n}\n'
###
I have no idea if the above is valid Javascript. *grin*
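One note for anyone trying this on Python 3: sgmllib is not in the standard library there. A rough equivalent can be sketched with html.parser.HTMLParser instead. This is a swapped-in approach under that assumption, not the sgmllib code above, and the class name ScriptExtractor, its get_content() method, and the sample page are all made up for illustration:

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collects the text found between <script> and </script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this handler.
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so we keep a list of chunks.
        if self.in_script:
            self.chunks.append(data)

    def get_content(self):
        return "".join(self.chunks)


# Hypothetical sample page, just for this sketch.
page = '<html><body>hi<script language="JavaScript">var x = 1;</script></body></html>'
extractor = ScriptExtractor()
extractor.feed(page)
extractor.close()
print(extractor.get_content())
```

The shape is the same as before: a flag that flips on at the opening tag and off at the closing tag, plus a data handler that only records text while the flag is set.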