[Tutor] a re question

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Thu, 4 Oct 2001 22:31:52 -0700 (PDT)


On Fri, 5 Oct 2001, Newbie Python wrote:

> How can use re to match something like:
> js="""
> <script language="JavaScript">
> blahblahblahblahblahblahblah
> blahblahblahblah
> blahblahblahblah
> </script>
> """
> 
> I use this:
> re.match(r"<script.+?</script>",js,re.S)
> but it will not match..
> 
> Can you please tell me why and how to write the regex?

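Ah.  There's a difference between Python's "match()" and "search()"
regular expression functions: match() only succeeds if the pattern matches
right at the beginning of our text, while search() scans forward through
the text for the first place where the pattern fits.  Your string begins
with a newline, not "<script", so match() gives up immediately, and
search() is the one you want here.  Take a look at: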

    http://www.python.org/doc/lib/matching-searching.html

for more details about this.  Don't worry: everyone who starts off with
Python regular expressions gets caught by this at least once.  *grin*
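
To make the difference concrete, here is a small sketch that reuses the
pattern from the question on a shortened copy of the string.  The variable
names (pattern, anchored, found) are just mine for illustration:

```python
import re

# Shortened copy of the string from the question; note the leading newline.
js = """
<script language="JavaScript">
blahblahblahblah
</script>
"""

pattern = r"<script.+?</script>"

# match() only tries position 0, where the text holds a newline.
anchored = re.match(pattern, js, re.S)

# search() scans forward until the pattern fits somewhere.
found = re.search(pattern, js, re.S)

print(anchored)        # None: nothing matches at the very start
print(found.group(0))  # the whole <script>...</script> block
```

The re.S flag is still needed with search(), since without it "." will not
cross the newlines inside the script block.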


By the way, you can also use the SGMLParser class that's in the 'sgmllib'
module: it knows how to read HTML-like pages, and it's pretty reliable.

Using sgmllib does require that you feel a little comfortable with
classes, so if you're not familiar with them, hmmm... think of this as a
motivating example.  *grin*

###
import sgmllib

class MyJavascriptExtractor(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.in_javascript = 0
        self.jscontent = []

    def getContent(self):
        # handle_data() may deliver the text in pieces; join them unchanged.
        return ''.join(self.jscontent)

    ##########################################################
    ## Below are handlers that will be called when we feed() a
    ## document to the parser.

    def start_script(self, attributes):
        self.in_javascript = 1

    def handle_data(self, data):
        if self.in_javascript:
            self.jscontent.append(data)
        
    def end_script(self):
        self.in_javascript = 0
###


Here's an example run:

###
>>> mystr = """
... <html><body> hello world, this is a test.
... <script language="JavaScript">
... document.print("And this is another!");
... function factorial(x) {
...     if (x == 0) return 1;
...     else return x * factorial(x-1);
... }
... </script>"""
>>> extractor = MyJavascriptExtractor()
>>> extractor.feed(mystr)
>>> extractor.getContent()
'\ndocument.print("And this is another!");\nfunction factorial(x) {\n    if (x == 0) return 1;\n    else return x * factorial(x-1);\n}\n'
###


I have no idea if the above is valid Javascript.  *grin*
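
One note for anyone reading this on a newer Python: sgmllib was removed in
Python 3.  A rough equivalent can be sketched with the HTMLParser class from
the standard 'html.parser' module.  This is only a sketch under that
assumption, not a drop-in replacement; the class name ScriptExtractor and
its get_content() method are names I made up for the example:

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collects the text found between <script> and </script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    # HTMLParser calls these handlers while we feed() it a document.
    # Tag names arrive already lowercased.
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)

    def get_content(self):
        # Data may arrive in several pieces, so join them unchanged.
        return "".join(self.chunks)


page = """
<html><body> hello world, this is a test.
<script language="JavaScript">
function factorial(x) {
    if (x == 0) return 1;
    else return x * factorial(x-1);
}
</script></body></html>"""

extractor = ScriptExtractor()
extractor.feed(page)
extractor.close()
script_text = extractor.get_content()
print(script_text)
```

The shape is the same as the sgmllib version: a flag that is switched on and
off by the tag handlers, and a data handler that only records text while the
flag is set.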