stripping HTML comments in the face of programmer errors

Skip Montanaro skip at pobox.com
Fri Nov 8 06:00:23 EST 2002


HTML comments aren't supposed to be nested, nor are they supposed to enclose
unescaped HTML tags, but people routinely commit both sins anyway.  People
also forget to close HTML comments, yet for the most part browsers still
seem to display such pages more or less correctly.

I have an HTML comment-stripping function that handles the nesting part
okay:

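    import re

    def zapcomment(data):
        # Split on comment delimiters, keeping them as separate pieces.
        pieces = re.split("(<!--|-->)", data)
        nest = 0        # current comment nesting depth
        newdata = []
        for piece in pieces:
            if piece == "<!--":
                nest += 1
            elif piece == "-->":
                # Ignore stray closers instead of going negative.
                nest = max(0, nest - 1)
            elif nest == 0:
                # Keep only text that lies outside every comment.
                newdata.append(piece)
        return "".join(newdata)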

but I'm at a loss as to how to handle runaway comments (more openers than
closers), e.g.:

    <script language="JavaScript" type="text/javascript">
    <!--
    <!-- Hide script from old browsers
    myPix1 = new Array("gp1/gp1-pic1.gif","gp1/gp1-pic2.gif","gp1/gp1-pic3.gif","gp1/gp1-pic4.gif")
    myPix2 = new Array("gp2/gp2-pic1.gif","gp2/gp2-pic2.gif","gp2/gp2-pic3.gif","gp2/gp2-pic4.gif")
    myPix3 = new Array("gp3/gp3-pic1.gif","gp3/gp3-pic2.gif","gp3/gp3-pic3.gif","gp3/gp3-pic4.gif")
    myPix4 = new Array("gp4/gp4-pic1.gif","gp4/gp4-pic2.gif","gp4/gp4-pic3.gif","gp4/gp4-pic4.gif")
    function choosePix() {
            if (document.images) {
                    randomNum = Math.floor((Math.random() * myPix1.length))
                    document.myPicture1.src = myPix1[randomNum]

                    randomNum = Math.floor((Math.random() * myPix2.length))
                    document.myPicture2.src = myPix2[randomNum]

                    randomNum = Math.floor((Math.random() * myPix3.length))
                    document.myPicture3.src = myPix3[randomNum]

                    randomNum = Math.floor((Math.random() * myPix4.length))
                    document.myPicture4.src = myPix4[randomNum]
            }
    }
    // End hiding script from old browsers -->
    </script>
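
To make the failure concrete, here is a small self-contained check. It repeats zapcomment so it runs on its own, and the two input strings are shortened stand-ins I made up for illustration, not the real page:

```python
import re

def zapcomment(data):
    pieces = re.split("(<!--|-->)", data)
    nest = 0
    newdata = []
    for piece in pieces:
        if piece == "<!--":
            nest += 1
        elif piece == "-->":
            nest = max(0, nest - 1)
        elif nest == 0:
            newdata.append(piece)
    return "".join(newdata)

# Properly balanced nesting: both comments are removed.
nested = "a<!-- outer <!-- inner --> still outer -->b"
print(zapcomment(nested))          # ab

# Runaway comment: two openers, one closer (abbreviated, made-up input).
runaway = "<script>\n<!--\n<!-- hide\ncode\n// -->\n</script>\n"
print(repr(zapcomment(runaway)))   # '<script>\n'
```

In the runaway case the nesting depth never gets back to zero, so the closing </script> tag, and anything that followed it, is thrown away along with the comment.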

Does anybody out there have a bit of code that implements a useful heuristic
for that case?  Ideally, stripping comments from the script example above
would yield:

    <script language="JavaScript" type="text/javascript">
    </script>
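
One heuristic that might work, offered only as a sketch: treat a closing </script> or </style> tag as the end of every comment still open. The name zapcomment_runaway and the choice of those two tags are my own assumptions, and it will misfire if a literal </script> ever appears inside a comment that was meant to stay closed:

```python
import re

# Assumed delimiter set: comment markers plus </script> and </style> closers.
_DELIMS = re.compile(r"(<!--|-->|</(?:script|style)\s*>)", re.IGNORECASE)

def zapcomment_runaway(data):
    """Strip comments, force-closing any open comment at </script> or </style>."""
    pieces = _DELIMS.split(data)
    nest = 0
    newdata = []
    for index, piece in enumerate(pieces):
        # With one capturing group, re.split puts delimiters at odd indexes.
        if index % 2 == 0:
            if nest == 0:
                newdata.append(piece)
        elif piece == "<!--":
            nest += 1
        elif piece == "-->":
            nest = max(0, nest - 1)
        else:
            # Closing script/style tag: assume all open comments end here.
            nest = 0
            newdata.append(piece)
    return "".join(newdata)

# Abbreviated, made-up version of the runaway page above.
page = ('<script language="JavaScript" type="text/javascript">\n'
        '<!--\n'
        '<!-- Hide script from old browsers\n'
        'function choosePix() {}\n'
        '// End hiding script from old browsers -->\n'
        '</script>\n')
print(zapcomment_runaway(page))
```

On that abbreviated input the result is just the opening and closing script tags, each on its own line. It does nothing for a runaway comment outside a script or style element, so it is at best a partial answer.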

Thanks,

-- 
Skip Montanaro - skip at pobox.com
http://www.mojam.com/
http://www.musi-cal.com/



