Wildcard for string replacement?!?!

Mark McEahern marklists at mceahern.com
Mon Mar 10 14:41:14 EST 2003


> I'm working for over a week on this script but I can't make my way out.
> The whole idea is to replace (better say delete) anything that stands
> between the <td> and </td> tags of an html file.

One thing wrong with your pseudocode is that you assume the <td>blah</td>
never spans more than one line.
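
To see why that matters for a regular expression, here is a small sketch.
The sample string is made up for illustration; the point is only that '.'
does not match a newline unless you pass re.DOTALL.

```python
import re

# Hypothetical sample: the <td> element is split across two lines.
html = "<td>first line\nsecond line</td>"

# Without re.DOTALL, '.' stops at the newline, so the cell is never matched.
single_line = re.compile(r"<td>(.*?)</td>")

# With re.DOTALL, '.' also matches newlines, so the whole cell is matched.
multi_line = re.compile(r"<td>(.*?)</td>", re.DOTALL)

print(single_line.search(html))            # no match object
print(multi_line.search(html).group(1))    # the two-line cell contents
```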

I think there are two basic approaches to your problem:

1.  Use regular expressions.

2.  Use some library that lets you get at the HTML via an object model.

Approach 1 seems easier.  Try this...

#!/usr/bin/env python

import re

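def disembowel(html, tag):
    """Return the html with the innards of the specified tag removed."""
    # \b stops 'th' from also matching <thead>; [^>]* skips any attributes.
    template = r'(<%s\b[^>]*>)(.*?)(</%s\s*>)'
    name = re.escape(tag)
    # DOTALL lets '.' cross newlines; IGNORECASE matches <TD> as well as <td>.
    pattern = re.compile(template % (name, name), re.DOTALL | re.IGNORECASE)
    # Keep the opening and closing tags (groups 1 and 3), drop the contents.
    return pattern.sub(r'\1\3', html)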

html = """<html>
<head>
<title></title>
</head>
<body>
<table>
<tr>
<td
align="center">stuff</td>
</tr>
</table>
</body>
</html>"""

tag = 'td'
print(disembowel(html, tag))
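
For comparison, approach 2 might look roughly like the sketch below.  It
assumes Python 3 and the standard library's html.parser module; the names
TagEmptier and empty_tag are made up for this example.  Unlike a regular
expression, it copes with the same tag nested inside itself (a table
inside a td, say).  It is only a sketch: comments and doctype declarations
are dropped, and closing tags come out in lowercase.

```python
from html.parser import HTMLParser


class TagEmptier(HTMLParser):
    """Re-emit the HTML, dropping everything inside the given tag."""

    def __init__(self, tag):
        # Keep entity references as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.tag = tag.lower()
        self.depth = 0   # how many target tags we are currently inside
        self.out = []    # pieces of the output document

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            # get_starttag_text() returns the tag exactly as written.
            self.out.append(self.get_starttag_text())
        if tag == self.tag:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>.
        if self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
        if self.depth == 0:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append("&#%s;" % name)


def empty_tag(html, tag):
    """Return html with the contents of every <tag>...</tag> removed."""
    parser = TagEmptier(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<tr>\n<td\nalign="center">stuff</td>\n</tr>'
print(empty_tag(sample, 'td'))
```

It is a good deal more code than the regular expression, which is why I
suggested approach 1 first.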

More information about the Python-list mailing list