Wildcard for string replacement?!?!
Mark McEahern
marklists at mceahern.com
Mon Mar 10 14:41:14 EST 2003
> I'm working for over a week on this script but I can't make my
> way out. The whole idea is to replace (better say delete) anything
> that stands between the <td> and </td> tags of an html file.
One thing wrong with your pseudocode is that it assumes the <td>blah</td>
never spans more than one line.
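To make that concrete, here is a small sketch. The sample string and the variable names are made up for illustration. It checks each line on its own and then searches the whole text at once:

```python
import re

# Hypothetical sample: the opening <td ...> tag is split across two lines.
snippet = '<tr>\n<td\nalign="center">stuff</td>\n</tr>'
cell = r'<td.*?>.*?</td>'

# Line by line: no single line holds both the opening and the closing tag.
per_line_hits = [line for line in snippet.splitlines() if re.search(cell, line)]

# Whole string, with DOTALL so that "." also matches newlines.
whole_hits = re.findall(cell, snippet, re.DOTALL)

print(len(per_line_hits), len(whole_hits))
```

The per-line search finds nothing, while the whole-text search finds the one cell.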
I think there are two basic approaches to your problem:
1. Use regular expressions.
2. Use some library that lets you get at the HTML via an object model.
Option 1 seems easier. Try this:
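#!/usr/bin/env python
import re

def disembowel(html, tag):
    """Return the html with the innards of the specified tag removed."""
    # \b keeps a tag such as 'b' from also matching <body> or <br>.
    template = r'(<%s\b.*?>)(.*?)(</%s>)'
    _pattern = template % (tag, tag)
    # DOTALL lets '.' match newlines, so a tag may span several lines.
    pattern = re.compile(_pattern, re.DOTALL | re.IGNORECASE)
    # Keep the opening tag (\1) and the closing tag (\3); drop the contents.
    return pattern.sub(r'\1\3', html)
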
html = """<html>
<head>
<title></title>
</head>
<body>
<table>
<tr>
<td
align="center">stuff</td>
</tr>
</table>
</body>
</html>"""
tag = 'td'
print(disembowel(html, tag))
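
For comparison, here is a rough sketch of approach 2, assuming Python 3 and its standard-library html.parser module. The names TagEmptier and empty_tag are made up for this example. It is not a full HTML rewriter: comments, doctype declarations and processing instructions are simply dropped, and closing tags are re-emitted in lower case.

```python
from html.parser import HTMLParser


class TagEmptier(HTMLParser):
    """Re-emit the HTML it is fed, skipping everything inside one tag."""

    def __init__(self, tag):
        # convert_charrefs=False so entities such as &amp; are passed
        # through unchanged instead of being decoded into the data.
        super().__init__(convert_charrefs=False)
        self.tag = tag.lower()
        self.depth = 0   # how many target tags we are currently inside
        self.out = []    # pieces of the output document

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            # Raw text of the tag, including attributes and line breaks.
            self.out.append(self.get_starttag_text())
        if tag == self.tag:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open nothing, so depth is untouched.
        if self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
        if self.depth == 0:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append('&%s;' % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append('&#%s;' % name)


def empty_tag(html, tag):
    """Return html with the contents of every <tag>...</tag> removed."""
    parser = TagEmptier(tag)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


print(empty_tag('<tr><td\nalign="center">stuff <b>bold</b></td></tr>', 'td'))
```

Because the parser counts how deep it is inside the target tag, a <td> that itself contains a nested table is emptied up to its own closing </td>, which a non-greedy regular expression cannot do.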