Wildcard for string replacement?!?!
Jeff Epler
jepler at unpythonic.net
Mon Mar 10 14:32:27 EST 2003
You could use "regular expressions" to approach this problem.
Here is a regular expression that matches "from <td> to the next </td>":
"<td>.*?</td>"
Here is how you could use it for the replacement (untested):
    import re
    for line in inp.readlines():
        line = re.sub("<td>.*?</td>", " ", line)
        outp.write(line)
Note that this will not work if the "contents" of the <td> are split
across two lines, like so:
    <td> here are some words
    and here are some more </td>
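One possible way around the split-line problem, assuming the whole file
fits comfortably in memory, is to run the substitution over the entire text
at once with the re.DOTALL flag, so that "." also matches newlines. The
sample string below is made up for illustration; in practice it would come
from something like inp.read().

```python
import re

# Hypothetical sample text; a real program would use inp.read() instead.
text = "<td> here are some words\nand here are some more </td><td>x</td>"

# re.DOTALL lets "." match newlines, so a cell may span several lines.
# The non-greedy ".*?" still stops at the first </td> it finds.
result = re.sub(r"<td>.*?</td>", " ", text, flags=re.DOTALL)
```

Each of the two cells is replaced by one space, including the cell that
spans two lines.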
The second argument to re.sub can be a function. It will be passed a
"match object", so you have the opportunity to replace with something
besides a particular string.
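As a small sketch of the function form: the helper name "shout" and the
sample line are invented for this example. The pattern gains a group so the
function can get at the cell contents through the match object.

```python
import re

def shout(match):
    # match.group(1) is the text captured between <td> and </td>.
    # Upper-casing it is an arbitrary choice to show a computed replacement.
    return "<td>" + match.group(1).upper() + "</td>"

line = "<tr><td>one</td><td>two</td></tr>"

# re.sub calls shout() once per match and inserts whatever it returns.
new_line = re.sub(r"<td>(.*?)</td>", shout, line)
```

Here new_line keeps both cells but with their contents upper-cased.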
Here are some more possible problems with your approach:
* the first HTML (4.0) reference I opened says that the </td> is
optional, so it may not even be present in the file
* "non-greedy matches" (such as .*?) can lead to poor performance.
* As written, the pattern is case-sensitive, so <TD>, <Td>, and other
variants that browsers accept will not be caught by the program
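The case problem, at least, can be patched within the regular expression
approach. The sketch below uses re.IGNORECASE; the "[^>]*" part, which
tolerates attributes inside the opening tag, is my assumption about what
the input might contain. It does nothing about a missing </td>.

```python
import re

# re.IGNORECASE makes <td>, <TD> and <Td> all match.
# "\b[^>]*" allows attributes such as <td align="left"> (an assumption
# about the input, not something the original pattern handled).
cell = re.compile(r"<td\b[^>]*>.*?</td>", re.IGNORECASE)

line = '<TD>a</TD><Td align="left">b</tD><td>c</td>'
result = cell.sub(" ", line)
```

All three cells are replaced, whatever the case of their tags.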
If you're truly going to be processing html, you should use a module
which parses HTML correctly and accurately. Python has several modules
built-in to do this. It is not hard to write a class which does simple
processing on an HTML file, changing or omitting one tag and passing the
rest through unchanged.
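A rough sketch of such a class follows, written against html.parser as it
is named in current Python (older releases called the module HTMLParser).
The names TdStripper and strip_cells are invented here. It makes some
simplifying assumptions: nested tables are not handled, and end tags are
written back in lower case, so the pass-through is not byte-for-byte exact.

```python
from html.parser import HTMLParser

class TdStripper(HTMLParser):
    """Copy HTML through, replacing each <td> cell with a single space."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_td = False  # True while inside a cell being dropped

    def emit(self, text):
        # Only copy text through when we are outside a <td> cell.
        if not self.in_td:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> also ends any cell whose </td> was omitted.
            self.in_td = False
            self.emit(" ")
            self.in_td = True
        else:
            self.emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
            return
        if tag in ("tr", "table"):
            # The end of a row or table implicitly closes an open cell.
            self.in_td = False
        self.emit("</%s>" % tag)

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit("&%s;" % name)

    def handle_charref(self, name):
        self.emit("&#%s;" % name)

    def handle_comment(self, data):
        self.emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.emit("<!%s>" % decl)

def strip_cells(html):
    parser = TdStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Unlike the regular expression, this copes with upper-case tags, cells that
span lines, and a missing </td>, for example in
strip_cells("<table><tr><TD>a\nb</TD><td>c</tr></table>").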
Jeff