Wildcard for string replacement?!?!

Jeff Epler jepler at unpythonic.net
Mon Mar 10 14:32:27 EST 2003


You could use "regular expressions" to approach this problem.

Here is a regular expression that matches "from <td> to the next </td>":
    "<td>.*?</td>"

Here is how you could use it for the replacement (untested):
    import re

    for line in inp.readlines():
        line = re.sub("<td>.*?</td>", " ", line)
        outp.write(line)
Note that this will not work if the "contents" of the <td> are split
across two lines, like so:
    <td> here are some words
    and here are some more </td>
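
One possible way around the split-line problem, as a rough sketch: read
the whole file into a single string and compile the pattern with
re.DOTALL, so that "." also matches newlines. The names TD_PATTERN and
blank_cells are made up for this illustration.

```python
import re

# re.DOTALL lets "." match newlines, so a cell may span several lines.
TD_PATTERN = re.compile(r"<td>.*?</td>", re.DOTALL)

def blank_cells(text):
    """Replace every <td>...</td> span in text with a single space."""
    return TD_PATTERN.sub(" ", text)

sample = "<tr><td> here are some words\nand here are some more </td></tr>"
print(blank_cells(sample))
```

This trades the line-by-line loop for holding the whole file in memory,
which is usually fine for ordinary HTML pages.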

The second argument to re.sub can be a function.  It will be passed a
"match object", so you have the opportunity to replace with something
besides a particular string.
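
As a small sketch of that idea, the function below receives the match
object and builds the replacement from the captured cell contents. The
function name shout and the choice of upper-casing are only for
illustration.

```python
import re

def shout(match):
    """Return the matched cell with its contents upper-cased."""
    # group(1) is whatever the parenthesized part of the pattern matched.
    return "<td>" + match.group(1).upper() + "</td>"

line = "<tr><td>spam</td><td>eggs</td></tr>"
result = re.sub(r"<td>(.*?)</td>", shout, line)
print(result)
```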

Here are some more possible problems with your approach:
    * the first HTML (4.0) reference I opened says that the </td> end
      tag is optional, so it may not even be present in the file
    * "non-greedy matches" (such as .*?) can lead to poor performance.
    * As written, the pattern will not catch <TD>, <Td>, or other
      spellings that browsers accept
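
The case problem, at least, can be handled within the regular expression
approach by passing the re.IGNORECASE flag. This is only a sketch (the
name TD_ANY_CASE is made up), and it still does not match a start tag
that carries attributes, such as <td align="left">.

```python
import re

# re.IGNORECASE makes the literal letters in the pattern match any case.
TD_ANY_CASE = re.compile(r"<td>.*?</td>", re.IGNORECASE)

result = TD_ANY_CASE.sub(" ", "<TD>a</TD>|<Td>b</tD>|<td>c</td>")
print(repr(result))
```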

If you're truly going to be processing HTML, you should use a module
which parses HTML correctly and accurately.  Python's standard library
includes several modules for this.  It is not hard to write a class which
does simple processing on an HTML file, changing or omitting one tag and
passing the rest through unchanged.
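
A minimal sketch of such a class, assuming a current Python where the
parser lives in html.parser (HTMLParser). The class name TdStripper and
its emit helper are made up for this illustration, and the sketch assumes
every <td> has a matching </td>; it does not try to handle omitted end
tags or processing instructions.

```python
from html.parser import HTMLParser

class TdStripper(HTMLParser):
    """Copy HTML through, replacing each <td> element (tags and
    contents) with a single space."""

    def __init__(self):
        # Keep entity and character references as separate events so
        # they can be written back out as they appeared.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # number of <td> elements we are currently inside

    def emit(self, text):
        # Only pass text through when we are outside every <td>.
        if self.depth == 0:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TD> and <Td> arrive as "td".
        if tag == "td":
            if self.depth == 0:
                self.out.append(" ")
            self.depth += 1
        else:
            self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "td":
            self.depth = max(0, self.depth - 1)
        else:
            # Note: the end tag is rebuilt from the lower-cased name.
            self.emit("</%s>" % tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through as written.
        self.emit(self.get_starttag_text())

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit("&%s;" % name)

    def handle_charref(self, name):
        self.emit("&#%s;" % name)

    def handle_comment(self, data):
        self.emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.emit("<!%s>" % decl)

parser = TdStripper()
parser.feed('<table><tr><TD class="x">here are\nsome words</td>'
            '<th>kept</th></tr></table>')
parser.close()
result = "".join(parser.out)
print(result)
```

Unlike the regular expression, this copes with mixed-case tags,
attributes on the <td>, and cells that span several lines.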

Jeff




