Wildcard for string replacement?!?!

Steven Taschuk staschuk at telusplanet.net
Mon Mar 10 16:07:20 EST 2003


Quoth Perverted Orc:
> I've been working on this script for over a week but I can't find my way
> out. The whole idea is to replace (better say delete) anything that stands
> between the <td> and </td> tags of an html file.
  [...]
> for line in inp.readlines():
>      nline=replace(line,"?????"," ")
>      outp.write(nline)
  [...]
> Can anyone tell me what to fill in the ????? area? I tried any possible
> combination with "." and "*" but didn't work.

string.replace doesn't know anything about wildcards; it only
replaces exact substrings.
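
To illustrate, here is a small sketch using the str.replace method of
current Python, which behaves the same way as string.replace in this
respect. The sample line is made up for the example.

```python
line = "<tr><td>1.9</td><td>0.003</td></tr>"

# '.' and '*' are taken literally, so nothing matches
# and the line comes back unchanged.
unchanged = line.replace("<td>.*</td>", " ")

# Only an exact substring is replaced.
exact = line.replace("<td>1.9</td>", " ")

print(unchanged == line)  # True
print(exact)              # <tr> <td>0.003</td></tr>
```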

If you really want pattern matching, here's a start:
	import re
	re.sub('<td>.*?</td>', '', '...')
You'll need to make some additions to deal with variant
capitalizations, cells spanning more than one line, and attributes
occurring on the start tag.  (And, as Laotseu pointed out, the
problem is underspecified: What about nested tables?  What about
table cells inside HTML comments?  Why just <td> and not <th>?  Or
the entire table?)
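
A rough sketch of those first three additions, in current Python. The
name TD_CELL and the sample string are made up for illustration, and
the pattern still assumes that every cell has an explicit </td>, that
no attribute value contains '>', and that tables are not nested.

```python
import re

# Hypothetical pattern name. IGNORECASE handles <TD> vs <td>, DOTALL lets
# '.' cross line breaks, and [^>]* skips attributes on the start tag.
TD_CELL = re.compile(
    r"<td\b[^>]*>.*?</td\s*>",
    re.IGNORECASE | re.DOTALL,
)

sample = '<TR><TD class="n">1.9</TD><td>\n0.003\n</td></TR>'
stripped = TD_CELL.sub("", sample)
print(stripped)  # <TR></TR>
```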

A more serious problem is that pre-XHTML versions of HTML permit
the closing tag </td> to be implicit.  Consider this (abbreviated)
example from the HTML 4.01 spec:
	<TABLE>
	<TR><TH>Males<TD>1.9<TD>0.003<TD>40%
	<TR><TH>Females<TD>1.7<TD>0.002<TD>43%
	</TABLE>
Here we have <td> elements implicitly terminated by other <td>s,
by <tr>s, and by </table>.  Identifying table cells is much more
difficult in such cases (especially since, as Laotseu pointed out,
cells can contain nested tables).
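
To see the problem concretely, here is a quick check (a sketch, using
one plausible extended pattern rather than anything definitive) run
against the example above. There is no </td> anywhere in the input,
so the pattern finds no cells at all.

```python
import re

html = """<TABLE>
<TR><TH>Males<TD>1.9<TD>0.003<TD>40%
<TR><TH>Females<TD>1.7<TD>0.002<TD>43%
</TABLE>"""

# The pattern requires an explicit closing tag, which never appears.
matches = re.findall(
    r"<td\b[^>]*>.*?</td\s*>", html, re.IGNORECASE | re.DOTALL
)
print(matches)  # []
```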

And this is to say nothing about malformed HTML, which is quite
common in the wild.

Far better to forget about pattern matching and use an existing
HTML parser.
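
As a minimal sketch of that approach, here is one way it might look
with html.parser from the standard library of current Python. The
names CellStripper and strip_cells are made up for this example.
Note the assumptions: html.parser only tokenizes, so the rules for
implicitly closed cells below are hand-written and simplified;
nested tables are not handled; comments and doctype declarations are
dropped; and end tags are re-emitted in lower case.

```python
from html.parser import HTMLParser


class CellStripper(HTMLParser):
    """Hypothetical helper: re-emits its input minus the contents of <td> cells."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_td = False

    def emit(self, text):
        # Anything seen while inside a <td> is discarded.
        if not self.in_td:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        # A new cell or row implicitly ends any open <td>.
        if tag in ("td", "th", "tr"):
            self.in_td = False
        self.emit(self.get_starttag_text())
        if tag == "td":
            self.in_td = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, with no end tag.
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self.in_td = False
        self.emit("</%s>" % tag)

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit("&%s;" % name)

    def handle_charref(self, name):
        self.emit("&#%s;" % name)


def strip_cells(html):
    """Return html with the contents of every <td> cell removed."""
    parser = CellStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_cells("<table><tr><th>Males<td>1.9<td>0.003</table>"))
# <table><tr><th>Males<td><td></table>
```

This keeps the <td> tags themselves and empties the cells, including
cells whose closing tag is implicit. For real-world input, a parser
that builds a proper document tree would be a safer choice than
hand-written closing rules like these.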

-- 
Steven Taschuk                                                   w_w
staschuk at telusplanet.net                                      ,-= U
                                                               1 1




