re troubles

Mon Dec 22 11:29:03 EST 2003

Evanda Remington <evanda at remingtons.org> wrote:
> I'm trying to filter some rows of an html table out, based on their
> contents.  For input like:
> """
><table>
>  <tr>
>     <td>Lasers</td><td>17</td> </tr>
>  <tr>                                            <<  want to filter
>     <td>kittens</td><td>8</td>                    <<  this out.
>  </tr>                                           <<
>  <tr> <td>robots</td><td>8</td> </tr>
></table>
> """
> I would like to completely remove the (3 line) table row that makes mention
> of kittens.  The regexp I have tried to use is: r"<tr>.*?kittens.*?</tr>".
> When compiled and used with subs("",data), strangely removes everything
> from the first "<tr>" to the first "<tr>" after kittens.
>
> That is, the ".*?" notation works in the second half, but not in the first
> half.  It behaves the same as ".*" should.
>
> Any advice?

Parsing HTML with regular expressions is notoriously tricky. Have you
tried using HTMLParser yet? If you have tried it and it doesn't work for
you for some reason, then you may have to deal with regexps. But if you
haven't tried HTMLParser, you may find it a lot easier than regexps for
this task.
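
For what it's worth, here is a rough sketch of the HTMLParser approach,
not a polished solution. It assumes a current Python, where the module
lives at html.parser (in older releases it was the HTMLParser module).
The class name RowFilter and its attributes are made up for illustration.
The idea is to buffer each <tr> as it goes by and only copy it to the
output if its text doesn't mention the unwanted word. It also assumes the
rows are not nested, and it silently drops comments and doctype
declarations, which you may or may not care about.

```python
from html.parser import HTMLParser


class RowFilter(HTMLParser):
    """Copy HTML through, dropping any <tr> whose text mentions a word.

    Hypothetical helper for illustration; assumes <tr> elements do not nest.
    """

    def __init__(self, unwanted):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.unwanted = unwanted
        self.out = []        # pieces of markup kept so far
        self.row = None      # raw pieces of the row being buffered, or None
        self.row_text = []   # text content seen inside the buffered row

    def _emit(self, raw):
        # Inside a row, buffer the markup; outside, keep it right away.
        (self.out if self.row is None else self.row).append(raw)

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and self.row is None:
            self.row, self.row_text = [], []
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit("</%s>" % tag)
        if tag == "tr" and self.row is not None:
            row, self.row = self.row, None
            # Keep the row only if the unwanted word is absent from its text.
            if self.unwanted not in "".join(self.row_text):
                self.out.extend(row)

    def handle_data(self, data):
        if self.row is not None:
            self.row_text.append(data)
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)


data = """<table>
  <tr>
     <td>Lasers</td><td>17</td> </tr>
  <tr>
     <td>kittens</td><td>8</td>
  </tr>
  <tr> <td>robots</td><td>8</td> </tr>
</table>
"""

parser = RowFilter("kittens")
parser.feed(data)
parser.close()
result = "".join(parser.out)
print(result)
```

The whitespace around the removed row is left in place, so the output
will have a blank-looking line where the kittens row used to be.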

-- 
Robin Munn
rmunn at pobox.com