re module non-greedy matches broken

Georg Brandl g.brandl at gmx.net
Tue Apr 5 13:07:58 EDT 2005


lothar wrote:
> give an re to find every innermost "table" element:
> 
> innertabdoc = """
> <table border="0" cellspacing="0" cellpadding="0">
>   <tr><td>
> <table border="0" cellspacing="0" cellpadding="0">
>   <tr><td> <a>n</a>
>   </td></tr>
> </table>
>   </td></tr>
> </table>
> <table border="0" cellspacing="0" cellpadding="0">
>   <tr><td>
> <table border="0" cellspacing="0" cellpadding="0">
>   <tr><td> </td> <td>
> <table border="0" cellspacing="0" cellpadding="0">
>   <tr><td> <p>y</p> <td> z</td>
>   </td></tr>
> </table>
>   </td></tr>
> </table>
>   </td></tr>
>   <tr><td>
> <table border="0" cellspacing="0" cellpadding="0">
>   <tr><td>
>   </td></tr>
> </table>
>   </td></tr>
> </table>
> """

REs are Regular Expressions, not parsers. There are problems for
which there is no RE solution (I'm not implying that this is the
case in your example).
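
For this particular input, a pattern with a negative lookahead appears to
be enough. This is only a sketch, not a general answer: the names
INNERMOST_TABLE and innermost_tables are made up for illustration, and
the pattern assumes well-formed markup with no "<table" text inside
comments, scripts, or attribute values.

```python
import re

# A <table ...> start tag, then any run of characters in which no further
# "<table" begins, up to the nearest "</table>". A table that contains
# another table cannot match, so only innermost tables are returned.
INNERMOST_TABLE = re.compile(
    r"<table\b[^>]*>(?:(?!<table\b).)*?</table>",
    re.DOTALL | re.IGNORECASE,
)


def innermost_tables(html):
    """Return the source text of every table that contains no nested table."""
    return INNERMOST_TABLE.findall(html)


sample = "<table><tr><td><table><tr><td>n</td></tr></table></td></tr></table>"
print(innermost_tables(sample))
```

The non-greedy quantifier alone does not do this: it only shortens the
match from a given starting point, and the engine still starts at the
leftmost "<table". The lookahead is what rules out the outer tables.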

In any case, complex text processing should be done with tools
better suited to the task. Here, the HTMLParser module seems like a
reasonable choice.
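
A minimal sketch of the parser approach follows. It assumes a current
Python, where the class lives in html.parser (the module was named
HTMLParser at the time of writing). InnermostTableFinder and
find_innermost_tables are hypothetical names, and the code only reports
where each innermost table starts and ends rather than extracting it.

```python
from html.parser import HTMLParser


class InnermostTableFinder(HTMLParser):
    """Record the positions of tables that contain no nested table."""

    def __init__(self):
        super().__init__()
        # One [start_position, has_nested_table] entry per open <table>.
        self._open = []
        # (start_position, end_position) pairs; positions come from getpos().
        self.innermost = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._open:
                # The enclosing table now has a child, so it is not innermost.
                self._open[-1][1] = True
            self._open.append([self.getpos(), False])

    def handle_endtag(self, tag):
        if tag == "table" and self._open:
            start, has_nested = self._open.pop()
            if not has_nested:
                self.innermost.append((start, self.getpos()))


def find_innermost_tables(html):
    finder = InnermostTableFinder()
    finder.feed(html)
    finder.close()
    return finder.innermost


doc = "<table>\n<tr><td>\n<table>\n<tr><td>n</td></tr>\n</table>\n</td></tr>\n</table>\n"
for start, end in find_innermost_tables(doc):
    print("innermost table from line", start[0], "to line", end[0])
```

Each position is a (line, column) tuple as returned by getpos(), so the
table text can be sliced out of the original string afterwards if needed.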

Regards,
Georg


