re module non-greedy matches broken
Georg Brandl
g.brandl at gmx.net
Tue Apr 5 13:07:58 EDT 2005
lothar wrote:
> give an re to find every innermost "table" element:
>
> innertabdoc = """
> <table border="0" cellspacing="0" cellpadding="0">
> <tr><td>
> <table border="0" cellspacing="0" cellpadding="0">
> <tr><td> <a>n</a>
> </td></tr>
> </table>
> </td></tr>
> </table>
> <table border="0" cellspacing="0" cellpadding="0">
> <tr><td>
> <table border="0" cellspacing="0" cellpadding="0">
> <tr><td> </td> <td>
> <table border="0" cellspacing="0" cellpadding="0">
> <tr><td> <p>y</p> <td> z</td>
> </td></tr>
> </table>
> </td></tr>
> </table>
> </td></tr>
> <tr><td>
> <table border="0" cellspacing="0" cellpadding="0">
> <tr><td>
> </td></tr>
> </table>
> </td></tr>
> </table>
> """
REs are Regular Expressions, not parsers. There are problems for
which there is no RE solution (I'm not implying that this is the
case in your example).
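
For input as regular as the sample above, a pattern along the following lines may be enough. This is only a sketch: it assumes well-formed markup in which the text "<table" never appears inside comments, scripts, or attribute values, and the names INNERMOST_TABLE and sample_doc are made up for illustration. The idea is to forbid another opening table tag between a "<table ...>" and the next "</table>", rather than to rely on non-greedy matching alone.

```python
import re

# A table start tag, then any characters that do not begin another
# "<table" tag, up to the nearest closing tag. The negative lookahead
# is what restricts matches to tables with no table nested inside.
INNERMOST_TABLE = re.compile(
    r"<table\b[^>]*>(?:(?!<table\b).)*?</table>",
    re.DOTALL | re.IGNORECASE,
)

# Shortened, hypothetical stand-in for the document quoted above.
sample_doc = """
<table border="0">
<tr><td>
<table border="0">
<tr><td> <a>n</a>
</td></tr>
</table>
</td></tr>
</table>
<table border="0">
<tr><td> x </td></tr>
</table>
"""

innermost = INNERMOST_TABLE.findall(sample_doc)
for table in innermost:
    print(table)
    print("-" * 20)
```

The outer table is skipped because the lookahead fails as soon as the nested "<table" is reached, so the only matches are tables that contain no other table.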
In any case, complex text processing should be done with tools
better suited to it. Here, HTMLParser seems like a
reasonable choice.
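
A minimal sketch of the HTMLParser approach, written against the Python 3 module html.parser (the class was in a module named HTMLParser in Python 2). The class name InnermostTableFinder, the helper find_innermost_tables, and the shortened sample are all made up for illustration. The parser keeps a stack of open tables and reports a table as innermost if no other table was opened before it closed.

```python
from html.parser import HTMLParser


class InnermostTableFinder(HTMLParser):
    """Record (start_line, end_line) for every table with no nested table."""

    def __init__(self):
        super().__init__()
        self._open = []       # stack of [start_line, has_nested_table]
        self.innermost = []   # (start_line, end_line) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._open:
                # The enclosing table now has a nested table.
                self._open[-1][1] = True
            self._open.append([self.getpos()[0], False])

    def handle_endtag(self, tag):
        if tag == "table" and self._open:
            start_line, has_nested = self._open.pop()
            if not has_nested:
                self.innermost.append((start_line, self.getpos()[0]))


def find_innermost_tables(html):
    finder = InnermostTableFinder()
    finder.feed(html)
    finder.close()
    return finder.innermost


# Shortened, hypothetical stand-in for the quoted document.
table_doc = """<table>
<tr><td>
<table>
<tr><td> <a>n</a>
</td></tr>
</table>
</td></tr>
</table>
<table>
<tr><td> x </td></tr>
</table>
"""

for start, end in find_innermost_tables(table_doc):
    print("innermost table on lines %d-%d" % (start, end))
```

Line numbers come from getpos(), which reports where the tag currently being handled begins; they can be used to slice the original text if the markup itself is needed.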
Regards,
Georg