search regular expression

Sat Feb 15 09:35:11 EST 2003

On Fri, 14 Feb 2003 03:41:24 +0100, an infinite amount of monkeys
hijacked the computer of Marian Förster
<marian.foerster at informatik.tu-chemnitz.de> and wrote:

>i need a regular expression to found table like:
>
><table cel....>
><tr>
><td><a href=jjjjj><img src="ffff.... alt="Frame verlassen>  sdksdsdkl
>text1
>text2
></td>
></tr>
></table>
>
>
>with
>
>patt=re.compile(>table.*>(?=.*Frame 
>verlassen)(?P<INHALT>.*)</table>",re.S|re.I|re.DOTALL)
>
>and patt.search( table input )
>
>only found text at beginning of <img........
>but I need text at beginning from <table cel....

As Jp Calderone mentioned, you should use a parser if possible. If
for some reason you can't, for example because the HTML is terribly
badly formed, try putting it through something like htmltidy first.

If you still want to use regular expressions you can do something like
the following. (Sorry this code isn't that clear; I wrote it a while
ago when I first came to Python.)

#-----CODE-----
import re

tagStart = r"<table[^>]*>"
tagEnd   = r"</table>"

# All three patterns capture the same five groups:
# group(1) == text before tagStart
# group(2) == tagStart
# group(3) == text between the tags
# group(4) == tagEnd
# group(5) == text after tagEnd
# They differ only in which of the (.*) groups are greedy.
veryGreedy = re.compile(r"^(.*)(%s)(.*)(%s)(.*)$"%(
    tagStart, tagEnd), re.MULTILINE | re.DOTALL)

bitGreedy  = re.compile(r"^(.*?)(%s)(.*)(%s)(.*)$"%(
    tagStart, tagEnd), re.MULTILINE | re.DOTALL)

notGreedy  = re.compile(r"^(.*?)(%s)(.*?)(%s)(.*)$"%(
    tagStart, tagEnd), re.MULTILINE | re.DOTALL)
#-----END------

Which expression you use (veryGreedy, bitGreedy, notGreedy) depends on
the shape of your data. E.g.

# simple
htmlSimple = """\
before table 
<table some other tag stuff> 
inside table 
</table> 
after table
"""

Any one of the expressions above will match htmlSimple correctly.

# nested
htmlNested = """\
before table 1
<table this is table 1> 
some stuff inside table 1
<table this is table 2> 
inside table 2 
</table> 
after table 2
</table>
after table 1
"""

Only bitGreedy will successfully grab table 1. veryGreedy starts at
the opening tag of table 2, and notGreedy stops at the closing tag of
table 2, so neither extracts the right text.
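
To see the difference for yourself, here is a small sketch that runs
all three expressions over the nested sample and prints what each one
captures. The patterns are the ones from the code above; the
`patterns` dict and the `captured` helper are names I made up for this
illustration.

```python
import re

tag_start = r"<table[^>]*>"
tag_end = r"</table>"
flags = re.MULTILINE | re.DOTALL

# Same three expressions as above, keyed by name for easy comparison.
patterns = {
    "veryGreedy": re.compile(r"^(.*)(%s)(.*)(%s)(.*)$" % (tag_start, tag_end), flags),
    "bitGreedy": re.compile(r"^(.*?)(%s)(.*)(%s)(.*)$" % (tag_start, tag_end), flags),
    "notGreedy": re.compile(r"^(.*?)(%s)(.*?)(%s)(.*)$" % (tag_start, tag_end), flags),
}

html_nested = """\
before table 1
<table this is table 1>
some stuff inside table 1
<table this is table 2>
inside table 2
</table>
after table 2
</table>
after table 1
"""


def captured(name, text):
    """Return (opening tag, text between the tags) for the named pattern."""
    m = patterns[name].search(text)
    return m.group(2), m.group(3)


for name in patterns:
    opening, inner = captured(name, html_nested)
    print(name, "->", opening, repr(inner))
```

Only the bitGreedy line should show table 1's opening tag together
with everything up to the last closing tag.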

# sequential
htmlSeq = """\
before table 1
<table this is table 1> 
inside table 1
</table> 
after table 1

before 2 table 
<table this is table 2> 
inside table 2
</table> 
after table 2
"""

Only notGreedy will successfully grab table 1. bitGreedy runs from
the opening tag of table 1 to the closing tag of table 2. veryGreedy
would grab table 2, i.e. the last table in the text (so long as it's
not nested).
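
If the tables are strictly sequential you probably want all of them,
not just the first. A possible variation (not one of the three
expressions above) is to drop the leading and trailing groups and let
`re.finditer` walk the text. The name `table_re` is my own, and the
IGNORECASE flag is an assumption about what your HTML looks like. This
only works when no table is nested inside another.

```python
import re

# Non-greedy body: each match stops at the first closing tag it meets,
# which is only correct when tables are not nested.
table_re = re.compile(r"(<table[^>]*>)(.*?)(</table>)",
                      re.DOTALL | re.IGNORECASE)

html_seq = """\
before table 1
<table this is table 1>
inside table 1
</table>
after table 1

before 2 table
<table this is table 2>
inside table 2
</table>
after table 2
"""

# group(2) is the text between the opening and closing tags.
tables = [m.group(2).strip() for m in table_re.finditer(html_seq)]
print(tables)
```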

# nested + sequential
htmlComplex = """\
before table 1
<table this is table 1> 
some stuff inside table 1
<table this is table 2> 
inside table 2 
</table> 
after table 2
</table>
after table 1

before 3 table 
<table this is table 3> 
inside table 3
</table> 
after table 3
"""

None of the expressions here will correctly match table 1. veryGreedy
will extract table 3, but this is mere coincidence and shouldn't be
relied on. To extract the tables here properly you will need to write
a slightly more complex program (left as an exercise for the reader :)
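
For what it's worth, one way to approach that exercise is to stop
asking a single expression to do the pairing and instead count nesting
depth while scanning the tags. The sketch below is only one possible
approach; `TAG_RE` and `top_level_tables` are hypothetical names, and
I've added `\b` and IGNORECASE to the tag pattern, which the
expressions above don't have. It still knows nothing about comments or
`<table` appearing inside attribute values.

```python
import re

# Matches either an opening or a closing table tag.
TAG_RE = re.compile(r"<table\b[^>]*>|</table>", re.IGNORECASE)


def top_level_tables(text):
    """Return the full text of every outermost <table>...</table> block."""
    tables = []
    depth = 0
    start = None
    for m in TAG_RE.finditer(text):
        if m.group(0).startswith("</"):
            if depth == 0:
                continue  # stray closing tag with nothing open: skip it
            depth -= 1
            if depth == 0:
                # Back at top level, so this closes an outermost table.
                tables.append(text[start:m.end()])
        else:
            if depth == 0:
                start = m.start()  # an outermost table begins here
            depth += 1
    return tables


html_complex = """\
before table 1
<table this is table 1>
some stuff inside table 1
<table this is table 2>
inside table 2
</table>
after table 2
</table>
after table 1

before 3 table
<table this is table 3>
inside table 3
</table>
after table 3
"""

for block in top_level_tables(html_complex):
    print(block)
    print("-----")
```

Table 2 comes back as part of table 1 rather than as its own entry; if
you want the inner tables too, call the function again on the text
between the outer tags.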

The expressions above only work if your data looks like htmlSimple,
htmlSeq or htmlNested *and all of it is the same shape*. If it is any
other shape, or a mix of shapes, then it's probably better to use the
parser (the sgml parser is quite forgiving).
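
As a rough idea of what the parser route looks like: sgmllib no longer
exists in Python 3, so this sketch assumes the standard library's
`html.parser.HTMLParser` as a stand-in. The `TableText` class and the
sample input are made up for illustration. It collects the text found
inside each outermost table and leaves the nesting bookkeeping to a
simple counter.

```python
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the text found inside each outermost <table> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # how many <table> elements are currently open
        self.tables = []   # one list of text chunks per outermost table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth == 0:
                self.tables.append([])
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-blank text only while we are inside some table.
        if self.depth > 0 and data.strip():
            self.tables[-1].append(data.strip())


# Hypothetical sample input: one table with a nested table, then another.
sample = """\
before
<table border="1">
<tr><td>text1
<table><tr><td>inner</td></tr></table>
text2</td></tr>
</table>
between
<table><tr><td>text3</td></tr></table>
after
"""

parser = TableText()
parser.feed(sample)
parser.close()
print(parser.tables)
```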