Regular expression to structure HTML

Fri Oct 2 01:10:55 EDT 2009

I'm kind of new to regular expressions, and I've spent hours trying to
finesse a regular expression to build a substitution.

What I'd like to do is extract data elements from HTML and structure
them so that they can more readily be imported into a database.

No -- sorry -- I don't want to use BeautifulSoup (though I have for
other projects). Humor me, please -- I'd really like to see if this
can be done with just regular expressions.

Note that the output is referenced using named groups.
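
To show what I mean by that, here is a minimal sketch on a made-up one-cell string (not my real data). A group named in the pattern with (?P<name>...) is referenced in the replacement as \g<name>. The variable names sample and labeled are just for illustration.

```python
import re

# Hypothetical one-cell sample, not the real page.
sample = "<td valign=top>8583</td>"

# (?P<zlicense>...) names the group in the pattern;
# \g<zlicense> refers back to it in the replacement string.
labeled = re.sub(
    r"<td valign=top>(?P<zlicense>\d+)</td>",
    r"LICENSE:\g<zlicense>",
    sample,
)
print(labeled)  # LICENSE:8583
```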

My challenge is successfully matching the HTML tags between the first
table row and the second table row.

I'd appreciate any suggestions to improve the approach.

import re

# Two sample table rows; adjacent string literals are joined by Python.
rText = (
    "<tr><td valign=top>8583</td>"
    "<td valign=top><a href=lic_details.asp?lic_number=8583>"
    "New Horizon Technical Academy, Inc #4</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70114</td></tr>"
    "<tr><td valign=top>9371</td>"
    "<td valign=top><a href=lic_details.asp?lic_number=9371>"
    "Career Learning Center</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70113</td></tr>"
)

# My current attempt, split over several lines for readability.
pattern = re.compile(
    r'(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)(<td valign=top>)'
    r'(<a href=lic_details.asp)(\?lic_number=\d+)(>)'
    r'(?P<zname>[A-Za-z0-9#\s\S\W]+)(</.*?>).+$'
)
rText = pattern.sub(r'LICENSE:\g<zlicense>|NAME:\g<zname>\n', rText)

print(rText)

Output (a single line):

LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113
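
One direction that might work, offered only as a sketch: match one row at a time instead of anchoring on the end of the string, and use [^<]+ for the text fields so that no group can run past its own closing tag. The group names zparish and zzip and the variable names html, row_pattern and structured are my own placeholders; the PARISH and ZIP labels follow the output above. It assumes every row has exactly this four-cell layout, which may not hold for the real page.

```python
import re

# Same two sample rows as in my post.
html = (
    "<tr><td valign=top>8583</td>"
    "<td valign=top><a href=lic_details.asp?lic_number=8583>"
    "New Horizon Technical Academy, Inc #4</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70114</td></tr>"
    "<tr><td valign=top>9371</td>"
    "<td valign=top><a href=lic_details.asp?lic_number=9371>"
    "Career Learning Center</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70113</td></tr>"
)

# One complete <tr>...</tr> per match. [^<]+ stops at the next tag,
# so a field cannot swallow the rest of the document.
row_pattern = re.compile(
    r"<tr><td valign=top>(?P<zlicense>\d+)</td>"
    r"<td valign=top><a href=lic_details\.asp\?lic_number=\d+>"
    r"(?P<zname>[^<]+)</a></td>"
    r"<td valign=top>(?P<zparish>[^<]+)</td>"
    r"<td valign=top>(?P<zzip>\d+)</td></tr>"
)

# sub() replaces every non-overlapping match, so each row becomes one line.
structured = row_pattern.sub(
    r"LICENSE:\g<zlicense>|NAME:\g<zname>|PARISH:\g<zparish>|ZIP:\g<zzip>\n",
    html,
)
print(structured)
```

With the sample above, each row should come out as its own pipe-delimited line. A row that does not fit the assumed layout would be left in the text unchanged, so stray HTML in the result would point to rows the pattern missed.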