Regular expression to structure HTML

Fri Oct 2 09:38:16 EDT 2009

On Oct 2, 1:10 am, "504cr... at gmail.com" <504cr... at gmail.com> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.
>
> No -- sorry -- I don't want to use BeautifulSoup (though I have for
> other projects). Humor me, please -- I'd really like to see if this
> can be done with just regular expressions.
>
> Note that the output is referenced using named groups.
>
> My challenge is successfully matching the HTML tags in between the
> first table row, and the second table row.
>
> I'd appreciate any suggestions to improve the approach.
>
> rText = "<tr><td valign=top>8583</td><td valign=top><a
> href=lic_details.asp?lic_number=8583>New Horizon Technical Academy,
> Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></
> tr><tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?
> lic_number=9371>Career Learning Center</a></td><td
> valign=top>Jefferson</td><td valign=top>70113</td></tr>"
>
> rText = re.compile(r'(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)(<td
> valign=top>)(<a href=lic_details.asp)(\?lic_number=\d+)(>)(?P<zname>[A-
> Za-z0-9#\s\S\W]+)(</.*?>).+$').sub(r'LICENSE:\g<zlicense>|NAME:
> \g<zname>\n', rText)
>
> print rText
>
> LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4</a></td><td
> valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td
> valign=top>9371</td><td valign=top><a href=lic_details.asp?
> lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113

Some suggestions to start off with:

  * triple-quote your multiline strings
  * consider using the re.X, re.M, and re.S flags with re.compile()
  * save the compiled pattern object in a variable so you can reuse it
  * note that re.sub() returns a new string; the original is unchanged
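
Here is a minimal sketch of those four points together. The names
ROW_START, sample, and result are made up for illustration, and the
input is a shortened stand-in for your data, not your full string.

```python
import re

# Triple-quoted so the pattern can span several lines. With re.X,
# unescaped whitespace in the pattern is ignored and "#" starts a
# comment, so a literal space has to be written as "\ ".
ROW_START = re.compile(r"""
    <tr><td\ valign=top>     # opening of the first cell
    (?P<zlicense>\d+)        # license number
    </td>                    # end of the first cell
    """, re.X)

# Hypothetical, shortened input.
sample = "<tr><td valign=top>8583</td>"

# The compiled object is saved above and reused here. sub() returns a
# new string and leaves `sample` as it was.
result = ROW_START.sub(r"LICENSE:\g<zlicense>", sample)

print(sample)
print(result)
```

re.S (so "." also matches newlines) and re.M (so "^" and "$" match at
each line) only matter once the text being searched spans lines.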

Also, it sounds like you want to replace each <tr> element with the
contents of its first two <td> elements, separated by a pipe (throwing
away the <td> tags themselves), correct?
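
If that reading is right, something along these lines might be a
starting point. It is only a sketch: it assumes the line breaks in your
posted string came from mail wrapping, that every row has the same
shape as your two samples, and that the parish and ZIP cells can be
dropped. ROW, html, and structured are names I made up.

```python
import re

# The two sample rows from the post, joined back into one string
# (assumes the line breaks in the post were added by mail wrapping).
html = (
    "<tr><td valign=top>8583</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=8583>"
    "New Horizon Technical Academy, Inc #4</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70114</td></tr>"
    "<tr><td valign=top>9371</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=9371>"
    "Career Learning Center</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70113</td></tr>"
)

# One pattern per row. The non-greedy ".*?" stops at the first place
# the rest of the pattern can match, so a match never runs past the
# closing </tr> of its own row.
ROW = re.compile(r"""
    <tr>
    <td\ valign=top>(?P<zlicense>\d+)</td>                  # 1st cell
    <td\ valign=top><a\ href=[^>]*>(?P<zname>.*?)</a></td>  # 2nd cell
    .*?                                                     # other cells
    </tr>
    """, re.X | re.S)

# sub() replaces every non-overlapping match, i.e. once per row.
structured = ROW.sub(r"LICENSE:\g<zlicense>|NAME:\g<zname>\n", html)

print(structured)
```

The main change from your pattern is replacing the greedy
[A-Za-z0-9#\s\S\W]+ with .*?; the greedy class matches any character
at all, which is why your name group swallowed the rest of the first
row and most of the second.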

---John