Regular expression to structure HTML

Fri Oct 2 15:05:10 EDT 2009

Screw:

>>> html = """  <tr>

    <td valign=top>14313
    </td>

    <td valign=top><a href=lic_details.asp?lic_number=14313>Python Hammer Institute #2</a>
    </td>

    <td valign=top>Jefferson
    </td>

    <td valign=top>70114
    </td>

  </tr>

  <tr>

    <td valign=top>8583
    </td>

    <td valign=top><a href=lic_details.asp?lic_number=8583>New Screwdriver Technical Academy, Inc #4</a>
    </td>

    <td valign=top>Jefferson
    </td>

    <td valign=top>70114
    </td>

  </tr>

  <tr>

    <td valign=top>9371
    </td>

    <td valign=top><a href=lic_details.asp?lic_number=9371>Career RegEx Center</a>
    </td>

    <td valign=top>Jefferson
    </td>

    <td valign=top>70113
    </td>

  </tr>"""

Hammer:

First remove the newlines.
Then remove runs of two or more whitespace characters.
Then insert a newline between each </tr><tr> pair to restore one
logical row per line. For more information, see: http://www.qc4blog.com/?p=55

>>> import re
>>> s = re.sub(r'\n','', html)
>>> s = re.sub(r'\s{2,}', '', s)
>>> s = re.sub('(</tr>)(<tr>)', r'\1\n\2', s)
>>> print s
<tr><td valign=top>14313</td><td valign=top><a href=lic_details.asp?lic_number=14313>Python Hammer Institute #2</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>
<tr><td valign=top>8583</td><td valign=top><a href=lic_details.asp?lic_number=8583>New Screwdriver Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>
<tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career RegEx Center</a></td><td valign=top>Jefferson</td><td valign=top>70113</td></tr>
>>> p = re.compile(r"(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)(<td valign=top>)(<a href=lic_details\.asp)(\?lic_number=\d+)(>)(?P<zname>[\s\S\WA-Za-z0-9]*?)(</a>)(</td>)(?:<td valign=top>)(?P<zparish>[\s\WA-Za-z]+)(</td>)(<td valign=top>)(?P<zzip>\d+)(</td>)(</tr>)$", re.M)
>>> n = p.sub(r'LICENSE:\g<zlicense>|NAME:\g<zname>|PARISH:\g<zparish>|ZIP:\g<zzip>', s)
>>> print n
LICENSE:14313|NAME:Python Hammer Institute #2|PARISH:Jefferson|ZIP:70114
LICENSE:8583|NAME:New Screwdriver Technical Academy, Inc #4|PARISH:Jefferson|ZIP:70114
LICENSE:9371|NAME:Career RegEx Center|PARISH:Jefferson|ZIP:70113
>>>

The solution was to escape the period in the ".asp" string, i.e.,
"\.asp", so that it matches a literal dot. I also had to limit the
pattern in the <zname> group by adding a "?" after the "*" quantifier,
which makes it non-greedy so that it stops at the first "</a>".
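
To make the greedy versus non-greedy difference concrete, here is a
minimal sketch. The shortened hrefs ("x", "y") and the variable names
are made up for the illustration; only the school names come from the
data above.

```python
import re

# Two anchors on one line, the situation from the original question.
row = ("<a href=x>New Screwdriver Technical Academy, Inc #4</a></td>"
       "<a href=y>Career RegEx Center</a></td>")

# Greedy "*" grabs as much as it can and backs up only to the LAST "</a>".
greedy = re.search(r"<a href=\w+>(?P<zname>[\s\S]*)</a>", row)

# Non-greedy "*?" grabs as little as it can and stops at the FIRST "</a>".
lazy = re.search(r"<a href=\w+>(?P<zname>[\s\S]*?)</a>", row)

print(greedy.group("zname"))  # runs on into the second anchor
print(lazy.group("zname"))    # just the first school name

# An unescaped "." matches any character; "\." matches only a literal dot.
loose = re.search(r"lic_details.asp", "lic_detailsXasp")
strict = re.search(r"lic_details\.asp", "lic_detailsXasp")
print(loose is not None, strict is not None)
```

The unescaped period still matches real ".asp" text, so escaping it
makes the pattern stricter rather than changing what it finds in this
data. The non-greedy quantifier is what keeps <zname> from swallowing
the rest of the row.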

Now, who would like to turn that re.compile pattern into a multi-line
(verbose) expression, combining the re.M and re.X flags?

Documentation says that one should be able to use the bitwise OR
operator (e.g., re.M | re.X), but I sure couldn't get it to work.
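
A possible sketch of the combined flags follows. The failed attempt is
not shown here, so the cause is a guess: under re.X, unescaped
whitespace in the pattern is ignored, so the literal space in
"<td valign=top>" has to be written as "[ ]" (or "\ "), and an
unescaped "#" starts a comment. The pattern below is a simplified
variant of the one above, not the same expression: the unnamed groups
are dropped, <zname> uses ".*?" and <zparish> uses "[^<]+".

```python
import re

# Two cleaned-up rows, one per line, as produced by the substitutions above.
s = ("<tr><td valign=top>8583</td><td valign=top>"
     "<a href=lic_details.asp?lic_number=8583>"
     "New Screwdriver Technical Academy, Inc #4</a></td>"
     "<td valign=top>Jefferson</td><td valign=top>70114</td></tr>\n"
     "<tr><td valign=top>9371</td><td valign=top>"
     "<a href=lic_details.asp?lic_number=9371>Career RegEx Center</a></td>"
     "<td valign=top>Jefferson</td><td valign=top>70113</td></tr>")

# re.M lets ^ and $ match at each line; re.X allows layout and comments.
# Literal spaces are written as [ ] because re.X drops bare whitespace.
p = re.compile(r"""
    ^<tr><td[ ]valign=top>(?P<zlicense>\d+)</td>     # license number
    <td[ ]valign=top><a[ ]href=lic_details\.asp
        \?lic_number=\d+>(?P<zname>.*?)</a></td>     # school name, non-greedy
    <td[ ]valign=top>(?P<zparish>[^<]+)</td>         # parish
    <td[ ]valign=top>(?P<zzip>\d+)</td></tr>$        # ZIP code
    """, re.M | re.X)

n = p.sub(r"LICENSE:\g<zlicense>|NAME:\g<zname>|PARISH:\g<zparish>|ZIP:\g<zzip>", s)
print(n)
```

The "#" in "Inc #4" is harmless because it is in the text being
searched, not in the pattern. Only a "#" inside the pattern itself
would need escaping under re.X.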

Sometimes a hammer actually is the right tool if you hit the screw
long and hard enough.

I think I'll try to hit some more screws with my new hammer.

Good day.

On Oct 2, 12:10 am, "504cr... at gmail.com" <504cr... at gmail.com> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.
>
> No -- sorry -- I don't want to use BeautifulSoup (though I have for
> other projects). Humor me, please -- I'd really like to see if this
> can be done with just regular expressions.
>
> Note that the output is referenced using named groups.
>
> My challenge is successfully matching the HTML tags in between the
> first table row, and the second table row.
>
> I'd appreciate any suggestions to improve the approach.
>
> rText = "<tr><td valign=top>8583</td><td valign=top><a
> href=lic_details.asp?lic_number=8583>New Horizon Technical Academy,
> Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></
> tr><tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?
> lic_number=9371>Career Learning Center</a></td><td
> valign=top>Jefferson</td><td valign=top>70113</td></tr>"
>
> rText = re.compile(r'(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)(<td
> valign=top>)(<a href=lic_details.asp)(\?lic_number=\d+)(>)(?P<zname>[A-
> Za-z0-9#\s\S\W]+)(</.*?>).+$').sub(r'LICENSE:\g<zlicense>|NAME:
> \g<zname>\n', rText)
>
> print rText
>
> LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4</a></td><td
> valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td
> valign=top>9371</td><td valign=top><a href=lic_details.asp?
> lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113