HTML Code - Line Number

Tim Roberts timr at probo.com
Sat Apr 28 01:59:31 EDT 2012


SMac2347 at comcast.net wrote:
>
>For scraping purposes, I am having a bit of trouble writing a block
>of code to define, and find, the relative position (line number) of a
>string of HTML code. I can pull out one string that I want, and then
>there is always a line of code, directly beneath the one I can pull
>out, that begins with the following:
><td align="left" valign="top" class="body_cols_middle">
>
>However, because this string of HTML code above is not unique to just
>the information I need (which I cannot currently pull out), I was
>hoping there is a way to effectively say "if you find the html string
>_____ in the line of HTML code above, and the string <td align="left"
>valign="top" class="body_cols_middle"> in the line immediately
>following, then pull everything that follows this second string."

Regular expression-based screen scraping is extremely delicate.  All it
takes is one tweak to the HTML, and your scraping fails even though the
page continues to look the same.

A much better plan is to use sgmllib to write yourself a mini HTML parser.
You can handle "td" tags with the attributes you want, and count down until
you get to the "td" tag you want.
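
Here is a minimal sketch of that counting approach.  One swap to be
aware of: sgmllib only exists in Python 2 (it was removed in Python 3),
so this uses html.parser.HTMLParser from the standard library, which
has the same handler-method style.  The names CellGrabber,
nth_matching_cell, and SAMPLE are made up for illustration, and the
sample markup is invented; only the "td" attributes come from the
question.  It assumes every "td" on the page is properly closed.

```python
from html.parser import HTMLParser


class CellGrabber(HTMLParser):
    """Collect the text of every <td> whose attributes match WANTED."""

    # Attributes taken from the tag quoted in the question.
    WANTED = {"align": "left", "valign": "top", "class": "body_cols_middle"}

    def __init__(self):
        super().__init__()
        self.cells = []     # text of each matching cell, in page order
        self._depth = 0     # > 0 while we are inside a matching <td>
        self._buf = []      # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        if self._depth:
            # A <td> nested inside a matching cell: track it so its
            # closing tag does not end the outer cell early.
            self._depth += 1
        elif all(dict(attrs).get(k) == v for k, v in self.WANTED.items()):
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def nth_matching_cell(html, n):
    """Return the text of the n-th (0-based) matching <td> in html."""
    parser = CellGrabber()
    parser.feed(html)
    parser.close()
    return parser.cells[n]


# Invented sample page: two rows, each with a label cell followed by a
# cell carrying the attributes from the question.
SAMPLE = """
<table>
<tr><td class="label">Name</td>
<td align="left" valign="top" class="body_cols_middle">Alice</td></tr>
<tr><td class="label">Phone</td>
<td align="left" valign="top" class="body_cols_middle">555-0100</td></tr>
</table>
"""

print(nth_matching_cell(SAMPLE, 1))  # second matching cell
```

Because the parser works on tags rather than on lines, it does not
matter whether the "td" sits on the line directly below the string you
already found or whether the site later reflows its markup.  If the
position of the cell you want varies from page to page, the same class
could be extended to remember the text seen just before each matching
cell and select on that instead of on a fixed count.
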
-- 
Tim Roberts, timr at probo.com
Providenza & Boekelheide, Inc.


