Parsing HTML

Wed Feb 14 08:13:05 EST 2007

mtuller wrote:
> Alright. I have tried everything I can find, but am not getting
> anywhere. I have a web page that has data like this:
>
> <tr >
> <td headers="col1_1"  style="width:21%"   >
> <span  class="hpPageText" >LETTER</span></td>
> <td headers="col2_1"  style="width:13%; text-align:right"   >
> <span  class="hpPageText" >33,699</span></td>
> <td headers="col3_1"  style="width:13%; text-align:right"   >
> <span  class="hpPageText" >1.0</span></td>
> <td headers="col4_1"  style="width:13%; text-align:right"   >
> </tr>
>
> What is shown is only a small section.
>
> I want to extract the 33,699 (which is dynamic) and set the value to a
> variable so that I can insert it into a database. I have tried parsing
> the html with pyparsing, and the examples will get it to print all
> instances with span, of which there are a hundred or so when I use:
>
> for srvrtokens in printCount.searchString(printerListHTML):
> 	print srvrtokens
>
> If I set the last line to srvrtokens[3] I get the values, but I don't
> know how to grab a single line and then set that as a variable.
>
> I have also tried Beautiful Soup, but had trouble understanding the
> documentation, and HTMLParser doesn't seem to do what I want. Can
> someone point me to a tutorial or give me some pointers on how to
> parse html where there are multiple lines with the same tags and then
> be able to go to a certain line and grab a value and set a variable's
> value to that?
>
>
> Thanks,
>
> Mike
>
>   
Posted problems rarely provide exhaustive information. It's just not 
possible. I have been taking shots in the dark of late, suggesting a 
stream-editing approach to extracting data from HTML files. The 
mainstream approach is to use a parser (Beautiful Soup or pyparsing).
      Often nothing more is attempted than locating and extracting 
some text, irrespective of page layout. This can sometimes 
be done with a simple regular expression, or with a stream editor if a 
regular expression gets too unwieldy. The advantage of the stream editor 
over a parser is that it doesn't mobilize an arsenal of unneeded 
functionality and therefore tends to be easier, faster and shorter to 
implement. The editor's inability to understand structure isn't a 
shortcoming when structure doesn't matter and can even be an advantage 
in the presence of malformed input that sends a parser on a tough and 
potentially hazardous mission for no purpose at all.
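
As a rough sketch of the "simple regular expression" case, assuming every 
cell looks like the ones in the post (a headers attribute on the td, 
followed by a single span), something like the following might do. It is 
Python 3 with only the standard library, not SE, and the names cell, 
values and count are mine, chosen for illustration.

```python
import re

# Sample cells in the shape shown in the question.
html = '''
<td headers="col1_1"  style="width:21%"   >
<span  class="hpPageText" >LETTER</span></td>
<td headers="col2_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >33,699</span></td>
<td headers="col3_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >1.0</span></td>
'''

# One cell: the headers attribute, the rest of the <td> tag, then the
# text of the <span>. [^>]* and [^<]* cannot run past a tag boundary,
# so a match never leaks from one cell into the next.
cell = re.compile(
    r'headers="(col\d+_\d+)"[^>]*>\s*<span[^>]*>([^<]*)</span>',
    re.IGNORECASE,
)

# Map each headers value to the text of its span.
values = {name: text.strip() for name, text in cell.findall(html)}

# Pick the wanted cell and turn '33,699' into an integer for the database.
count = int(values['col2_1'].replace(',', ''))
print(values)
print(count)
```

Keying the result by the headers value means the wanted number can be 
picked by name rather than by its position among a hundred or so spans.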
      SE, the stream editor used below, doesn't require studying 
massive documentation or memorizing dozens of classes, methods and 
whatnot. The following few lines would solve the OP's problem (provided 
the post really is all there is to the problem):

 >>> import re, SE    # http://cheeseshop.python.org/pypi/SE/2.3

 >>> Filter = SE.SE ('<EAT> "~(?i)col[0-9]_[0-9](.|\n)*?/td>~==SOME SPLIT MARK"')

 >>> # '.' is included in the number class so that 1.0 isn't cut down to 0
 >>> r = re.compile ('(?i)(col[0-9]_[0-9])(.|\n)*?([0-9,.]+)</span')

 >>> for line in Filter (s).split ('SOME SPLIT MARK'):
      m = r.search (line)
      if m:    # skip pieces where nothing matches, e.g. the LETTER cell
          print (m.group (1, 3))

('col2_1', '33,699')
('col3_1', '1.0')
('col5_1', '7,428')

-----------------------------------------------------------------------

Input:

 >>> s = '''
<td headers="col1_1"  style="width:21%"   >
<span  class="hpPageText" >LETTER</span></td>
<td headers="col2_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >33,699</span></td>
<td headers="col3_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >1.0</span></td>
<td headers="col5_1"  style="width:13%; text-align:right"   >
<span  class="hppagetext" >7,428</span></td>
</tr>'''

The SE object handles file input too:

 >>> for line in Filter ('file_name', '').split ('SOME SPLIT MARK'):  # '' commands string output
      m = r.search (line)
      if m:    # skip pieces where nothing matches
          print (m.group (1, 3))
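
For comparison, the parser route mentioned at the top can stay fairly 
small too. The sketch below uses html.parser from the Python 3 standard 
library rather than Beautiful Soup or pyparsing. The class name CellText 
and its attributes are hypothetical, and it assumes the wanted text 
always sits inside a td that carries a headers attribute.

```python
from html.parser import HTMLParser

# Same sample as the input shown above.
html = '''
<td headers="col1_1"  style="width:21%"   >
<span  class="hpPageText" >LETTER</span></td>
<td headers="col2_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >33,699</span></td>
<td headers="col3_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >1.0</span></td>
<td headers="col5_1"  style="width:13%; text-align:right"   >
<span  class="hppagetext" >7,428</span></td>
</tr>'''


class CellText(HTMLParser):
    """Collect the text inside each <td>, keyed by its headers attribute."""

    def __init__(self):
        super().__init__()
        self.current = None   # headers value of the <td> we are inside
        self.cells = {}       # headers value -> accumulated text

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.current = dict(attrs).get('headers')
            if self.current is not None:
                self.cells[self.current] = ''

    def handle_endtag(self, tag):
        if tag == 'td':
            self.current = None

    def handle_data(self, data):
        # Text outside a tracked <td> is ignored.
        if self.current is not None:
            self.cells[self.current] += data.strip()


parser = CellText()
parser.feed(html)
parser.close()
print(parser.cells['col2_1'])
```

This is longer than the stream-editing version, which is rather the 
point made above, but it does not depend on the exact spacing or 
attribute order inside the tags.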