HTML Parsing

Sun Feb 11 03:20:46 EST 2007

On Feb 11, 6:05 pm, Ayaz Ahmed Khan <a... at dev.slash.null> wrote:
> "mtuller" typed:
>
> > I have also tried Beautiful Soup, but had trouble understanding the
> > documentation
>
> As Gabriel has suggested, spend a little more time going through the
> documentation of BeautifulSoup. It is pretty easy to grasp.
>
> I'll give you an example: I want to extract the text between the
> following span tags in a large HTML source file.
>
> <span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability</span>
>
> >>> import re
> >>> from BeautifulSoup import BeautifulSoup
> >>> from urllib2 import urlopen
> >>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/'))
> >>> title = soup.find(name='span', attrs={'class':'title'}, text=re.compile(r'^Linux \w+'))
> >>> title
> u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'
>

One can even use ElementTree, if the HTML is well-formed enough to
parse as XML. See the element_soup.py script below. However, if it is
as ill-formed as the sample (4th "td" element not closed; I've omitted
it below), then the OP would be better off sticking with Beautiful
Soup :-)
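
To make the caveat concrete, here is a minimal sketch (Python 3, standard
library only) of how one might check whether a fragment is well-formed
enough for ElementTree before committing to it. The helper name
is_well_formed and both sample fragments are made up for illustration;
the second one leaves its last "td" unclosed, much like the OP's sample.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if ElementTree's XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Raised for things like mismatched or unclosed tags.
        return False
    return True

# Hypothetical fragments, not taken from the OP's page.
good = '<tr><td headers="col1_1"><span>LETTER</span></td></tr>'
bad = ('<tr><td headers="col1_1"><span>LETTER</span></td>'
       '<td headers="col4_1"></tr>')  # last <td> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

If the check fails, that is the cue to fall back to a forgiving parser
such as Beautiful Soup rather than trying to patch the markup by hand.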

C:\junk>type element_soup.py
from xml.etree import cElementTree as ET
import cStringIO

guff = """
<tr >
<td headers="col1_1"  style="width:21%"   >
<span  class="hpPageText" >LETTER</span></td>
<td headers="col2_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >33,699</span></td>
<td headers="col3_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >1.0</span></td>
</tr>
"""

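# ET.parse() wants a file-like object, so wrap the string.
tree = ET.parse(cStringIO.StringIO(guff))
for elem in tree.getiterator('td'):
    # Each cell is identified by its "headers" attribute.
    key = elem.get('headers')
    # The first child of every <td> should be the <span> holding the text.
    assert elem[0].tag == 'span'
    value = elem[0].text
    print repr(key), repr(value)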

C:\junk>\python25\python element_soup.py
'col1_1' 'LETTER'
'col2_1' '33,699'
'col3_1' '1.0'
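
For anyone running this on Python 3, where cStringIO no longer exists and
both xml.etree.cElementTree and getiterator() were removed in 3.9, a
rough equivalent might look like the sketch below. It is an adaptation
rather than the script above: the function name extract_cells is made
up, ET.fromstring() replaces the StringIO wrapper, and the assert is
swapped for a find() that simply skips cells without a span.

```python
import xml.etree.ElementTree as ET

guff = """
<tr>
<td headers="col1_1" style="width:21%">
<span class="hpPageText">LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right">
<span class="hpPageText">33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right">
<span class="hpPageText">1.0</span></td>
</tr>
"""

def extract_cells(markup):
    """Return (headers, span text) pairs for every <td> in the markup."""
    root = ET.fromstring(markup)  # parses directly from a string
    pairs = []
    for td in root.iter('td'):    # iter() is the replacement for getiterator()
        span = td.find('span')    # direct <span> child, or None if absent
        if span is not None:
            pairs.append((td.get('headers'), span.text))
    return pairs

for key, value in extract_cells(guff):
    print(repr(key), repr(value))
```

Run against the same three cells, this should print the same three
lines as the Python 2.5 session above.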

HTH,
John