HTML Parsing

Sun Feb 11 02:05:45 EST 2007

"mtuller" typed:

> I have also tried Beautiful Soup, but had trouble understanding the
> documentation

As Gabriel suggested, spend a little more time going through the
BeautifulSoup documentation. It is pretty easy to grasp.

I'll give you an example: say I want to extract the text inside the
following span tag from a large HTML page.

<span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability</span>

>>> import re
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
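>>> # Fetch and parse the page (the URL is a placeholder).
>>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/'))
>>> # Note: when text= is given, BeautifulSoup 3 searches the strings and
>>> # ignores name and attrs, so this returns the matching string, not the tag.
>>> title = soup.find(name='span', attrs={'class':'title'}, text=re.compile(r'^Linux \w+'))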
>>> title
u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'
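
If you want to try the same idea without fetching anything or installing
BeautifulSoup, here is a rough sketch using only the standard library's
html.parser (Python 3 rather than the Python 2 session above). The class
name SpanTitleParser, the sample markup and the second span are my own
inventions for illustration, and the sketch assumes the title spans are
not nested inside each other.

```python
import re
from html.parser import HTMLParser


class SpanTitleParser(HTMLParser):
    """Collect the text of every <span class="title"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False   # True while inside a matching span
        self._chunks = []        # text pieces of the span being read
        self.titles = []         # text of each finished span

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "title" in classes:
            self._in_title = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._in_title:
            self.titles.append("".join(self._chunks).strip())
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)


# Sample markup made up for this sketch; the first span is the one from the post.
html = '''<html><body>
<span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability</span>
<span class="title">Some Other Advisory</span>
</body></html>'''

parser = SpanTitleParser()
parser.feed(html)
parser.close()

# Keep only the titles matching the same regular expression as above.
pattern = re.compile(r'^Linux \w+')
linux_titles = [t for t in parser.titles if pattern.search(t)]
print(linux_titles[0])
```

Unlike the find() call above, this filters on both the span's class and
the regular expression, at the cost of quite a bit more code.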

-- 
Ayaz Ahmed Khan

A witty saying proves nothing, but saying something pointless gets
people's attention.