HTML Parsing

Ayaz Ahmed Khan ayaz at dev.slash.null
Sun Feb 11 02:05:45 EST 2007


"mtuller" typed:

> I have also tried Beautiful Soup, but had trouble understanding the
> documentation

As Gabriel has suggested, spend a little more time going through the
documentation of BeautifulSoup. It is pretty easy to grasp.

Here is an example. Suppose I want to extract the text inside the
following span element from a large HTML source file:

<span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability</span>

>>> import re
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/')) 
>>> title = soup.find(name='span', attrs={'class':'title'}, text=re.compile(r'^Linux \w+'))
>>> title
u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'
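
Note that when text= is given, BeautifulSoup 3 searches the strings of
the document rather than the tags, so the name and attrs arguments above
are, as far as I can tell, not actually used to narrow the match. If you
want the class check enforced and would rather not depend on a
third-party package, here is a rough sketch using only the standard
library's html.parser module on Python 3. This is a different technique
from BeautifulSoup, not a drop-in equivalent. The names TitleSpanParser,
find_title and sample are my own inventions for illustration, and the
inline HTML string stands in for the page you would download.

```python
import re
from html.parser import HTMLParser


class TitleSpanParser(HTMLParser):
    """Collect the text of every <span class="title"> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while inside a matching span
        self._chunks = []    # text pieces of the span being read
        self.titles = []     # finished span texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            # A span nested inside a title span: track it so that its
            # closing tag does not end the outer span too early.
            self._depth += 1
        elif "title" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def find_title(html, pattern=r"^Linux \w+"):
    """Return the first title-span text matching pattern, or None."""
    parser = TitleSpanParser()
    parser.feed(html)
    parser.close()
    regex = re.compile(pattern)
    for title in parser.titles:
        if regex.search(title):
            return title
    return None


# Hypothetical sample markup standing in for the downloaded page.
sample = (
    '<html><body><span class="other">Linux news</span>'
    '<span class="title">Linux Kernel Bluetooth CAPI Packet Remote '
    'Buffer Overflow Vulnerability</span></body></html>'
)
print(find_title(sample))
```

The first span in the sample starts with "Linux" but has a different
class, so it is skipped; only the span with class "title" is considered.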

-- 
Ayaz Ahmed Khan

A witty saying proves nothing, but saying something pointless gets
people's attention.


