Extracting text from a Webpage using BeautifulSoup

Paul McGuire ptmcg at austin.rr.com
Tue May 27 20:26:51 EDT 2008


On May 27, 5:01 am, Magnus.Morab... at gmail.com wrote:
> Hi,
>
> I wish to extract all the words on a set of webpages and store them in
> a large dictionary. I then wish to produce a list of the most common
> words for the language under consideration. So, my code below reads
> the page -
>
> http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm
>
> a Welsh-language page. I hope then to establish the 1000 most commonly
> used words in Welsh. The problem I'm having is that
> soup.findAll(text=True) is returning the likes of -
>
> u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'
>
> and -
>
> <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
>
> Any suggestions how I might overcome this problem?
>
> Thanks,
>
> Barry.
>
> Here's my code -
>
> import urllib
> import urllib2
> from BeautifulSoup import BeautifulSoup
>
> # proxy_support = urllib2.ProxyHandler({"http": "http://999.999.999.999:8080"})
> # opener = urllib2.build_opener(proxy_support)
> # urllib2.install_opener(opener)
>
> page = urllib2.urlopen('http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm')
> soup = BeautifulSoup(page)
>
> pageText = soup.findAll(text=True)
> print pageText

As an alternative datapoint, you can try out the htmlStripper example
on the pyparsing wiki: http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py
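The usual fix for the problem Barry describes is to filter out the
non-visible strings (the doctype declaration, comments, and text inside
<script>/<style> tags) before counting. Here is a minimal sketch of that
approach written against the newer bs4 (BeautifulSoup 4) API rather than
the BeautifulSoup 3 module used above; the short Welsh HTML snippet is
made up purely for illustration:

```python
from collections import Counter
import re

from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype

# Toy stand-in for a fetched page (invented for this example).
html = """<!DOCTYPE html>
<html><head><title>Enghraifft</title>
<script>var x = 1;</script></head>
<body><p>Mae hon yn dudalen Gymraeg. Mae geiriau yma.</p>
<!-- a comment --></body></html>"""

soup = BeautifulSoup(html, "html.parser")

visible = []
for node in soup.find_all(string=True):
    # Skip the doctype and comment nodes -- the very strings the
    # poster saw coming back from findAll(text=True).
    if isinstance(node, (Comment, Doctype)):
        continue
    # Skip text that is never rendered on the page.
    if node.parent.name in ("script", "style", "title"):
        continue
    visible.append(str(node))

# Tokenise crudely and tally frequencies; most_common(n) would give
# the n most frequent words for the corpus.
words = re.findall(r"\w+", " ".join(visible).lower())
counts = Counter(words)
print(counts.most_common(3))
```

To build the 1000-word list, you would run every fetched page through
the same loop, update a single shared Counter, and finish with
counts.most_common(1000).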

-- Paul


