Extracting text from a Webpage using BeautifulSoup

Tue May 27 08:06:53 EDT 2008

On 27 May, 12:54, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> On Tue, 27 May 2008 03:01:30 -0700, Magnus.Moraberg wrote:
> > I wish to extract all the words on a set of webpages and store them in
> > a large dictionary. I then wish to produce a list with the most common
> > words for the language under consideration. So, my code below reads
> > the page -
>
> > http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm
>
> > a Welsh-language page. I hope to then establish the 1000 most commonly
> > used words in Welsh. The problem I'm having is that
> > soup.findAll(text=True) is returning the likes of -
>
> > u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'
>
> Just extract the text from the body of the document.
>
> body_texts = soup.body(text=True)
>
> > and -
>
> > <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
>
> > Any suggestions how I might overcome this problem?
>
> Ask the BBC to produce HTML that's less buggy.  ;-)
>
> http://validator.w3.org/ reports bugs like "'body' tag not allowed here"
> or closing tags without opening ones and so on.
>
> Ciao,
>         Marc 'BlackJack' Rintsch

Great, thanks!
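
For the word-frequency step, something along these lines might work once the
text fragments have been pulled out of the body. This is only a rough sketch:
it assumes the fragments are plain strings such as those returned by
soup.body(text=True), and the names WORD_RE and most_common_words, as well as
the sample sentences, are made up for illustration. It uses only the standard
library (Python 3) and does not try to filter out JavaScript fragments.

```python
import re
from collections import Counter

# Runs of letters (any alphabet, so accented Welsh letters are kept),
# optionally joined by internal apostrophes as in "mae'r".
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def most_common_words(fragments, n=1000):
    """Count lower-cased words across text fragments; return the top n."""
    counts = Counter()
    for fragment in fragments:
        counts.update(WORD_RE.findall(fragment.lower()))
    return counts.most_common(n)


# Hypothetical sample standing in for the strings extracted from a page.
sample = ["Mae'r tywydd yn braf", "Mae hi yn oer heddiw", "  \n  "]
top = most_common_words(sample, n=2)
print(top)
```

Counts from several pages can be combined by feeding all of their fragments
through the same call, or by adding the Counter objects together.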