web page text extractor

Alex Popescu the.mindstorm.mailinglist at gmail.com
Thu Jul 12 11:52:27 EDT 2007


On Jul 12, 5:24 pm, "Andre Engels" <andreeng... at gmail.com> wrote:
> 2007/7/12, Andre Engels <andreeng... at gmail.com>:
>
> I forgot to include
>
> import urllib2, re
>
> here
>
> > def textonly(url):
> >    # Get the HTML source on url and give only the main text
> >    f = urllib2.urlopen(url)
> >    text = f.read()
> >    r = re.compile('\<[^\<\>]*\>')
> >    newtext = r.sub('',text)
> >    while newtext != text:
> >       text = newtext
> >       newtext = r.sub('',text)
> >    return text
>
> --
> Andre Engels, andreeng... at gmail.com
> ICQ: 6260644  --  Skype: a_engels

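Andre, I think that unfortunately your solution will not ignore inlined
scripting, inlined styling, etc. The regex removes the tags themselves,
but the text between <script>...</script> and <style>...</style> is left
in the output.
On the other hand, I don't think there are many solutions available,
other than the Lynx approach somebody has already suggested.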
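
To make the problem concrete, here is a minimal sketch, not a definitive
fix. It assumes Python 3 and its standard-library html.parser module
(the quoted code uses Python 2's urllib2). The names TextOnlyParser,
strip_tags_regex, visible_text and the sample string are made up for
illustration. It shows the regex approach leaving CSS and JavaScript in
the result, and a parser-based variant that skips data inside <script>
and <style>.

```python
import re
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text data, skipping anything inside <script> or <style>."""

    SKIP_TAGS = {"script", "style"}  # elements whose content is not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep data only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags_regex(html):
    # Same idea as the quoted textonly(): delete everything that looks like a tag.
    return re.sub(r"<[^<>]*>", "", html)


def visible_text(html):
    # Hypothetical helper: parse the markup and return whitespace-normalized text.
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Illustrative input, not taken from any real page.
sample = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)

print(strip_tags_regex(sample))  # still contains the CSS and the JavaScript
print(visible_text(sample))      # only the body text
```

This still does not handle comments, hidden elements or badly broken
markup the way a real browser does, so something like Lynx may remain
the more robust route.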

bests,
./alex
--
.w( the_mindstorm )p.




