web page text extractor
Alex Popescu
the.mindstorm.mailinglist at gmail.com
Thu Jul 12 11:52:27 EDT 2007
On Jul 12, 5:24 pm, "Andre Engels" <andreeng... at gmail.com> wrote:
> 2007/7/12, Andre Engels <andreeng... at gmail.com>:
>
> I forgot to include
>
> import urllib2, re
>
> here
>
> > def textonly(url):
> >     # Get the HTML source on url and give only the main text
> >     f = urllib2.urlopen(url)
> >     text = f.read()
> >     r = re.compile('\<[^\<\>]*\>')
> >     newtext = r.sub('',text)
> >     while newtext != text:
> >         text = newtext
> >         newtext = r.sub('',text)
> >     return text
>
> --
> Andre Engels, andreeng... at gmail.com
> ICQ: 6260644 -- Skype: a_engels
Andre, unfortunately I think your solution will not ignore inline
scripts, inline styles, etc.: the regex removes the tags themselves but
keeps whatever sits between <script>...</script> and <style>...</style>.
On the other hand, I don't think there are many solutions available,
other than the Lynx approach somebody has already suggested.
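
For illustration only, here is a minimal sketch of one way to skip inline
script and style content without shelling out to Lynx. It assumes a modern
Python 3 interpreter and its standard html.parser module (in the Python 2
of this thread the module was named HTMLParser). The names TextOnlyParser
and strip_markup are made up for this example, and the function takes an
HTML string rather than a URL, so fetching the page is left out.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    # Hypothetical choice for this sketch: the elements whose content is dropped.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # text fragments kept so far
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_markup(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    page = (
        "<html><head><style>p { color: red; }</style>"
        "<script>var x = 1;</script></head>"
        "<body><p>Hello <b>world</b></p></body></html>"
    )
    print(strip_markup(page))
```

This still does nothing about comments, hidden elements, or navigation
clutter, so it is not a replacement for a real text-mode browser.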
bests,
./alex
--
.w( the_mindstorm )p.