web page text extractor

Thu Jul 12 10:23:33 EDT 2007

2007/7/12, kublai <restycena at gmail.com>:

> For a project, I need to develop a corpus of online news stories.  I'm
> looking for an application that, given the url of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
> direction?

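import re
import urllib2   # Python 2; in Python 3 this is urllib.request

def textonly(url):
   # Fetch the HTML source at url and return it with all tags stripped
   f = urllib2.urlopen(url)
   text = f.read()
   f.close()
   r = re.compile(r'<[^<>]*>')
   # Repeat until nothing changes, so nested brackets such as "<<b>>" go too
   newtext = r.sub('', text)
   while newtext != text:
      text = newtext
      newtext = r.sub('', text)
   return text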
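
Stripping tags with a regular expression leaves the contents of <script>
and <style> elements in the output and does not decode entities such as
&amp;. As a rough sketch only, the same idea could be written for Python 3
with the standard library's html.parser. The names TextOnlyParser,
html_to_text, fetch_text and show_in_notepad are made up for this example,
the choice of block-level tags is a guess, and launching "notepad" assumes
Windows. It makes no attempt to drop sidebars, which would need
site-specific rules.

```python
import re
import subprocess
from html.parser import HTMLParser
from urllib.request import urlopen


class TextOnlyParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}
    # Assumed set of tags that should start a new line in the output.
    BLOCK_TAGS = {"p", "br", "div", "li", "tr", "title",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html_source):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextOnlyParser()
    parser.feed(html_source)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of spaces and tabs, then drop empty lines.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def fetch_text(url):
    """Download url and return its text content (falls back to UTF-8)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))


def show_in_notepad(text, path="page.txt"):
    """Save text to path and open it in Notepad (Windows only)."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)
    subprocess.Popen(["notepad", path])
```

Something like show_in_notepad(fetch_text(url)) would then cover the
fetch, strip, and review steps in one call; html_to_text can also be tried
on its own with a literal HTML string.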


-- 
Andre Engels, andreengels at gmail.com
ICQ: 6260644  --  Skype: a_engels