web page text extractor
Andre Engels
andreengels at gmail.com
Thu Jul 12 10:23:33 EDT 2007
2007/7/12, kublai <restycena at gmail.com>:
> For a project, I need to develop a corpus of online news stories. I'm
> looking for an application that, given the url of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
> direction?
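import re
import urllib2

def textonly(url):
    # Fetch the HTML source at url and return it with the tags stripped out
    f = urllib2.urlopen(url)
    text = f.read()
    # One tag: '<', then any run of non-bracket characters, then '>'
    r = re.compile(r'<[^<>]*>')
    newtext = r.sub('', text)
    # Repeat until nothing changes, so nested brackets are removed as well
    while newtext != text:
        text = newtext
        newtext = r.sub('', text)
    return text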
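
Stripping tags with a regular expression still leaves the contents of
<script> and <style> elements in the output, and a browser does not render
those. As a rough sketch of one way around that, assuming Python 3 and its
standard-library html.parser module, the parser below collects only the
text outside those two elements. The names TextOnlyParser and html_to_text
are made up for this example, and it makes no attempt to drop sidebars or
navigation.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments found so far
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

To use it on a live page, the HTML string could come from
urllib.request.urlopen(url).read(), decoded with whatever encoding the page
declares.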
--
Andre Engels, andreengels at gmail.com
ICQ: 6260644 -- Skype: a_engels