web page text extractor

Jon Rosebaugh jon at turnthepage.org
Thu Jul 12 10:22:51 EDT 2007


On 2007-07-12 04:42:25 -0500, kublai <restycena at gmail.com> said:
> For a project, I need to develop a corpus of online news stories.  I'm
> looking for an application that, given the url of a web page, "copies"
> the rendered text of the web page (not the source HTNL text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
> direction?

You may find BeautifulSoup or templatemaker to be of assistance:

http://www.crummy.com/software/BeautifulSoup/
http://www.holovaty.com/blog/archive/2007/07/06/0128




More information about the Python-list mailing list