web page text extractor

Thu Jul 12 09:48:06 EDT 2007

Hello jk,

> For a project, I need to develop a corpus of online news stories.  I'm
> looking for an application that, given the url of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
> direction?

Going simple :)

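    import subprocess
    from sys import argv

    OUTFILE = "geturl.txt"

    # Dump the rendered text of the page (not its HTML source) to OUTFILE.
    # Passing an argument list avoids the shell mangling URLs containing "&".
    with open(OUTFILE, "w") as out:
        subprocess.call(["lynx", "-dump", argv[1]], stdout=out)
    # Open the result in Notepad (Windows only) without waiting for it.
    subprocess.Popen(["notepad", OUTFILE])

(You can find lynx at http://lynx.browser.org/)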

Note that removing sidebars is a very difficult problem.
Search for "wrapper induction" to see some work on the subject.
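
If installing lynx is not an option, here is a rough sketch that uses only the standard library (Python 3's html.parser). It is a crude heuristic, not wrapper induction: SKIP_TAGS, TextExtractor and extract_text are names I made up for illustration, and it only drops text inside tags such as <nav> and <aside>, so sidebars built from plain <div>s will still come through.

```python
from html.parser import HTMLParser

# Tags whose contents are dropped. This list is an assumption made for
# illustration; real pages often mark sidebars with plain <div>s instead.
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Feed it the decoded HTML of a page (for example, what urllib.request.urlopen returns after reading and decoding) and write the result to the output file in place of the lynx dump. Expect to tune SKIP_TAGS for each news site.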

HTH,
--
Miki <miki.tebeka at gmail.com>
http://pythonwise.blogspot.com