web page text extractor
Miki
miki.tebeka at gmail.com
Thu Jul 12 09:48:06 EDT 2007
Hello jk,
> For a project, I need to develop a corpus of online news stories. I'm
> looking for an application that, given the url of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
> direction?
Going simple :)
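from os import system
from sys import argv
OUTFILE = "geturl.txt"
# Dump the rendered text of the page (not its HTML source) to OUTFILE.
# The URL is quoted so that characters such as "&" are not split by the shell.
system('lynx -dump "%s" > %s' % (argv[1], OUTFILE))
# Open the result in Notepad ("start" is specific to the Windows shell).
system("start notepad %s" % OUTFILE)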
(You can find lynx at http://lynx.browser.org/)
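If installing lynx is not an option, a rough pure-Python fallback is possible. The sketch below is not what lynx does: it only collects the text nodes of the page with the standard library's html.parser and skips the contents of script and style tags, so layout, tables and links are not rendered. The names TextExtractor and html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []      # non-empty pieces of text, in document order
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text nodes of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "\n".join(parser.chunks)
```

To use it in place of the lynx call, the page source could be fetched with urllib.request.urlopen, decoded, passed to html_to_text, and the result written to OUTFILE. Sidebars and navigation text will still be included.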
Note that removing sidebars is a very difficult problem.
Search for "wrapper induction" to see some work on the subject.
HTH,
--
Miki <miki.tebeka at gmail.com>
http://pythonwise.blogspot.com