web page text extractor

Paul McGuire ptmcg at austin.rr.com
Fri Jul 13 05:44:51 EDT 2007


On Jul 12, 4:42 am, kublai <restyc... at gmail.com> wrote:
> Hello,
>
> For a project, I need to develop a corpus of online news stories.  I'm
> looking for an application that, given the url of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars should be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone point me in the right
> direction?
>
> Thanks,
> gk

One of the examples provided with pyparsing is an HTML stripper - view
it online at http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.
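
For a sense of what tag stripping involves, here is a minimal sketch
that uses the standard library's html.parser instead of pyparsing. It
is not the htmlStripper.py example itself; the names TextExtractor and
extract_text are made up for illustration. It only drops tags and the
contents of script and style elements, so sidebars and other page
furniture would still need site-specific handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style."""

    SKIP = {"script", "style"}  # elements whose contents are never shown

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Return the text of an HTML string, one non-empty line per row."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse whitespace inside each line and drop blank lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Fetching the page (for example with urllib.request.urlopen), writing
the result to a file, and opening that file in Notepad are left out
here; they would wrap around extract_text.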

-- Paul



