web page text extractor

Fri Jul 13 08:57:38 EDT 2007

On Jul 13, 5:44 pm, Paul McGuire <pt... at austin.rr.com> wrote:
> On Jul 12, 4:42 am, kublai <restyc... at gmail.com> wrote:
>
> > Hello,
>
> > For a project, I need to develop a corpus of online news stories.  I'm
> > looking for an application that, given the url of a web page, "copies"
> > the rendered text of the web page (not the source HTML text), opens a
> > text editor (Notepad), and displays the copied text for the user to
> > examine and save into a text file. Graphics and sidebars should be
> > ignored. The examples I have come across are much too complex for me
> > to customize for this simple job. Can anyone point me in the right
> > direction?
>
> > Thanks,
> > gk
>
> One of the examples provided with pyparsing is an HTML stripper - view
> it online at http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.
>
> -- Paul

Stripping tags is indeed one strategy that came to mind. I'm wondering
how much information (paragraph breaks, for example) would be lost, and
whether that loss would be acceptable for the project. I looked at
pyparsing and see that it has a lot of text-processing capabilities
that I can use along the way. I will certainly try it. Thanks for the
post.
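
To make the paragraphing concern concrete, here is a minimal sketch of
tag stripping that keeps paragraph breaks. It is not the pyparsing
htmlStripper example; it swaps in Python's standard html.parser module
instead. The names TextExtractor, extract_text, fetch_html, and the two
tag sets are my own, chosen for illustration, and the choice of which
tags count as paragraph boundaries is an assumption. It makes no attempt
to detect sidebars.

```python
from html.parser import HTMLParser
import urllib.request

# Assumption: content inside these tags is never part of the rendered text.
SKIP_TAGS = {"script", "style", "title"}
# Assumption: these tags mark a paragraph boundary in the output.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text, one string per block-level element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = []
        self._skip = 0

    def flush(self):
        # Join buffered text and collapse all whitespace runs to one space.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip > 0:
                self._skip -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip == 0:
            self._buf.append(data)


def extract_text(html):
    """Return the visible text of an HTML string, paragraphs separated
    by a blank line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any text after the last block tag
    return "\n\n".join(parser.paragraphs)


def fetch_html(url):
    """Download a page and decode it, falling back to UTF-8."""
    with urllib.request.urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling extract_text(fetch_html(url)) and writing the result to a .txt
file would give something that can be opened in Notepad for review.
Inline tags such as <b> do not split a paragraph, so the main loss is
layout (tables, lists, headings all become plain paragraphs).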

Best,
gk