web page text extractor

rdahlstrom roger.dahlstrom at gmail.com
Fri Jul 13 12:05:24 EDT 2007


To preserve paragraph breaks, replace any <p> or <br> tags with your
operating system's line ending (CRLF on Windows) before stripping the
rest of the tags.
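
A rough sketch of that idea, using only the standard library. The helper name strip_tags_keep_paragraphs and the regular expressions are my own illustration, not taken from pyparsing's htmlStripper example, and the tag matching is deliberately naive (it does not skip script or style contents, comments, or a ">" inside an attribute value):

```python
import html
import re

# Hypothetical helper for illustration; not part of pyparsing or the thread.
# Matches opening/closing <p> and <br> tags (including <br/>), but not <pre> or <b>.
_BREAK_TAG = re.compile(r"<\s*/?\s*(?:p|br)\b[^>]*>", re.IGNORECASE)
# Naive "anything between angle brackets" pattern for the remaining tags.
_ANY_TAG = re.compile(r"<[^>]+>")


def strip_tags_keep_paragraphs(markup, newline="\n"):
    """Return the text of `markup` with <p>/<br> turned into line breaks."""
    # Convert paragraph and line-break tags first, so the breaks survive
    # the generic tag removal that follows.
    text = _BREAK_TAG.sub(newline, markup)
    # Remove every remaining tag.
    text = _ANY_TAG.sub("", text)
    # Decode character entities such as &amp;.
    text = html.unescape(text)
    # Trim each line and drop the empty ones left behind by adjacent tags.
    lines = [line.strip() for line in text.split(newline)]
    return newline.join(line for line in lines if line)


if __name__ == "__main__":
    sample = "<p>First <b>story</b>.</p><p>Second &amp; last.<br/>Same para.</p>"
    print(strip_tags_keep_paragraphs(sample))
```

The default newline is "\n" because Python translates it to the platform's line ending when the result is written to a file opened in text mode; pass newline="\r\n" if you need CRLF explicitly.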

On Jul 13, 8:57 am, kublai <restyc... at gmail.com> wrote:
> On Jul 13, 5:44 pm, Paul McGuire <pt... at austin.rr.com> wrote:
>
>
>
> > On Jul 12, 4:42 am, kublai <restyc... at gmail.com> wrote:
>
> > > Hello,
>
> > > For a project, I need to develop a corpus of online news stories.  I'm
> > > looking for an application that, given the url of a web page, "copies"
> > > the rendered text of the web page (not the source HTML text), opens a
> > > text editor (Notepad), and displays the copied text for the user to
> > > examine and save into a text file. Graphics and sidebars to be
> > > ignored. The examples I have come across are much too complex for me
> > > to customize for this simple job. Can anyone lead me to the right
> > > direction?
>
> > > Thanks,
> > > gk
>
> > One of the examples provided with pyparsing is an HTML stripper - view
> > it online at http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.
>
> > -- Paul
>
> Stripping tags is indeed one strategy that came to mind. I'm wondering
> how much information (for example, paragraphing) would be lost, and if
> what would be lost would be acceptable (to the project). I looked at
> pyparsing and I see that it's got a lot of text processing
> capabilities that I can use along the way. I sure will try it. Thanks
> for the post.
>
> Best,
> gk




