extract news article from web

Gabriel Cosentino de Barros aut_gbarros at uolinc.com
Thu Dec 23 12:25:02 EST 2004


Later versions of Excel (in Microsoft Office) have a "web query" feature.

(sorry about top posting)

-----Original Message-----
From: Steve Holden [mailto:steve at holdenweb.com]
Sent: Thursday, December 23, 2004 12:59
To: python-list at python.org
Subject: Re: extract news article from web


Zhang Le wrote:

> Thanks for the hint. The xml-rpc service is great, but I want some
> general techniques to parse news information in the usual html pages.
> 
> Currently I'm looking at a script-based approach found at:
> http://www.namo.com/products/handstory/manual/hsceditor/
> Users can write a simple template to extract certain fields from a
> web page. Unfortunately, it is not open source, so I cannot look
> inside the black box. :-(
> 
> Zhang Le
> 
That's a very large topic, and not one that I could claim to be expert 
on, so let's hope that others will pitch in with their favorite 
techniques. Otherwise it's down to providing individual parsers for each 
service you want to scan, and maintaining the parsers as each group of 
designers modifies their pages.
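
To make the "one parser per service" idea concrete, here is a minimal sketch, assuming each site can be described by a small table of regular expressions. The host names, field names, and patterns below are all hypothetical, and regexes are only one possible way to write such templates; they break easily when a site is redesigned, which is exactly the maintenance cost described above.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site rules: hostname -> {field name: pattern with one group}.
# Every site needs its own entry, and entries must be updated on redesigns.
SITE_RULES = {
    "news.example.com": {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S | re.I),
        "body": re.compile(r'<div class="story">(.*?)</div>', re.S | re.I),
    },
    "daily.example.org": {
        "title": re.compile(r"<title>(.*?)</title>", re.S | re.I),
        "body": re.compile(r'<td id="article">(.*?)</td>', re.S | re.I),
    },
}


def extract_article(url, html):
    """Pick the rule set for the URL's host and apply each field pattern."""
    host = urlparse(url).hostname
    rules = SITE_RULES.get(host)
    if rules is None:
        raise KeyError(f"no parser registered for {host!r}")
    fields = {}
    for name, pattern in rules.items():
        match = pattern.search(html)
        # A field whose pattern no longer matches is reported as None.
        fields[name] = match.group(1).strip() if match else None
    return fields
```

A field that comes back as None is a cheap signal that the site changed its layout and the corresponding rule needs attention.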

You might want to look at BeautifulSoup, which is a module for extracting 
stuff from (possibly) irregularly-formed HTML.
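
As a rough illustration of what "extracting stuff from irregularly-formed HTML" involves, here is a small sketch that uses the standard library's html.parser instead of BeautifulSoup, so it does not show BeautifulSoup's own API. The class and function names are made up for this example, and it assumes the article text lives in <p> elements, which will not hold for every site.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element, tolerating unclosed tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = None  # None while we are outside any <p>

    def _flush(self):
        # Store the paragraph collected so far, with whitespace normalized.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes an unclosed one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        super().close()  # lets the parser emit any pending trailing text
        self._flush()    # keep a final paragraph that was never closed


def extract_paragraphs(html):
    """Return the paragraph texts found in a possibly malformed page."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

For example, extract_paragraphs("<p>First<p>Second</p>") treats the second <p> as the end of the first, so both paragraphs are returned even though the first was never closed.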

regards
  Steve
-- 
Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119