[Web-SIG] Extracting web data

Aaron Watters arw1961 at yahoo.com
Tue Feb 22 01:52:06 CET 2011


BeautifulSoup is the standard response.
I think lxml will not work very well unless the
html is extremely nicely formatted, but I could
be wrong.
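
To illustrate what parser-based extraction looks like, here is a minimal
sketch. BeautifulSoup and lxml are third-party packages, so this uses the
standard library's html.parser as a stand-in; it is not the BeautifulSoup or
lxml API. It is written for Python 3 (on 2.6 the module is called HTMLParser).
The CellCollector class and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text found inside every <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # A new cell starts: open a fresh text buffer for it.
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside a cell.
        if self.in_cell:
            self.cells[-1] += data


# Hypothetical, slightly untidy markup: unquoted attribute, missing </tr>.
MESSY_HTML = (
    "<table><tr><td class=label>Volume:</td><td>1,234,567</td>"
    "<tr><td>Last Trade:</td><td> 27.15 </td></table>"
)

parser = CellCollector()
parser.feed(MESSY_HTML)
parser.close()
cells = [cell.strip() for cell in parser.cells]
```

Here cells ends up as a flat list of cell texts in document order. A real page
would need more care, for example nested tables or cells that hold other tags.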

For what you describe I would suggest developing
seat-of-the-pants heuristics -- just get the page
using httplib and then use string.find liberally.
I've had at least three consulting gigs solving
this problem using various techniques, and the general
problem is quite difficult, but if you are trying to
parse just a few pages in simple ways developing
special purpose heuristics is pretty easy (until they
redesign the pages, which they will do every so often).
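
The fetch-then-find approach described above might look roughly like the
sketch below. It assumes Python 3, where httplib is named http.client. The
helper names (fetch_page, extract_between, extract_all) and the sample markup
are hypothetical, and the markers would have to be chosen by looking at the
real page source.

```python
import http.client  # this module is called httplib in Python 2


def fetch_page(host, path):
    """Fetch one page over HTTPS and return its body as text."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.read().decode("utf-8", errors="replace")
    finally:
        conn.close()


def extract_between(page, start_marker, end_marker, start_at=0):
    """Return (text, next_position) for the text between two markers.

    Returns (None, -1) when either marker cannot be found.
    """
    begin = page.find(start_marker, start_at)
    if begin == -1:
        return None, -1
    begin += len(start_marker)
    end = page.find(end_marker, begin)
    if end == -1:
        return None, -1
    return page[begin:end].strip(), end + len(end_marker)


def extract_all(page, start_marker, end_marker):
    """Collect every piece of text enclosed by the two markers."""
    found = []
    position = 0
    while True:
        text, position = extract_between(page, start_marker, end_marker, position)
        if text is None:
            break
        found.append(text)
    return found


# Hypothetical markup, invented for illustration; a real page will differ.
SAMPLE_PAGE = """
<table>
<tr><td class="label">Last Trade:</td><td class="value"> 27.15 </td></tr>
<tr><td class="label">Volume:</td><td class="value">1,234,567</td></tr>
</table>
"""

labels = extract_all(SAMPLE_PAGE, '<td class="label">', "</td>")
values = extract_all(SAMPLE_PAGE, '<td class="value">', "</td>")
quote = dict(zip(labels, values))
```

fetch_page is not called here so that the sketch runs without network access;
in practice its return value would take the place of SAMPLE_PAGE. The markers
are tied to the exact page layout, which is why this kind of code breaks when
the page is redesigned.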

Best of luck, -- Aaron Watters

btw: If you have lots of money to spend on this
  my former client connotate.com does this sort
  of scraping (and I developed some of the code).

--- On Mon, 2/21/11, James Mills <prologic at shortcircuit.net.au> wrote:

From: James Mills <prologic at shortcircuit.net.au>
Subject: Re: [Web-SIG] Extracting web data
To: "web-sig" <web-sig at python.org>
Date: Monday, February 21, 2011, 7:07 PM

On Mon, Feb 21, 2011 at 2:21 PM, Deb Midya <debmidya at yahoo.com> wrote:


Hi Python web-sig users,
 
Thanks in advance and I am new to web-sig.
 
I am using Python 2.6 on Windows XP.
 
May I ask for your assistance with the following, please?
 
I would like to extract web data from a site (http://finance.yahoo.com, for example).
 
The data may include Historical Prices, Key Statistics, News & Info, Headlines, etc. for a list of codes (such as WOW, .... these are codes for company IDs).
 
I am trying to automate the extraction of data.
 
Is there any Python module or any assistance please?
 
Once again, thank you very much for the time you have given.

You might want to look into using either the lxml or BeautifulSoup modules.


cheers
James
-- 
-- James Mills
--
-- "Problems are solved by method"

