Extracting data from HTML

Mon Jun 3 02:32:35 EDT 2002

On Sun, 2002-06-02 at 22:41, Kragen Sitaker wrote:
> Geoff Gerrietts <geoff at gerrietts.net> writes:
> > Both techniques are worth knowing -- but better than either would be
> > finding a way to get the information you're after via XML-RPC or some
> > other protocol that's designed to carry data rather than rendering
> > instructions.
> 
> You seem to imply that XML-RPC is better suited to carrying data
> rather than rendering instructions than HTTP is.  I disagree with this
> implication, and I adduce the following evidence:
> - the thousands of RSS feeds (see www.syndic8.com) using HTTP
> - people downloading Python via HTTP
> - the fact that XML-RPC runs over HTTP

I think you are misinterpreting Geoff's response, and you seem to have a
chip on your shoulder about it.  He compared XML-RPC not to HTTP but to
HTML (at least, that's clearly implied, since this thread is about HTML
parsing).  HTML is a poor way to exchange machine-readable information:
the data is buried among layout-related tags that usually matter only
to human readers.

An alternative interface meant for programmatic parsing is clearly
easier and more robust to deal with when fetching data, whether that is
XML-RPC or plain HTTP with normal GET/POST variables and an XML
response.  Even a CSV response, a newline-delimited list, or what have
you would be easier to parse than nearly any HTML you'll find out there.
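
To make that concrete, here is a rough sketch in Python of pulling the
same two records out of a CSV body and out of an HTML table.  The
payloads and the names (CSV_BODY, HTML_BODY, TableRows, rows_from_csv,
rows_from_html) are made up for illustration and assume a current
Python 3 standard library; real-world HTML is rarely this tidy.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical payloads carrying the same two records.
CSV_BODY = "name,price\nspam,1.50\neggs,2.25\n"
HTML_BODY = """
<table border="1"><tr><th><b>name</b></th><th><b>price</b></th></tr>
<tr><td><font color="red">spam</font></td><td>1.50</td></tr>
<tr><td>eggs</td><td align="right">2.25</td></tr></table>
"""


def rows_from_csv(body):
    # The data-oriented format needs one call to a stock parser.
    return list(csv.reader(io.StringIO(body)))


class TableRows(HTMLParser):
    """Collect cell text per table row, ignoring presentational tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell with no enclosing <tr>
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def rows_from_html(body):
    parser = TableRows()
    parser.feed(body)
    parser.close()
    return parser.rows
```

Both functions give the same rows for these inputs, but the HTML
version has to know which tags carry data and which are decoration, and
it breaks as soon as the page layout changes.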

  Ian