html parsing? Or just simple regex'ing?

Tue Nov 9 22:34:44 EST 2004

On Wed, 10 Nov 2004 01:37:46 GMT, Dan Stromberg <strombrg at dcs.nac.uci.edu> wrote:
>
> I'm working on writing a program that will synchronize one database with
> another.  For the source database, we can just use the python sybase API;
> that's nice and normal.
> 
> For the target database, unfortunately, the only interface we have access
> to, is http+html based.  There's some javascript involved too, but
> hopefully we won't have to interact with that.
> 
> So, I've got Basic AUTH going with http, but now I'm faced with the
> following questions, due to the fact that I need to pull some lists out of
> HTML, and then make some changes via POST or so, again over HTTP:
> 
> 1) Would I be better off just regex'ing the html I'm getting back?  (I
> suppose this depends on the complexity of the html received, eh?)

  Unlikely.  Regular expressions alone cannot parse HTML reliably.  If they could, we wouldn't need HTML parsers :)
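
  As a small illustration of where a regex goes wrong, here is a sketch using a made-up snippet of nested lists (the markup and the pattern are assumptions for the example, not anything from the actual target site):

```python
import re

# Hypothetical markup: a list item that itself contains a nested list.
html = '<ul><li>a<ul><li>b</li></ul></li><li>c</li></ul>'

# A typical first attempt: grab whatever sits between <li> and the next </li>.
items = re.findall(r'<li>(.*?)</li>', html)

# The first match stops at the inner </li>, so it swallows the nested
# opening tags and never sees where the outer item really ends.
print(items)  # ['a<ul><li>b', 'c']
```

  The pattern has no notion of nesting, so it pairs the outer opening tag with the inner closing tag.  For very flat, predictable markup a regex may still be good enough.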

> 
> 2) Would I be better off feeding the HTML into an HTML parser, and then
> traversing that datastructure (is that really how it works?)?

  Very probably.  There are a few in the standard library, and a whole lot of third party modules with various APIs, each with a particular kind of use in mind.
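
  A minimal sketch of the event-driven style, assuming Python 3, where the standard library parser lives in html.parser (at the time of this message the module was named HTMLParser).  The class name ListItemCollector and the helper list_items are hypothetical, and the sample markup is invented:

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text inside each outermost <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <li> elements we are currently inside
        self.items = []  # accumulated text, one entry per outermost <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.depth += 1
            if self.depth == 1:
                self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text is only kept while we are inside some <li>.
        if self.depth > 0:
            self.items[-1] += data


def list_items(html):
    """Return the stripped text of each outermost <li> in the given HTML."""
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


print(list_items('<ul><li>alpha</li><li class="x">beta <b>bold</b></li></ul>'))
```

  Note that this parser does not build a data structure for you; it calls your methods as it meets tags and text, and you keep whatever state you need.  Several third party modules build a tree you can traverse instead.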

> 
> 3) When I retrieve stuff over http, it's clear that the web server is
> sending some kind of odd gibberish, which the python urllib2 API is
> passing on to me.  In a packet trace, it looks like:

  Note that the Transfer-Encoding is "chunked".  "ef1" is a hex-encoded chunk length prefix.  One of these appears before each group of bytes sent to the client, and tells you how much application-level data to expect in that chunk.  Not having used urllib2 extensively, I can't say whether it is normal for you to be getting the chunk lengths from it or not, but it would _seem_ like something you should not be receiving (that is, if I had written urllib2, you would not receive them ;).
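
  To make the framing concrete, here is a rough sketch of undoing the chunked encoding by hand.  The function name decode_chunked is hypothetical, the input is assumed to be a complete and well-formed chunked body, and trailer headers after the last chunk are ignored:

```python
def decode_chunked(raw: bytes) -> bytes:
    """Strip chunk-length prefixes from a complete chunked HTTP body (sketch)."""
    body = b""
    pos = 0
    while True:
        # Each chunk starts with a line holding its size in hexadecimal,
        # optionally followed by ";extension" data that we discard.
        eol = raw.index(b"\r\n", pos)
        size_field = raw[pos:eol].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            break  # a zero-length chunk marks the end of the body
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and the CRLF that follows it
    return body


print(int("ef1", 16))  # 3825, the byte count announced by the "ef1" prefix
print(decode_chunked(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"))  # b'hello world'
```

  This is only meant to show what the prefixes mean; in practice you would expect the HTTP library to do this decoding before handing you the data.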

  Jp