html parsing? Or just simple regex'ing?
exarkun at intarweb.us
Tue Nov 9 22:34:44 EST 2004
On Wed, 10 Nov 2004 01:37:46 GMT, Dan Stromberg <strombrg at dcs.nac.uci.edu> wrote:
>
> I'm working on writing a program that will synchronize one database with
> another. For the source database, we can just use the python sybase API;
> that's nice and normal.
>
> For the target database, unfortunately, the only interface we have access
> to, is http+html based. There's some javascript involved too, but
> hopefully we won't have to interact with that.
>
> So, I've got Basic AUTH going with http, but now I'm faced with the
> following questions, due to the fact that I need to pull some lists out of
> HTML, and then make some changes via POST or so, again over HTTP:
>
> 1) Would I be better off just regex'ing the html I'm getting back? (I
> suppose this depends on the complexity of the html received, eh?)
Unlikely. Regular expressions alone cannot parse HTML reliably. If they could, we wouldn't need HTML parsers :)
>
> 2) Would I be better off feeding the HTML into an HTML parser, and then
> traversing that datastructure (is that really how it works?)?
Very probably. There are a few in the standard library, and a whole lot of third party modules with various APIs, each with a particular kind of use in mind.
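To give a rough idea of what "feeding the HTML into a parser" looks like, here is a minimal sketch using the standard library's event-based parser (spelled html.parser in current Python). It assumes the lists you want are plain <li> elements, which is only a guess about your pages; ListItemCollector and extract_list_items are names made up for this example.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text content of every <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished list-item strings
        self._in_item = False  # True while between <li> and </li>
        self._buf = []         # text fragments of the current item

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            self.items.append("".join(self._buf).strip())
            self._in_item = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self._in_item:
            self._buf.append(data)


def extract_list_items(html):
    """Return the text of each <li> in the given HTML string."""
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return parser.items
```

This style hands you events rather than a tree you traverse. Some of the third party modules build a tree instead, which may be more convenient if the pages are deeply nested.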
>
> 3) When I retrieve stuff over http, it's clear that the web server is
> sending some kind of odd gibberish, which the python urllib2 API is
> passing on to me. In a packet trace, it looks like:
Note that the Transfer-Encoding is "chunked". "ef1" is a hex-encoded chunk length prefix. One of these appears before each group of bytes sent to the client, and it tells you how many bytes of application-level data to expect in that chunk. Not having used urllib2 extensively, I can't say whether it is normal for you to be getting the chunk lengths from it or not, but it would _seem_ like something you should not be receiving (that is, if I had written urllib2, you would not receive them ;).
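If you ever did have to strip the chunk framing by hand, the idea is simple enough. The following is only a sketch, assuming you have the complete raw body in memory as bytes and that it is well-formed; decode_chunked is a name invented for this example, and it ignores any trailer headers after the final chunk.

```python
def decode_chunked(body):
    """Remove HTTP/1.1 chunked framing from a complete, well-formed body."""
    out = []
    pos = 0
    while True:
        # Each chunk starts with "<hex length>[;extensions]\r\n".
        eol = body.index(b"\r\n", pos)
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            # A zero-length chunk marks the end of the body.
            break
        out.append(body[pos:pos + size])
        # Skip the chunk data plus the CRLF that follows it.
        pos += size + 2
    return b"".join(out)
```

For scale, a prefix of "ef1" means the chunk that follows carries 0xef1 = 3825 bytes.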
Jp