html parsing? Or just simple regex'ing?

Wed Nov 10 06:26:04 EST 2004

> For the target database, unfortunately, the only interface we have access
> to is http+html based.  There's some javascript involved too, but
> hopefully we won't have to interact with that.

Argl. That sounds like a royal pain somewhere...

> So, I've got Basic AUTH going with http, but now I'm faced with the
> following questions, due to the fact that I need to pull some lists out of
> HTML, and then make some changes via POST or so, again over HTTP:
> 
> 1) Would I be better off just regex'ing the html I'm getting back?  (I
> suppose this depends on the complexity of the html received, eh?)
> 
> 2) Would I be better off feeding the HTML into an HTML parser, and then
> traversing that datastructure (is that really how it works?)?

I personally would certainly go that way. The best thing IMHO would be to
build a DOM tree out of the html and then work on it with XPath; 4suite
might be good for that. While this seems a bit overengineered at first,
XPath allows pretty powerful queries against your DOM tree, so even
larger changes in the "interface" can be coped with. And writing an
HTMLParser-based class isn't hard, either.
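
To give an idea of how small such an HTMLParser-based class can be, here is
a minimal sketch. It assumes the lists you need are plain <li> items and
uses the html.parser module from current Python (the module was called
HTMLParser at the time); the names ListItemCollector and extract_list_items
are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text of every <li> element in a document."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished item texts
        self._current = None   # text pieces of the <li> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # An <li> without a closing tag is ended by the next <li>.
            self._flush()
            self._current = []

    def handle_endtag(self, tag):
        if tag in ("li", "ul", "ol"):
            self._flush()

    def handle_data(self, data):
        # Only keep text that appears inside an open <li>.
        if self._current is not None:
            self._current.append(data)

    def _flush(self):
        if self._current is not None:
            self.items.append("".join(self._current).strip())
            self._current = None


def extract_list_items(html):
    """Hypothetical helper: return the text of all <li> items in html."""
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling extract_list_items() on the page text gives you a plain Python list
to work with. Nested lists are flattened by this sketch, so a real page may
need a bit more state than this.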

> 
> 3) When I retrieve stuff over http, it's clear that the web server is
> sending some kind of odd gibberish, which the python urllib2 API is
> passing on to me.  In a packet trace, it looks like:
> 
> Date: Wed, 10 Nov 2004 01:09:47 GMT^M
> Server: Apache/1.3.29 (Unix)  (Red-Hat/Linux) mod_perl/1.23^M
> Keep-Alive: timeout=15, max=98^M
> Connection: Keep-Alive^M
> Transfer-Encoding: chunked^M
> Content-Type: text/html^M
> ^M
> ef1^M
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">^M
> <html>^M
> <head>^M
> 
> ...and so on.  It seems to me that "ef1" should not be there.  Is that
> true?  What -is- that nonsense?  It's not the same string every time, and
> it doesn't show up in a web browser.

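That isn't gibberish from the web application. Look at the headers in your
trace: "Transfer-Encoding: chunked". With that encoding the body is sent in
chunks, and each chunk is preceded by a line giving its length in
hexadecimal. "ef1" is such a length (0xef1 = 3825 bytes), which is why it
isn't the same string every time.

So neither apache nor whoever wrote that webapplication is at fault. A
packet trace shows the raw wire format, and the chunk framing is part of it.

For the ignoring part: web browsers don't ignore it, they decode the chunked
body before rendering, so the sizes never show up there. An HTTP/1.1 client
library is supposed to do the same. If the chunk sizes really appear in the
data urllib2 hands you, and not only in the trace, the client side isn't
decoding the body and you'd have to strip the framing yourself.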
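
The trace above shows "Transfer-Encoding: chunked", and under that encoding
each chunk of the body is preceded by its length in hex, which would account
for the "ef1" line. In case the framing has to be removed by hand, here is a
minimal sketch of a decoder. The function name decode_chunked is made up for
illustration; it assumes the complete raw body is already in memory as bytes
and it does not parse trailer headers.

```python
def decode_chunked(raw):
    """Decode an HTTP/1.1 chunked message body given as bytes."""
    body = []
    pos = 0
    while True:
        # Each chunk starts with a size line terminated by CRLF.
        line_end = raw.index(b"\r\n", pos)
        size_field = raw[pos:line_end].split(b";")[0]  # drop chunk extensions
        size = int(size_field.decode("ascii").strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; trailer headers, if any, are ignored
        body.append(raw[pos:pos + size])
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return b"".join(body)
```

Normally you shouldn't need this, since the HTTP library should do the
decoding for you; it is mainly useful for making sense of a packet trace.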

-- 
Regards,

Diez B. Roggisch