HTML parsing? Or just simple regex'ing?

Dan Stromberg strombrg at dcs.nac.uci.edu
Tue Nov 9 20:37:46 EST 2004


I'm writing a program that will synchronize one database with another.
For the source database, we can just use the Python Sybase API; that's
nice and normal.

For the target database, unfortunately, the only interface we have access
to is HTTP+HTML based.  There's some JavaScript involved too, but
hopefully we won't have to interact with that.

So I've got Basic auth working over HTTP (roughly the setup sketched
after the questions below), but now I'm faced with the following
questions, since I need to pull some lists out of the HTML and then make
some changes via POST or so, again over HTTP:

1) Would I be better off just regex'ing the HTML I'm getting back?  (I
suppose this depends on the complexity of the HTML received, eh?)

2) Would I be better off feeding the HTML into an HTML parser, and then
traversing the resulting data structure?  (Is that really how it works?
See the second sketch below for the sort of thing I'm imagining.)

3) When I retrieve stuff over HTTP, it's clear that the web server is
sending some kind of odd gibberish, which the Python urllib2 API is
passing on to me.  In a packet trace, it looks like:

Date: Wed, 10 Nov 2004 01:09:47 GMT^M
Server: Apache/1.3.29 (Unix)  (Red-Hat/Linux) mod_perl/1.23^M
Keep-Alive: timeout=15, max=98^M
Connection: Keep-Alive^M
Transfer-Encoding: chunked^M
Content-Type: text/html^M
^M
ef1^M
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">^M
<html>^M
<head>^M

...and so on.  It seems to me that "ef1" should not be there.  Is that
true?  What -is- that nonsense?  It's not the same string every time, and
it doesn't show up in a web browser.
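
For reference, the Basic auth setup I mentioned above is roughly the
following.  The URL and credentials here are placeholders, not the real
ones:

import urllib2

# Placeholder URL and credentials; the real ones obviously differ.
url = 'http://target.example.com/admin/list'

# Register the username/password for Basic auth.  Passing None as the
# realm makes the password manager match any realm at this URL.
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, 'username', 'password')

auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)

# GET the page containing one of the lists I need to pull out.
response = opener.open(url)
html = response.read()

The POST side would presumably be opener.open(post_url,
urllib.urlencode(fields)) or something along those lines, but I haven't
gotten that far yet.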
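
And for question 2, here's the sort of thing I was imagining, using the
standard library HTMLParser module.  Note that this is event-driven
rather than a tree I can walk, which is partly why I'm asking whether
this is really how it's done.  The assumption that the lists live in
<td> cells is made up for illustration; the real pages may use something
else entirely.

from HTMLParser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell into a list."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

parser = CellCollector()
parser.feed(html)     # html as read via urllib2 above
parser.close()
print parser.cells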


Thanks!



