[Tutor] HTML Parsing

Andreas Kostyrka andreas at kostyrka.org
Mon Apr 21 15:34:00 CEST 2008


As usual, there are a number of ways to do this.

I basically see two steps here:

1.) Capture all dt elements. If you want to stick with the standard
library, htmllib would be the module. Otherwise you can use e.g.
BeautifulSoup or something comparable.

2.) Check the contents of each dt element, either with a regex or with
.startswith and string manipulation.
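
A minimal sketch of those two steps, assuming Python 3, where htmllib no
longer exists and html.parser.HTMLParser is the standard-library
counterpart. The names DtCollector and parse_status and the dictionary
keys are made up for illustration, and the regexes assume the exact
wording of the status page quoted below.

```python
import re
from html.parser import HTMLParser


class DtCollector(HTMLParser):
    """Step 1: collect the text content of every <dt> element."""

    def __init__(self):
        super().__init__()
        self.items = []    # finished <dt> texts
        self._buf = None   # text chunks while inside a <dt>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._flush()  # tolerate a <dt> that was never closed
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "dt":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def _flush(self):
        if self._buf is not None:
            # Join the chunks and collapse runs of whitespace/newlines.
            self.items.append(" ".join("".join(self._buf).split()))
            self._buf = None


def parse_status(html):
    """Step 2: pick the interesting numbers out of the <dt> texts."""
    parser = DtCollector()
    parser.feed(html)
    parser.close()

    stats = {}
    for text in parser.items:
        m = re.match(r"([\d.]+) requests/sec", text)
        if m:
            stats["requests_per_sec"] = float(m.group(1))
        m = re.match(
            r"(\d+) requests currently being processed, (\d+) idle workers",
            text,
        )
        if m:
            stats["busy"] = int(m.group(1))
            stats["idle"] = int(m.group(2))
    return stats
```

To run it against a live server, the page text could be fetched with
urllib.request.urlopen, decoded to a string, and passed to
parse_status().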

Andreas

On Monday, 21.04.2008 at 13:35 +0100, Stephen Nelson-Smith wrote:
> Hi,
> 
> I want to write a little script that parses an apache mod_status page.
> 
> I want it to return simply the number of page requests a second and
> the number of connections.
> 
> It seems this is very complicated... I can do it in a shell one-liner:
> 
> curl 10.1.2.201/server-status 2>&1 | grep -i request | grep dt | {
> IFS='> ' read _ rps _; IFS='> ' read _ currRequests _ _ _ _
> idleWorkers _; echo $rps $currRequests $idleWorkers   ; }
> 
> But that's horrid.
> 
> So is:
> 
> $ eval `printf '<dt>3 requests currently being processed, 17 idle
> workers</dt>\n <dt>2.82 requests/sec - 28.1 kB/second - 10.0
> kB/request</dt>\n' | sed -nr '/<dt>/ { N;
> s@<dt>([0-9]*)[^,]*,([0-9]*).*<dt>([0-9.]*).*@workers=$((\1+\2));requests=\3@p;
> }'`
> $ echo "workers: $workers reqs/sec $requests"
> workers: 20 reqs/sec 2.82
> 
> The page looks like this:
> 
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <html><head>
> <title>Apache Status</title>
> </head><body>
> <h1>Apache Server Status for 10.1.2.201</h1>
> 
> <dl><dt>Server Version: Apache/2.0.46 (Red Hat)</dt>
> <dt>Server Built: Aug  1 2006 09:25:45
> </dt></dl><hr /><dl>
> <dt>Current Time: Monday, 21-Apr-2008 14:29:44 BST</dt>
> <dt>Restart Time: Monday, 21-Apr-2008 13:32:46 BST</dt>
> <dt>Parent Server Generation: 0</dt>
> <dt>Server uptime:  56 minutes 58 seconds</dt>
> <dt>Total accesses: 10661 - Total Traffic: 101.5 MB</dt>
> <dt>CPU Usage: u6.03 s2.15 cu0 cs0 - .239% CPU load</dt>
> <dt>3.12 requests/sec - 30.4 kB/second - 9.7 kB/request</dt>
> <dt>9 requests currently being processed, 11 idle workers</dt>
> </body></html>
> 
> How can/should I do this?
> 
> S.