[Tutor] HTML Parsing

Stephen Nelson-Smith sanelson at gmail.com
Mon Apr 21 14:35:49 CEST 2008


Hi,

I want to write a little script that parses an apache mod_status page.

I want it to return simply the number of page requests per second and
the number of connections.

It seems this is very complicated... I can do it in a shell one-liner:

curl 10.1.2.201/server-status 2>&1 | grep -i request | grep dt | {
  IFS='> ' read _ rps _
  IFS='> ' read _ currRequests _ _ _ _ idleWorkers _
  echo $rps $currRequests $idleWorkers; }

But that's horrid.

So is:

$ eval `printf '<dt>3 requests currently being processed, 17 idle workers</dt>\n<dt>2.82 requests/sec - 28.1 kB/second - 10.0 kB/request</dt>\n' | sed -nr '/<dt>/ { N; s@<dt>([0-9]*)[^,]*, *([0-9]*).*<dt>([0-9.]*).*@workers=$((\1+\2));requests=\3@p; }'`
$ echo "workers: $workers reqs/sec $requests"
workers: 20 reqs/sec 2.82

The page looks like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html><head>
<title>Apache Status</title>
</head><body>
<h1>Apache Server Status for 10.1.2.201</h1>

<dl><dt>Server Version: Apache/2.0.46 (Red Hat)</dt>
<dt>Server Built: Aug  1 2006 09:25:45
</dt></dl><hr /><dl>
<dt>Current Time: Monday, 21-Apr-2008 14:29:44 BST</dt>
<dt>Restart Time: Monday, 21-Apr-2008 13:32:46 BST</dt>
<dt>Parent Server Generation: 0</dt>
<dt>Server uptime:  56 minutes 58 seconds</dt>
<dt>Total accesses: 10661 - Total Traffic: 101.5 MB</dt>
<dt>CPU Usage: u6.03 s2.15 cu0 cs0 - .239% CPU load</dt>
<dt>3.12 requests/sec - 30.4 kB/second - 9.7 kB/request</dt>
<dt>9 requests currently being processed, 11 idle workers</dt>
</body></html>
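
For what it's worth, a rough Python sketch of the extraction I'm after might look like the following. It assumes the two <dt> lines always have exactly the wording shown in the sample above; parse_status and the two regexes are made-up names for illustration, and fetching the page is left out entirely.

```python
import re

# Hypothetical helper: the names and regexes are illustrative assumptions,
# based only on the two <dt> lines in the sample page above.
RPS_RE = re.compile(r"<dt>([0-9.]+) requests/sec")
WORKERS_RE = re.compile(
    r"<dt>(\d+) requests currently being processed, (\d+) idle workers"
)


def parse_status(html):
    """Return (requests_per_sec, busy, idle) from a mod_status page."""
    rps = RPS_RE.search(html)
    workers = WORKERS_RE.search(html)
    if rps is None or workers is None:
        # Fail loudly rather than return made-up numbers.
        raise ValueError("status lines not found; page layout may differ")
    return float(rps.group(1)), int(workers.group(1)), int(workers.group(2))


# A few lines copied from the sample page, so no network access is needed.
sample = """<dt>CPU Usage: u6.03 s2.15 cu0 cs0 - .239% CPU load</dt>
<dt>3.12 requests/sec - 30.4 kB/second - 9.7 kB/request</dt>
<dt>9 requests currently being processed, 11 idle workers</dt>
"""

rps, busy, idle = parse_status(sample)
print(rps, busy, idle)  # prints: 3.12 9 11
```

This only pattern-matches the text; it does not parse the HTML, so it would break if the wording of those lines changed.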

How can/should I do this?

S.

