[Tutor] HTML Parsing

Tue Apr 22 00:51:36 CEST 2008

Stephen Nelson-Smith wrote:
> Hi,
> 
> I want to write a little script that parses an apache mod_status page.
> 
> I want it to return simply the number of page requests a second and
> the number of connections.

> The page looks like this:
> 
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <html><head>
> <title>Apache Status</title>
> </head><body>
> <h1>Apache Server Status for 10.1.2.201</h1>
> 
> <dl><dt>Server Version: Apache/2.0.46 (Red Hat)</dt>
> <dt>Server Built: Aug  1 2006 09:25:45
> </dt></dl><hr /><dl>
> <dt>Current Time: Monday, 21-Apr-2008 14:29:44 BST</dt>
> <dt>Restart Time: Monday, 21-Apr-2008 13:32:46 BST</dt>
> <dt>Parent Server Generation: 0</dt>
> <dt>Server uptime:  56 minutes 58 seconds</dt>
> <dt>Total accesses: 10661 - Total Traffic: 101.5 MB</dt>
> <dt>CPU Usage: u6.03 s2.15 cu0 cs0 - .239% CPU load</dt>
> <dt>3.12 requests/sec - 30.4 kB/second - 9.7 kB/request</dt>
> <dt>9 requests currently being processed, 11 idle workers</dt>
> </body></html>
> 
> How can/should I do this?

For data this predictable, simple regex matching will probably work fine.

If 'data' is the above text, then this seems to get what you want:

In [17]: import re
In [18]: re.search(r'[\d.]+ requests/sec', data).group()
Out[18]: '3.12 requests/sec'
In [19]: re.search(r'\d+ requests currently being processed', data).group()
Out[19]: '9 requests currently being processed'

Kent