Extending/embedding versus separation

Thu Mar 28 13:50:03 EST 2002

On Thu, 28 Mar 2002 11:12:09 -0000, "skoria" <shomon at softhome.net> wrote:

>Hi
>
>Thanks for your help. No, I didn't write the hash tables. I'm writing
>a stats program based on webalizer. Webalizer and Analog were the
>fastest and least memory hungry of all the programs I know of, and
>webalizer is GPL so I was able to use it as a basis for my work, so I
>get to make free software at work! The hash tables are already written
>there. 
>
>This program is intended to run on all the sites hosted by the company
>I work for, and also on sites not hosted by it, hence the memory/disk
>space concerns. 
>
>The way I see it, the python side of things will be a python script
>that imports the C parts, and can therefore process the logfiles and
>turn them into hash tables. These are then fed back to the python
>part, which processes them further, and turns them into graphs and html. 
>
>The reason for this is I need to be able to quickly develop and expand
>on the reports produced, so we can do complicated things like, say,
>visitor paths or percentage increase graphs.
>
OTTOMH, I would guess that you may need to generate some special
representations of visitor paths directly from the logs, unless
webalizer already does that. But if you have to process the
raw logs in a new way, it will most likely be easier to get right
in Python than by cannibalizing existing C.
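
To make "visitor paths" a bit more concrete, here is a rough sketch
in present-day Python 3. It is only an illustration under assumptions
that are not from this thread: the log is in Apache Common Log Format,
a "visitor" is simply a client host, and the names CLF, visitor_paths
and SAMPLE as well as the sample lines are made up.

```python
import re
from collections import defaultdict

# Assumed layout (Apache Common Log Format):
#   host ident user [timestamp] "METHOD url PROTOCOL" status bytes
CLF = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)[^"]*"')

def visitor_paths(lines):
    """Map each client host to the URLs it requested, in log order."""
    paths = defaultdict(list)
    for line in lines:
        m = CLF.match(line)
        if m:  # silently skip lines that do not look like CLF
            host, url = m.groups()
            paths[host].append(url)
    return dict(paths)

# Hypothetical sample lines, invented for this illustration.
SAMPLE = [
    '192.168.2.1 - - [28/Mar/2002:10:00:01 -0500] "GET /index.html HTTP/1.0" 200 1043',
    '192.168.1.100 - - [28/Mar/2002:10:00:05 -0500] "GET /index.html HTTP/1.0" 200 1043',
    '192.168.2.1 - - [28/Mar/2002:10:00:09 -0500] "GET /stats/report.html HTTP/1.0" 200 2210',
]
```

Calling visitor_paths(SAMPLE) gives one list of URLs per host. A real
report would probably also want to split visits on time gaps, which
this sketch ignores.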

I'd make sure the raw logs contain the data you actually
need for your purposes. Hashes are easy in Python. E.g., here is a
tiny example counting hits per host on my little private net. The first
run is on the server itself, an old P90 with 48MB RAM and a slow PIO disk,
running Slackware Linux and Apache. The second run is the same script on
an NT4 P2 300MHz box with 320MB RAM, which is not a real screamer these
days, but went about 10x faster ;-) The Python on the Slackware box is
1.5.2 vs 2.2 on NT, so I used only 1.5.2 features (something to consider).

=======< snip >========= 
~/misc$ cat hosthits.py
#!/usr/bin/python
import string, sys, time
tstart = time.clock()
f = open(sys.argv[1])
d = {}
while 1:
    host = string.split(f.readline(),' ',1)[0]
    if not host: break
    if d.has_key(host):
        d[host] = d[host] + 1
    else:
        d[host] = 1
it = d.items()
it.sort()
hits = 0
for k,v in it:
    hits = hits + v
    print '%6d hits came from %s' % (v,k)
print '------\n%6d total hits.' % hits
print '\nTotal time: ', time.clock()-tstart

~/misc$ v /var/log/access_log
-rw-r--r--   1 root     root      1699460 Mar 28 11:00 /var/log/access_log
~/misc$ python hosthits.py /var/log/access_log
  1913 hits came from 192.168.1.100
 15240 hits came from 192.168.2.1
     6 hits came from 192.168.2.3
------
 17159 total hits.

Total time:  13.83
=======< snip >========= 

[10:42] C:\pywk\misc>python hosthits.py access_log
  1912 hits came from 192.168.1.100
 15240 hits came from 192.168.2.1
     6 hits came from 192.168.2.3
------
 17158 total hits.

Total time:  1.2641499788

About 10x faster, and about 13,500 log lines/sec.
Recent boxes are a good bit faster yet. So if you
have any kind of recent computer, you
can crunch logs pretty easily. For prototyping,
I'd say don't worry about any C advantage in hashing.
Python does the actual hash work in C anyway.
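
For comparison, here is roughly how the same per-host count might look
in present-day Python 3 (an assumption; hosthits.py above sticks to
1.5.2 features). The function names count_hosts and report are made up
for this sketch. In CPython the hashing still happens inside the
interpreter's C dict code, since Counter is a dict subclass.

```python
from collections import Counter

def count_hosts(lines):
    """Count log lines per client host (the first whitespace-separated field)."""
    counts = Counter()
    for line in lines:
        fields = line.split(None, 1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts

def report(counts):
    """Format the counts in the same layout that hosthits.py prints."""
    out = []
    for host, n in sorted(counts.items()):
        out.append('%6d hits came from %s' % (n, host))
    out.append('------')
    out.append('%6d total hits.' % sum(counts.values()))
    return '\n'.join(out)
```

An open file can be passed straight in, e.g.
print(report(count_hosts(open(sys.argv[1])))) after importing sys.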

>As I see it now, the only difference I get if extending python with my
>adapted webalizer, is that I save writing a little output to file and
>parsing it into python again.   
>
>Is this going to be worthwhile?
>
A lot depends on what you are fluent in. But it may be less work to
redo some parsing/hash generation in Python than to capture some C
stuff via extensions -- and then have to do custom processing of
the raw logs anyway (if you foresee that).
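
If you do keep the C program separate, the "write a little output and
parse it again" step is small on the Python side. A minimal sketch,
assuming a purely hypothetical dump format of one "key<TAB>count" pair
per line (this is not webalizer's actual output format, and load_counts
is an invented name):

```python
def load_counts(lines):
    """Parse 'key<TAB>count' lines (hypothetical format) into a dict."""
    counts = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # ignore blank lines
        key, value = line.split('\t', 1)
        counts[key] = int(value)
    return counts
```

With a dump like that, load_counts(open(filename)) puts the table back
into a Python dict, so the cost of separation is mostly disk I/O.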

My bet is you'd be better off doing the whole thing in Python, and
then worrying about optimizing. IOW, I second John's advice.
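
When it does come time to worry about optimizing, measuring first tells
you where the time actually goes. A sketch using the standard cProfile
and pstats modules of present-day Python 3 (an assumption; they are not
in 1.5.2). The helper names profile_call and count_hosts are made up.

```python
import cProfile
import io
import pstats

def count_hosts(lines):
    """Plain-dict hit counter, standing in for the code being measured."""
    counts = {}
    for line in lines:
        fields = line.split(None, 1)
        if fields:
            counts[fields[0]] = counts.get(fields[0], 0) + 1
    return counts

def profile_call(func, *args):
    """Run func(*args) under the profiler; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats('cumulative').print_stats(5)  # top 5 by cumulative time
    return result, buf.getvalue()
```

Only if the report shows the hashing or parsing to be the bottleneck
would moving that piece to C look worthwhile.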

Regards,
Bengt Richter