subclassing collections.Counter

Pavlos Parissis pavlos.parissis at gmail.com
Tue Dec 15 17:18:27 EST 2015


On 15/12/2015 06:55 PM, Ian Kelly wrote:
> On Tue, Dec 15, 2015 at 10:43 AM, Pavlos Parissis
> <pavlos.parissis at gmail.com> wrote:
>>> If you want your metrics container to act like a dict, then my
>>> suggestion would be to just use a dict, with pseudo-collections for
>>> the values as above.
>>>
>>
>> If I understood you correctly, you are saying store all metrics in a
>> dict and have a counter key as well to store the times metrics are
>> pushed in, and then have a function to do the math. Am I right?
> 
> That would work, although I was actually thinking of something like this:
> 
> class SummedMetric:
>     def __init__(self):
>         self.total = 0
>         self.count = 0
> 
>     @property
>     def average(self):
>         return self.total / self.count
> 
>     def add(self, value):
>         self.total += value
>         self.count += 1
> 
> metrics = {}
> for metric_name in all_metrics:
>     metrics[metric_name] = SummedMetric()
> 
> For averaged metrics, look at metrics['f'].average, otherwise look at
> metrics['f'].total.
> 

With this approach I will have one object per metric, which could
cause performance issues in my case.

Let me give some context on what I am trying to do here.
I want to provide fast retrieval and processing of statistics metrics
for HAProxy.

HAProxy exposes stats over a UNIX socket (the stats socket).
HAProxy is a multi-process daemon, and each process can only be
accessed through its own distinct stats socket. There is no shared
memory across these processes. That means that if a frontend or
backend is managed by more than one process, you have to collect
metrics from all processes and compute the sum or the average,
depending on the type of the metric.
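
For example, summing the per-process stats with plain Counters and
dividing by the number of processes for the averaged metrics could
look like this minimal sketch (the metric names and their split into
summed vs. averaged sets are only illustrative):

from collections import Counter

# Illustrative split of metric names by aggregation type.
SUM_METRICS = {'stot', 'bin', 'bout'}  # totals: add across processes
AVG_METRICS = {'rate', 'qtime'}        # averages: divide by process count

def aggregate(per_process_stats):
    """Combine one {metric: value} dict per HAProxy process into a
    single view, summing totals and averaging rate-like metrics."""
    totals = Counter()
    wanted = SUM_METRICS | AVG_METRICS
    for stats in per_process_stats:
        totals.update({k: v for k, v in stats.items() if k in wanted})
    nprocs = len(per_process_stats)
    return {k: v / nprocs if k in AVG_METRICS else v
            for k, v in totals.items()}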

Stats are provided in CSV format:
https://gist.github.com/unixsurfer/ba7e3bb3f3f79dcea686

There is one line per frontend and backend; for servers it is a bit
more complicated.
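
Parsing one dump with the csv module could look like this minimal
sketch (it assumes the usual 'show stat' output, whose header line is
prefixed with '# ' and whose lines carry a trailing comma):

import csv

def parse_stats(path):
    """Parse one 'show stat' CSV dump into a list of row dicts."""
    with open(path) as f:
        # the header looks like '# pxname,svname,qcur,...'; the strip
        # drops the '# ' prefix and the trailing comma/newline
        header = f.readline().strip('# \n,').split(',')
        # pxname is the proxy name; svname distinguishes FRONTEND,
        # BACKEND and individual server lines
        return [dict(zip(header, row)) for row in csv.reader(f) if row]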

When there are 100 lines per process, the work is easy even in setups
with 24 processes (24 * 100 = 2.4K lines). But there are a lot of
cases where a stats socket will return 10K lines, due to the number
of backends and of servers within backends. That is 240K lines to
process in order to provide stats every 5 or 10 seconds.

My plan is to split the processing from the collection.
One program will connect to all UNIX sockets asynchronously and dump
the CSV to files, one per socket, grouping them by epoch time: all
files from one retrieval go under a single directory named after the
time of the retrieval.
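
A simplified, sequential version of that collection step could look
like the sketch below ('show stat' is the stats socket command; the
socket paths and file naming are illustrative, and the real program
would read the sockets asynchronously):

import os
import socket
import time

def dump_stats(socket_paths, dest_root='/tmp/test'):
    """Dump one 'show stat' CSV per stats socket, all grouped under
    an epoch-named directory."""
    dest = os.path.join(dest_root, str(int(time.time())))
    os.makedirs(dest)
    for i, path in enumerate(socket_paths):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(path)
        sock.sendall(b'show stat\n')
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HAProxy closes the socket after the reply
                break
            chunks.append(data)
        sock.close()
        with open(os.path.join(dest, 'process-%d.csv' % i), 'wb') as f:
            f.write(b''.join(chunks))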

Another program, in multi-process mode [1], will pick up those files
and parse them sequentially to perform the aggregation. It is for
this program that I needed the CounterExt.
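
The rough idea behind it is a Counter subclass that also remembers
how many samples were added per key, so averages can be derived from
the totals; something like this minimal sketch (not necessarily the
exact CounterExt, and it assumes mapping-style updates):

from collections import Counter

class CounterExt(Counter):
    """A Counter that also tracks the number of samples per key,
    so averages can be derived from the totals."""
    def __init__(self, *args, **kwargs):
        # samples must exist before Counter.__init__() calls update()
        self.samples = Counter()
        super().__init__(*args, **kwargs)

    def update(self, mapping=None, **kwargs):
        # assumes mapping input ({metric: value}), which is the shape
        # the parsed per-process stats arrive in
        if mapping:
            super().update(mapping)
            self.samples.update(mapping.keys())
        if kwargs:
            super().update(kwargs)
            self.samples.update(kwargs.keys())

    def average(self, key):
        return self[key] / self.samples[key]

c = CounterExt()
c.update({'bin': 100})
c.update({'bin': 200})
c.average('bin')  # -> 150.0

Note that Counter's binary operations (e.g. c1 + c2) return plain
Counter objects and would drop the sample counts, so aggregation has
to go through update().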

I will try your approach as well, since it is very simple and does
the job with fewer lines :-) I will compare both in terms of
performance and pick the fastest.

Thank you very much for your assistance, very much appreciated.



[1] pseudo-code
from multiprocessing import Process, Queue

import pyinotify

wm = pyinotify.WatchManager()  # Watch Manager
mask = pyinotify.IN_CREATE  # watched events

class EventHandler(pyinotify.ProcessEvent):
    def my_init(self, queue):
        # my_init() is called by pyinotify.ProcessEvent.__init__()
        # with the keyword arguments given to the constructor
        self.queue = queue

    def process_IN_CREATE(self, event):
        # hand the path of each newly created stats file to a worker
        self.queue.put(event.pathname)

def work(queue):
    while True:
        job = queue.get()
        if job == 'STOP':
            break
        print(job)  # placeholder for the actual parsing/aggregation

def main():
    pnum = 10
    queue = Queue()
    plist = []
    for _ in range(pnum):
        p = Process(target=work, args=(queue,))
        p.start()
        plist.append(p)

    handler = EventHandler(queue=queue)
    notifier = pyinotify.Notifier(wm, handler)
    wm.add_watch('/tmp/test', mask, rec=True)
    try:
        notifier.loop()
    finally:
        # shut the workers down cleanly when the notifier stops
        for _ in plist:
            queue.put('STOP')
        for p in plist:
            p.join()

if __name__ == '__main__':
    main()


