Efficient counting of results

Steve D'Aprano steve+python at pearwood.info
Thu Oct 19 21:18:53 EDT 2017


On Fri, 20 Oct 2017 05:28 am, Israel Brewster wrote:

> If it helps, my data would look something like this:
> 
> [ (date, key, t1, t2),
>  (date, key, t1, t2)
> .
> .
> ]
> 
> Where the date and the key are what is used to determine what "on-time" is
> for the record, and thus which "late" bin to put it in.

It might help if you discuss it in less generic and more concrete terms. E.g.
are these invoices? The date field could be the invoice date, the "key" field
(aren't they all keys?) might be the due date.

What t1 and t2 are, I have no idea. Your code there suggests that they are
fields in your data records, but the contents of the fields, who knows?


> So if the date of 
> the first record was today, t1 was on-time, and t2 was 5 minutes late, then
> I would need to increment ALL of the following (using your data structure
> from above):
> 
> d10, w10, m10, y10, d25, w25, m25 AND y25

Try using descriptive variable names rather than these cryptic codes.

I don't understand what is *actually* being computed here -- you say that t1
is "on time" and t2 is "5 minutes late", but that's a contradiction: how can
a single record be both on time and 5 minutes late?

It also contradicts your statement that it is *date* and *key* that determine
which late bin to use. Rather, it seems that date and key are irrelevant and
can be ignored; it is only t1 and t2 which determine which late bins to
update.

Another question: you're talking about *dates*, which implies a resolution of
1 day, but then you talk about records being "five minutes late" which
implies a resolution of at least five minutes and probably one minute, if not
seconds or milliseconds. Which is it? My guess is that you're probably
talking about *timestamps* (datetimes) rather than *dates*.


> Since this record counts not just for the current day, but also for
> week-to-date, month-to-date and year-to-date. Basically, as the time
> categories get larger, the percentage of the total records included in that
> date group also gets larger. The year-to-date group will include all
> records, grouped by lateness, the daily group will only include today's
> records.

With the proviso that I have no confidence at all that I understand your
specification or requirements, and that I'm *guessing* a spec that makes
sense to me, I'll suggest a couple of approaches.

Hypotheses: you have data with a timestamp recording when each record
becomes "active" in some sense (e.g. when an invoice becomes due for
payment), and a second timestamp representing the date/time "now" (at the start
of the computation?). You want to process all your records, and decide "as of
now, how late is each record", and then report *cumulative* subtotals for a
number of arbitrary groups: not late yet, five minutes late, one day late,
one year late, etc.

Suggestion:

Start with just the "activation time" and "now", and calculate the difference.
If they are both given in seconds, you can just subtract:

    lateness = now - activation_time

to determine how late that record is. If they are datetime objects, subtracting
them gives you a timedelta object.
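
For example, here is a minimal sketch, assuming both values are datetime
objects (the names and sample values are made up for illustration):

    from datetime import datetime, timedelta

    now = datetime.now()
    activation_time = datetime(2017, 10, 19, 12, 0)  # hypothetical example
    lateness = now - activation_time         # a timedelta
    seconds_late = lateness.total_seconds()  # or convert to plain seconds
    print(lateness <= timedelta(minutes=5))  # at most five minutes late?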

That *single* computed field, the lateness, is enough to determine which
subtotals need to be incremented. Start by dividing all of time into named
buckets, in numeric order:

lateness <= 0: on_time
0 < lateness <= five minutes: five_minutes_late
five minutes < lateness <= sixty minutes: one_hour_late
sixty minutes < lateness <= 24 hours: one_day_late
24 hours < lateness <= seven days: one_week_late

etc. Notice that the buckets don't overlap.

(The buckets refer to the *maximum* lateness, which is the opposite of standard
book-keeping practice, which uses the minimum. In accounting and book-keeping,
if an invoice is 45 days past the due date, it would fall into the single
bucket "30 days past due". When it reaches 60 days past the due date, it
moves into the "60 days past due" bucket, where it will stay until it gets to
90 days overdue.)

Convert the times to a common unit (seconds), and build a list for each
bucket, using the *upper bound* as key:


buckets = [(0, on_time),
           (5*60, five_minutes_late),  # i.e. *up to* five minutes late
           (60*60, one_hour_late),  # i.e. *up to* one hour late
           (24*60*60, one_day_late), 
           ...
           (float('inf'), unbelievably_late)]

The buckets might be lists, to which you append the record; or integer counts,
which you add 1 to; or running subtotals. I'll assume that they are lists, as
that is the most flexible.
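
For instance, a minimal setup, with each named bucket above being just an
empty list created before building the buckets table:

    on_time = []
    five_minutes_late = []
    one_hour_late = []
    one_day_late = []
    # ... one list per bucket ...
    unbelievably_late = []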

Now process your records:

for record in records:
    lateness = now - record.activation_time
    for end, bucket in buckets:
        if lateness <= end:
            # First bucket whose upper bound covers this lateness.
            bucket.append(record)
            break

And you're done!
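
(If you have a lot of buckets, you could skip the inner loop and find the
right bucket directly with the bisect module, since the upper bounds are
already sorted. A rough sketch, assuming the buckets list above and lateness
measured in seconds:

    import bisect

    ends = [end for end, bucket in buckets]  # sorted upper bounds

    for record in records:
        lateness = now - record.activation_time
        index = bisect.bisect_left(ends, lateness)  # first end >= lateness
        buckets[index][1].append(record)

For a handful of buckets the plain loop is perfectly fine.)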

If you want the *number of records* in a particular bucket, you say:

len(bucket)

If you want the total record amount, you say:

sum(record.total for record in bucket)


(assuming your records also have a "total" field, if they're invoices say).


I hope that's even vaguely helpful.




> Maybe that will help clear things up. Or not. :-)


Not even a tiny bit :-(





-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



