Efficient counting of results

Fri Oct 20 13:05:15 EDT 2017

On Oct 19, 2017, at 5:18 PM, Steve D'Aprano <steve+python at pearwood.info> wrote:
> What t1 and t2 are, I have no idea. Your code there suggests that they are
> fields in your data records, but the contents of the fields, who knows?

t1 and t2 are *independent* timestamp fields. My apologies - I made the obviously false assumption that it was clear they were timestamps, or at least times based on the fact I was calculating "minutes late" based on them.

> 
> 
>> d10, w10, m10, y10, d25, w25, m25 AND y25
> 
> Try using descriptive variable names rather than these cryptic codes.

I did. In my original post I showed the table with names like "t1 1-5min". Granted, that's for illustration purposes, not actual code, but still, more descriptive. These codes were just to keep consistent with the alternative data format suggested :-)

> 
> I don't understand what is *actually* being computed here -- you say that t1
> is "on time" and t2 is "5 minutes late", but that's a contradiction: how can
> a single record be both on time and 5 minutes late?

Easily: because the record contains two DIFFERENT times. Since you want more concrete, we're talking departure and arrival times here. Quite easy to depart on-time, but arrive late, or depart late but arrive on-time.

> It also contradicts your statement that it is *date* and *key* that determines
> which late bin to use.

I never made such a statement. I said they are used to determine "WHAT on-time IS for the record", not WHETHER the record is on-time or not, and certainly not which late bin to use. To put it a different way, those are the key to a lookup table that tells me what T1 and T2 are *supposed* to be in order for *each one* to be on time.

So, for example, to completely make up some data (since it doesn't matter in the slightest), date could be 10/5/17 with a key of 42 (Let's say that is a driver ID to keep things concrete for you), and using those values tells me (via the lookup table) that on 10/5/17, 42 should have a T1 of 10:15 and a T2 of 11:30. As we said, those would be departure and arrival times, so what we're saying is that on 10/5, driver #42 was *scheduled* to depart at 10:15 and arrive at their destination at 11:30. So if T1 was *actually* 10:14, and T2 was, say 11:35, then I could say that T1 was on-time (actually, a minute early, but that doesn't matter), while T2 was 5 minutes late. Maybe traffic was horrible, or he had a flat. 

However, if the date changed to 9/1/17 (with the key still being 42), there could be a completely different schedule, with completely different "late" results, even if the *actual* values of t1 and t2 don't change. Maybe he was supposed to make the run early on that day, say 10:00-11:15, but forgot and left at the same time as he was used to, thereby making him 14 minutes late departing. and really late arriving, or something.

> Rather, it seems that date and key are irrelevant and
> can be ignored, it is only t1 and t2 which determine which late bins to
> update.

Except that then we have no way to know what t1 and t2 *should* be. You apparently made the assumption that t1 and t2 should always be some fixed value. In fact, what t1 and t2 should be varies based on date and key (see the driver example above, or Chris Angelico's pizza example also works well). For any given date, there are dozens of different keys with different expected values of t1 and t2 (in the pizza example Chris gave the key might be order number), and for any given key, the expected value of t1 and t2 could vary based on what date it is (say we restart order numbers from 1 each day to make it easy to know how many orders we've done that day, or, of course, same driver different day, depending on which example you prefer).

> 
> Another question: you're talking about *dates*, which implies a resolution of
> 1 day, but then you talk about records being "five minutes late" which
> implies a resolution of at least five minutes and probably one minute, if not
> seconds or milliseconds. Which is it? My guess is that you're probably
> talking about *timestamps* (datetimes) rather than *dates*.

As stated, the data has two timestamp fields T1 and T2. So yes, the resolution of the data is "one minute" (we ignore sub-minute timings). However (and this addresses your understanding below as well), I am trying to get data for the date, week-to-date, month-to-date, and year-to-date. So there is four different "date" resolution bins in addition to the "minute" resolution bins.

Perhaps a better approach to explaining is to pose the question the report is trying to answer:

For the given date, how many departures were on time? How many were 1-5 minutes late? 6-15 minutes late? What about this week: how many on-time, 1-5 minutes late, etc? What about this entire month (including the given date)? What about this year (again, including the given date and month)? How about arrivals - same questions. 

As you can hopefully see now, if a departure happened this week, it probably also happened this month (although that is not necessarily the case, since weeks can cross month boundaries), and if it happened this date or this month, it *definitely* happened this year. As such, a given departure *likely* will be counted in multiple date "groups", if you will.

The end result should be a table like the one I posted in the original question: time frame covered on the horizontal axis (YTD, MTD etc.), and "late" groups for T1 and T2 on the vertical.

> You want to process all your records, and decide "as of
> now, how late is each record", and then report *cumulative* subtotals for a
> number of arbitrary groups: not late yet, five minutes late, one day late,
> one year late, etc.

Just to clarify, as stated, the late groups are not-late, 1-5 minutes late, and 6-15 minutes late. Also as stated in the original message, anything over 15 minutes late is dealt with separately, and therefore ignored for the purposes of this report.

> 
> Suggestion:
> 
> Start with just the "activation time" and "now", and calculate the difference.
> If they are both given in seconds, you can just subtract:
> 
>    lateness = now - activation_time
> 
> to determine how late that record is. If they are Datetime objects, use a
> Timedelta object.
> 
> That *single* computed field, the lateness, is enough to determine which
> subtotals need to be incremented.

Well, *two* computed fields, one for T1 and one for T2, which are counted separately.

> Start by dividing all of time into named
> buckets, in numeric order:
> 
> ...
> for record in records:
>    lateness = now - record.activation_date
>    for end, bucket in buckets:
>        if lateness <= end:
>            bucket.append(record)
>        else:
>            break
> 
> And you're done!
> 
> If you want the *number of records* in a particular bucket, you say:
> 
> len(bucket)
> 
> If you want the total record amount, you say:
> 
> sum(record.total for record in bucket)
> 
> 
> (assuming your records also have a "total" field, if they're invoices say).
> 
> 
> I hope that's even vaguely helpful.

In a sense, in that it supports my initial approach. 

As Stefan Ram pointed out, there is nothing wrong with the solution I have: simply using if statements around the calculated lateness of t1 and t2 to increment the appropriate counters. I was just thinking there might be tools to make the job easier/cleaner/more efficient. From the responses I have gotten, it would seem that that is likely not the case, so I'll just say "thank you all for your time", and let the matter rest.

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

> 
> 
> 
> 
>> Maybe that will help clear things up. Or not. :-)
> 
> 
> Not even a tiny bit :-(
> 
> 
> 
> 
> 
> -- 
> Steve
> “Cheer up,” they said, “things could be worse.” So I cheered up, and sure
> enough, things got worse.
> 
> -- 
> https://mail.python.org/mailman/listinfo/python-list