[Distutils] Deprecating download counts in API?

Donald Stufft donald at stufft.io
Mon Mar 6 06:34:10 EST 2017


> On Mar 6, 2017, at 1:41 AM, Glyph Lefkowitz <glyph at twistedmatrix.com> wrote:
> 
> 
>> On Mar 5, 2017, at 8:51 AM, Donald Stufft <donald at stufft.io <mailto:donald at stufft.io>> wrote:
>> 
>> So, as most folks are aware, PyPI has long had a cumulative download count available in its API. This has been broken on and off for a *long* time, and arguably the numbers in there have been “wrong” even when it was working, because we had no way to reproduce them from scratch (and thus whenever a bug occurred we’d flat out lose data or add incorrect data with no way to correct it).
>> 
>> In the meantime, we’ve gotten a much better source of querying for download information available inside of Google’s BigQuery database [1][2]. Not only is this able to be recreated “from scratch” so we can, if needed, fix massive data bugs but it provides MUCH more information than the previous downloads and a very powerful query language to go along with it.
>> 
>> Unless there is some sort of massive outcry, I plan to deprecate and ultimately remove the download counts available in the PyPI API, preferring that people start using the BigQuery data instead. This more or less reflects the current state of things, since it has been on and off broken (typically broken) for something like a year now.
> 
> I fully realize that if I really wanted this, I could do it myself, and the last thing you need is someone signing you up for more work :).  But, as someone who's been vaguely annoyed that `vanity` hasn't worked for a while, I wonder: shouldn't it be easy for someone familiar with both systems to simply implement the existing "download count" API as a legacy / compatibility wrapper around BigQuery?  If that isn't trivial, doesn't that point to something flawed in the way the data is presented in BigQuery?
> 
> That said, I'm fully OK with the answer that even a tiny bit of work is too much, and the limited volunteer effort of PyPI should be spent elsewhere.
> 
> -glyph
> 


It’s not hard at all, it’d just be (standard SQL mode):

    SELECT file.filename, COUNT(*) AS downloads FROM `the-psf.pypi.downloads*` WHERE file.project = "twisted" GROUP BY file.filename

You can probably guess how to handle modifications to this query since it’s roughly just regular old SQL.

There are a few reasons I don’t want to just do this in PyPI though.

This query will take somewhere between 30 and 60 seconds to complete, so I can’t do it inline with the HTTP request; I’d need to have a periodic job go through and issue about 100k queries (or a single query with almost a million results) and then load that into the database.
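
For illustration only, here is a rough sketch of what that periodic job might look like if it took the single-big-query route. The google-cloud-bigquery client calls are real, but the billing project, the destination table, and the database plumbing are made-up stand-ins, not anything PyPI actually runs:

    from google.cloud import bigquery

    # One query over the whole dataset, grouped per project/filename (the
    # "single query with almost a million results" variant).
    SQL = """
    SELECT file.project AS project, file.filename AS filename, COUNT(*) AS downloads
    FROM `the-psf.pypi.downloads*`
    GROUP BY file.project, file.filename
    """

    def refresh_download_counts(db_connection):
        # `db_connection` stands in for a DB-API connection to PyPI's own
        # database; the download_counts table and the PostgreSQL-style %s
        # placeholders are invented for the example.
        client = bigquery.Client(project="example-billing-project")  # assumed project
        rows = [(r.project, r.filename, r.downloads) for r in client.query(SQL)]
        cursor = db_connection.cursor()
        cursor.executemany(
            "INSERT INTO download_counts (project, filename, downloads)"
            " VALUES (%s, %s, %s)"
            " ON CONFLICT (project, filename) DO UPDATE SET downloads = EXCLUDED.downloads",
            rows,
        )
        db_connection.commit()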

More importantly though, we don’t have an unlimited amount of BigQuery on PyPI. We get blocks of credits granted periodically, so the faster we burn through that “spend”, the more regularly I have to track down my contacts inside of Google and get them to re-up the credit. That gives us an incentive to reduce our spending where we can, to limit both how often I have to ask for more credits and how much time that takes. Due to BigQuery’s billing model, you get billed based on how much data your query has to process, which means a query that fetches data for all time will be the most expensive kind of query, and it gets more expensive every day.

On the flip side, the BigQuery data is publicly queryable, and the account being used to run the query “pays” for it; every account gets 1TB of querying for free (and additional querying is $5 per TB). Currently it takes ~215GB of data to do a “full” query for twisted (the exact query I listed above), and I haven’t fully backfilled all of the data yet (I’m working on it). You can extrapolate that out to what it would “cost” to do that same query for all 100k projects, even before I finish the backfill (which will drastically raise that “cost” further).
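
To put a very rough, purely illustrative number on that extrapolation, taking the naive route of one all-time query per project at the $5/TB on-demand rate:

    # Back-of-envelope only: ~215 GB scanned per per-project "all time" query,
    # one such query per project, billed at $5 per TB processed.
    projects = 100 * 1000
    gb_per_query = 215
    tb_scanned = projects * gb_per_query / 1024.0   # roughly 21,000 TB
    dollars = tb_scanned * 5                        # roughly $105,000 per refresh

And even the single-query variant mentioned above re-scans the full history on every run, so that number only grows as more days of data land.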

The smart thing to do with BigQuery is date-limited querying, so that your query doesn’t have to scan as much data. For instance, adapting the above query so that it only covers the last 30 days (still using standard SQL), you would do:

    SELECT file.filename, COUNT(*) AS downloads FROM `the-psf.pypi.downloads*` WHERE file.project = "twisted" AND _TABLE_SUFFIX BETWEEN FORMAT_DATE("%Y%m%d", DATE_ADD(CURRENT_DATE(), INTERVAL -31 day)) AND FORMAT_DATE("%Y%m%d", DATE_ADD(CURRENT_DATE(), INTERVAL -1 day)) GROUP BY file.filename 

This touches a much more reasonable 27GB of data.

For reference, we currently “spend” about $50/month on BigQuery, so doing, say, daily updates of this data for everyone would be a drastic increase in the amount of BigQuery spending we do.

So the tl;dr is that I think the better solution is for vanity to talk to the BigQuery API itself, ideally limiting itself to a recent timeframe by default, and possibly adding a flag to get at the all-time data for people who are OK with either using vanity less often or spending a couple of bucks if they’re querying the full data set every day. Where Warehouse is starting to query BigQuery, I’m purposely limiting it to only the last N days (typically 30) so as not to regularly query the entire data set.
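
For what it’s worth, here is a minimal sketch of what a vanity-style tool talking to BigQuery directly might look like, assuming the google-cloud-bigquery package and a Google Cloud project of your own to bill the query against (the project name below is a placeholder, and none of this is anything vanity actually does today):

    from google.cloud import bigquery

    # Last-30-days download counts per file for one project, mirroring the
    # date-limited query above.
    QUERY = """
    SELECT file.filename AS filename, COUNT(*) AS downloads
    FROM `the-psf.pypi.downloads*`
    WHERE file.project = "twisted"
      AND _TABLE_SUFFIX BETWEEN
          FORMAT_DATE("%Y%m%d", DATE_ADD(CURRENT_DATE(), INTERVAL -31 day))
          AND FORMAT_DATE("%Y%m%d", DATE_ADD(CURRENT_DATE(), INTERVAL -1 day))
    GROUP BY filename
    ORDER BY downloads DESC
    """

    client = bigquery.Client(project="your-billing-project")  # placeholder project
    for row in client.query(QUERY):  # the query is billed to *your* project
        print("{0:>12,d}  {1}".format(row.downloads, row.filename))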

—
Donald Stufft


