[Distutils] [Python-ideas] PyPI search still broken

Dan Poirier dpoirier at caktusgroup.com
Thu Sep 10 15:57:51 CEST 2015


Just curious, are we re-indexing the whole thing each time, or does it take
40 minutes to update the index for 3 hours' worth of changes?


*Dan Poirier*, Developer

dpoirier at caktusgroup.com
www.caktusgroup.com

On Thu, Sep 10, 2015 at 9:31 AM, Donald Stufft <donald at stufft.io> wrote:

> On September 10, 2015 at 8:48:05 AM, David Wilson (dw+python-ideas at hmmz.org) wrote:
> > On Thu, Sep 10, 2015 at 03:07:14PM +0300, Ionel Cristian Mărieș wrote:
> >
> > > Wouldn't it be better if you'd just build an external search service?
> > > Getting a list of packages and descriptions should be possible no?
> > > (just asking, not 100% sure)
> >
> > That would be the idea. In fact preferably not build a service at all,
> > just pay someone $50/mo for hosted ElasticSearch, rip out the guts of
> > the old thing and write a small sync cron job similar to the one
> > existing in the Bitbucket repo I linked.
> >
> >
>
> The old PostgreSQL-based system has been gone for a while, and we already
> have ElasticSearch with a small cron job that runs every 3 hours to index
> the data.
>
> When we moved the database to Heroku, this cron job started taking 6+ hours
> to complete, because we were fetching data in chunks that were too small,
> which didn't actually hurt when the script and the database were running
> close to each other. That got "fixed" a day or two ago by increasing the
> size of the chunks we pulled from 1000 to 10000 and by switching to a
> SERIALIZABLE READ ONLY DEFERRABLE transaction, so that we only need to hold
> open a lock right at the very beginning. The job now finishes in 40
> minutes. I suspect further enhancements to the indexing speed will require
> locating the script in EC2 to get it closer to the PostgreSQL instance.
>
> Given that these problems seem to be *new* since the move of the database
> to Heroku, I don't think the shape of our data in Elasticsearch or the
> actual query we're using (which hasn't changed) is at fault, so I've been
> trying to figure out what else we might have changed in the transition
> that would have caused it.
>
> -----------------
> Donald Stufft
> PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>
>
> _______________________________________________
> Distutils-SIG maillist  -  Distutils-SIG at python.org
> https://mail.python.org/mailman/listinfo/distutils-sig
>
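
For anyone following along, here is a rough sketch of why the chunk size
mentioned in the quoted message matters. It is not the actual PyPI indexing
script: the table, the column names, and the helper functions are made up
for illustration, and the demo runs against an in-memory SQLite database so
it is self-contained. The assumption is that, with a server-side cursor over
a remote PostgreSQL, each fetch costs roughly one network round trip, so
going from 1000 to 10000 rows per fetch cuts the number of round trips by
about a factor of ten.

```python
import sqlite3
from typing import Iterator, List, Tuple

# PostgreSQL statement for the transaction mode named in the quoted message.
# It is only kept as a string here, because the demo below uses SQLite.
PG_BEGIN = "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE"


def iter_chunks(cursor, chunk_size: int) -> Iterator[List[tuple]]:
    """Yield lists of up to chunk_size rows from an already executed cursor."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


def count_fetches(conn, query: str, chunk_size: int) -> Tuple[int, int]:
    """Return (number of non-empty fetches, total rows) for one full scan."""
    cursor = conn.cursor()
    cursor.execute(query)
    fetches = 0
    total_rows = 0
    for rows in iter_chunks(cursor, chunk_size):
        fetches += 1
        total_rows += len(rows)
    return fetches, total_rows


def demo() -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Scan a hypothetical 25000-row table with both chunk sizes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE packages (name TEXT, summary TEXT)")
    conn.executemany(
        "INSERT INTO packages VALUES (?, ?)",
        ((f"pkg{i}", "demo summary") for i in range(25000)),
    )
    query = "SELECT name, summary FROM packages"
    small = count_fetches(conn, query, 1000)
    large = count_fetches(conn, query, 10000)
    conn.close()
    return small, large


if __name__ == "__main__":
    print(demo())
```

Against a real PostgreSQL instance, one way to get the same transaction mode
from psycopg2 would be conn.set_session(isolation_level="SERIALIZABLE",
readonly=True, deferrable=True) before the first query, combined with a named
(server-side) cursor so that rows really are pulled chunk by chunk.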

