[melbourne-pug] Joblib question
Mike Dewhirst
miked at climate.com.au
Sat Mar 10 01:13:33 EST 2018
I've run the process a couple of times and there doesn't seem to be an
appreciable difference. Both methods take enough time to boil the
kettle. I know that isn't proper testing. It might be difficult to time
accurately when we are waiting on websites all over the world to
respond. I might set up a long-running test to try to smooth out the
differences.
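A rough way to smooth out network jitter without a full test rig is to time several repeats and compare medians rather than single runs. A minimal sketch; `time_runs` and the placeholder workload are invented for illustration, not code from the app:

```python
import statistics
import time

def time_runs(func, repeats=5):
    """Run func several times and return the median elapsed seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Placeholder workload standing in for the sequential or parallel scraper.
def sequential_scrape():
    time.sleep(0.01)

median = time_runs(sequential_scrape)
```

The median is less sensitive than the mean to one slow remote site skewing a run.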
M
On 10/03/2018 5:04 PM, Mike Dewhirst wrote:
> On 9/03/2018 7:30 PM, Alejandro Dubrovsky wrote:
>> delayed is a decorator, so it takes a function or a method. You are
>> passing it a generator instead.
>>
>> def make_links(self):
>>     Parallel(n_jobs=-2)(
>>         delayed(scrape_db)(self, create_useful_link(self, Link, db), db)
>>         for db in databases
>>     )
>>
>> should work,
>
> Yes it does :) Thank you Alejandro
>
>> but it will only parallelise over the scrape_db calls, not the
>> create_useful_link calls I think. Which of the two do you want to
>> parallelise over? Or were you after parallelising both?
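For what it's worth, one way to parallelise both steps is to wrap the pair in a single plain function and hand that to delayed, so the link creation also runs inside each worker. A minimal self-contained sketch; the helpers and the databases list here are toy stand-ins for the names in the thread, not the real app code:

```python
from joblib import Parallel, delayed

# Toy stand-ins for the thread's helpers and data.
databases = ["db-a", "db-b", "db-c"]

def create_useful_link(substance, db):
    return f"https://example.org/{db}/{substance}"

def scrape_db(substance, link, db):
    return (db, link)

def link_and_scrape(substance, db):
    # Both the link creation and the scrape run inside the worker,
    # so each database gets its own process rather than only the scrape.
    link = create_useful_link(substance, db)
    if link:
        return scrape_db(substance, link, db)

results = Parallel(n_jobs=2)(
    delayed(link_and_scrape)("caffeine", db) for db in databases
)
```

In the one-liner above, create_useful_link runs eagerly in the parent process while the argument tuple is being built; wrapping both calls moves it into the worker.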
>
> I think I probably want to use Celery (thanks Ed for the suggestion)
> or similar so I can loop through (currently) nine databases and kick
> off a scrape_db() task for each. Then each scrape_db task looks for
> (currently) ten data items of specific interest. Having scraped a data
> item, we need to get_or_create (this is Django) the specific data
> note and append the result to whatever is already there.
>
> That data note update might be a bottleneck with more than one
> scrape_db task in parallel retrieving the same data item; say aqueous
> solubility. We want aqueous solubility from all databases in the same
> note so the user can easily compare different values and decide which
> value to use.
>
> So parallelising everything might eventually be somewhat problematic.
> It all has to squeeze through Postgres atomic transactions right at
> the end. I suppose this is a perfect example of an IO bound task.
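That serialisation point can at least be kept safe by funnelling every get-or-create-and-append through one short transaction. A minimal stdlib sketch with sqlite3 standing in for Postgres/Django; the table and column names are invented for illustration, not the app's models:

```python
import sqlite3

def append_note(conn, substance, prop, value):
    """Get-or-create the note row for (substance, prop), then append value.

    The whole read-modify-write runs inside one transaction, so concurrent
    scrape_db workers cannot clobber each other's updates.
    """
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, body FROM notes WHERE substance=? AND prop=?",
            (substance, prop)).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO notes (substance, prop, body) VALUES (?, ?, ?)",
                (substance, prop, value))
        else:
            conn.execute(
                "UPDATE notes SET body=? WHERE id=?",
                (row[1] + "\n" + value, row[0]))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes (id INTEGER PRIMARY KEY,"
    " substance TEXT, prop TEXT, body TEXT)")
# Two scrapes of the same property end up appended to one shared note.
append_note(conn, "caffeine", "aqueous solubility", "db-a: 21.6 g/L")
append_note(conn, "caffeine", "aqueous solubility", "db-b: 22 g/L")
```

In Django terms the equivalent shape would be get_or_create plus the update wrapped in transaction.atomic(), with the row locked for the duration.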
>
> Also, another thing is that the app is (currently) all server side.
> I'm not (yet) using AJAX to update the screen when the data becomes
> available.
>
> Cheers
>
> Mike
>
>>
>> On 09/03/18 18:41, Mike Dewhirst wrote:
>>> https://media.readthedocs.org/pdf/joblib/latest/joblib.pdf
>>>
>>> I'm trying to make the following code run in parallel on separate
>>> CPU cores but haven't had any success.
>>>
>>> def make_links(self):
>>>     for db in databases:
>>>         link = create_useful_link(self, Link, db)
>>>         if link:
>>>             scrape_db(self, link, db)
>>>
>>> This is a web scraper which is working nicely in a leisurely
>>> sequential manner. databases is a list of urls with gaps to be
>>> filled by create_useful_link() which makes a link record from the
>>> Link class. The self instance is a source of attributes for filling
>>> the url gaps. self is a chemical substance, and clicking the link
>>> record's url field in a browser brings up that external website with
>>> the chemical substance already selected for the viewer to research.
>>> If successful, we then fetch the external page and scrape a bunch of
>>> interesting data from it and turn that into substance notes.
>>> scrape_db() doesn't return anything but it does create up to nine
>>> other records.
>>>
>>> from joblib import Parallel, delayed
>>>
>>> class Substance( etc ..
>>> ...
>>> def make_links(self):
>>>     #Parallel(n_jobs=-2)(delayed(
>>>     #    scrape_db(self, create_useful_link(self, Link, db), db)
>>>     #    for db in databases
>>>     #))
>>>
>>> I'm getting a TypeError from Parallel delayed() - can't pickle
>>> generator objects
>>>
>>> So my question is how to write the commented code properly? I
>>> suspect I haven't done enough comprehension.
>>>
>>> Thanks for any help
>>>
>>> Mike
>>>
>>>
>>> _______________________________________________
>>> melbourne-pug mailing list
>>> melbourne-pug at python.org
>>> https://mail.python.org/mailman/listinfo/melbourne-pug
>>>
>>
>
--
Climate Pty Ltd
PO Box 308
Mount Eliza
Vic 3930
Australia
T: +61 3 9034 3977
M: 0411 704 143