[melbourne-pug] Joblib question

Mike Dewhirst miked at climate.com.au
Sat Mar 10 01:13:33 EST 2018


I've run the process a couple of times and there doesn't seem to be an 
appreciable difference. Both methods take enough time to boil the 
kettle. I know that isn't proper testing, and it might be difficult to 
time accurately when we are waiting on websites all over the world to 
respond. I might set up a long-running test to try to smooth out the 
differences.
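
If I do, a rough sketch of what I'd run is below (untested; the two 
make_links variants are assumed to exist):

    import time

    def time_run(fn, repeats=5):
        # average wall-clock time over several runs to smooth out
        # network variability
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            times.append(time.perf_counter() - start)
        return sum(times) / len(times)

    # e.g. time_run(substance.make_links_sequential) vs
    #      time_run(substance.make_links_parallel)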

M

On 10/03/2018 5:04 PM, Mike Dewhirst wrote:
> On 9/03/2018 7:30 PM, Alejandro Dubrovsky wrote:
>> delayed is a decorator, so it takes a function or a method. You are 
>> passing it a generator instead.
>>
>> def make_links(self):
>>     Parallel(n_jobs=-2)(
>>         delayed(scrape_db)(self, create_useful_link(self, Link, db), db)
>>         for db in databases
>>     )
>>
>> should work, 
>
> Yes it does :) Thank you Alejandro
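>
> For anyone following along, the pattern is delayed(f)(args): delayed 
> wraps the function itself and you call the wrapper with the 
> arguments, so the generator yields picklable (function, args) pairs 
> rather than being passed to delayed. A minimal sketch with a 
> placeholder function:
>
>     from joblib import Parallel, delayed
>
>     def square(x):
>         return x * x
>
>     # correct: the generator yields delayed calls for Parallel
>     # to dispatch
>     results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))
>
>     # wrong: this hands delayed a single generator object, which
>     # joblib then fails to pickle
>     # Parallel(n_jobs=2)(delayed(square(i) for i in range(10)))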
>
>> but it will only parallelise over the scrape_db calls, not the 
>> create_useful_link calls I think. Which of the two do you want to 
>> parallelise over? Or were you after parallelising both?
>
> I think I probably want to use Celery (thanks Ed for the suggestion) 
> or similar so I can loop through (currently) nine databases and kick 
> off a scrape_db() task for each. Then each scrape_db task looks for 
> (currently) ten data items of specific interest. Having scraped a 
> data item, we need to get_or_create (this is in Django) the specific 
> data note and append the result to whatever is already there.
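>
> Roughly what I have in mind (an untested sketch; the task name is a 
> placeholder, and passing the primary key rather than the instance is 
> my guess at keeping the task arguments serialisable):
>
>     from celery import shared_task
>
>     @shared_task
>     def scrape_db_task(substance_pk, db):
>         substance = Substance.objects.get(pk=substance_pk)
>         link = create_useful_link(substance, Link, db)
>         if link:
>             scrape_db(substance, link, db)
>
>     def make_links(self):
>         # queue one task per database; the broker fans them out
>         for db in databases:
>             scrape_db_task.delay(self.pk, db)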
>
> That data note update might be a bottleneck once more than one 
> scrape_db task is retrieving the same data item in parallel, say 
> aqueous solubility. We want aqueous solubility from all databases in 
> the same note so the user can easily compare the different values and 
> decide which one to use.
>
> So parallelising everything might eventually be somewhat problematic; 
> it all has to squeeze through Postgres atomic transactions right at 
> the end. I suppose this is a perfect example of an IO-bound task.
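>
> If the shared note does become contentious, row locking inside the 
> transaction might be enough. A sketch (the Note model and its fields 
> are invented for illustration):
>
>     from django.db import transaction
>
>     def add_result(substance, topic, db_name, value):
>         with transaction.atomic():
>             # lock the row so parallel scrape_db tasks append one
>             # at a time instead of clobbering each other
>             note, _ = Note.objects.select_for_update().get_or_create(
>                 substance=substance, topic=topic)
>             note.text += "\n%s: %s" % (db_name, value)
>             note.save()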
>
> Also worth noting: the app is (currently) all server side. I'm not 
> (yet) using AJAX to update the screen as the data becomes available.
>
> Cheers
>
> Mike
>
>>
>> On 09/03/18 18:41, Mike Dewhirst wrote:
>>> https://media.readthedocs.org/pdf/joblib/latest/joblib.pdf
>>>
>>> I'm trying to make the following code run in parallel on separate 
>>> CPU cores but haven't had any success.
>>>
>>> def make_links(self):
>>>     for db in databases:
>>>         link = create_useful_link(self, Link, db)
>>>         if link:
>>>             scrape_db(self, link, db)
>>>
>>> This is a web scraper which is working nicely in a leisurely 
>>> sequential manner. databases is a list of urls with gaps to be 
>>> filled by create_useful_link(), which makes a link record from the 
>>> Link class. The self instance is a source of attributes for filling 
>>> the url gaps: self is a chemical substance, and the link record's 
>>> url field, when clicked in a browser, brings up that external 
>>> website with the chemical substance selected for the viewer to 
>>> research. If successful, we then fetch the external page, scrape a 
>>> bunch of interesting data from it and turn that into substance 
>>> notes. scrape_db() doesn't return anything but it does create up to 
>>> nine other records.
>>>
>>>     from joblib import Parallel, delayed
>>>
>>>     class Substance( etc ..
>>>         ...
>>>         def make_links(self):
>>>             #Parallel(n_jobs=-2)(delayed(
>>>             #    scrape_db(self, create_useful_link(self, Link, db), db)
>>>             #    for db in databases
>>>             #))
>>>
>>> I'm getting a TypeError from Parallel delayed() - can't pickle 
>>> generator objects
>>>
>>> So my question is: how do I write the commented code properly? I 
>>> suspect I haven't done enough comprehension.
>>>
>>> Thanks for any help
>>>
>>> Mike
>>>
>


-- 

Climate Pty Ltd
PO Box 308
Mount Eliza
Vic 3930
Australia

T: +61 3 9034 3977
M: +61 411 704 143



