[melbourne-pug] Joblib question

Mike Dewhirst miked at dewhirst.com.au
Sat Mar 10 01:03:59 EST 2018


On 10/03/2018 12:33 PM, paul sorenson wrote:
>
> Mike,
>
> Are there unique features of joblib that you need to use?
>

I was seduced by "Parallel". On reading the docs a little more 
diligently, it seems best suited to heavy compute-bound work such as 
scientific number crunching, plus disk-caching of results to prevent 
re-computing.
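For the archive: the disk-caching side of joblib is joblib.Memory. A 
minimal sketch, assuming joblib is installed (the cache directory and the 
crunch() function are illustrative, not from the original thread):

```python
import tempfile
from joblib import Memory

# Cache directory for memoized results; a temp dir here, a real path in practice
memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def crunch(x):
    # Stand-in for an expensive computation; a repeat call with the
    # same argument is served from the on-disk cache instead of re-run
    return x ** 2

print(crunch(3))  # computed and written to disk
print(crunch(3))  # loaded from the cache
```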

> Scraping web pages is often a good candidate for asyncio based models.
>

I think I'm being seduced by the "io" in the name. I do judge books by 
their covers, so I think I'll read up on asyncio.
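For anyone finding this in the archive, the asyncio shape Paul is 
suggesting might look like the sketch below; scrape() is a stand-in for 
the real fetch-and-parse work, with asyncio.sleep() imitating network wait:

```python
import asyncio

async def scrape(url):
    # Stand-in for a real HTTP fetch; awaiting lets other scrapes run meanwhile
    await asyncio.sleep(0.01)
    return f"scraped {url}"

async def main(urls):
    # gather() runs every scrape concurrently and returns results in input order
    return await asyncio.gather(*(scrape(u) for u in urls))

results = asyncio.run(main(["http://db1", "http://db2"]))
print(results)  # ['scraped http://db1', 'scraped http://db2']
```

The win here is concurrency during I/O waits rather than parallel CPU 
work, which is usually the right trade-off for a scraper.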

Thanks Paul

Mike
>
> cheers
>
>
> On 03/08/2018 11:41 PM, Mike Dewhirst wrote:
>> https://media.readthedocs.org/pdf/joblib/latest/joblib.pdf
>>
>> I'm trying to make the following code run in parallel on separate CPU 
>> cores but haven't had any success.
>>
>> def make_links(self):
>>     for db in databases:
>>         link = create_useful_link(self, Link, db)
>>         if link:
>>             scrape_db(self, link, db)
>>
>> This is a web scraper which is working nicely in a leisurely 
>> sequential manner.  databases is a list of urls with gaps to be 
>> filled by create_useful_link() which makes a link record from the 
>> Link class. The self instance is a chemical substance and supplies 
>> the attributes that fill the url gaps. Clicking the resulting link 
>> record's url field in a browser brings up the external website with 
>> that substance pre-selected for the viewer to research. 
>> If successful, we then fetch the external page and scrape a bunch of 
>> interesting data from it and turn that into substance notes. 
>> scrape_db() doesn't return anything but it does create up to nine 
>> other records.
>>
>>          from joblib import Parallel, delayed
>>
>>          class Substance( etc ..
>>              ...
>>              def make_links(self):
>>                  #Parallel(n_jobs=-2)(delayed(
>>                  #    scrape_db(self, create_useful_link(self, Link, db), db) for db in databases
>>                  #))
>> I'm getting a TypeError from Parallel delayed() - can't pickle 
>> generator objects
>>
>> So my question is how to write the commented code properly? I suspect 
>> I haven't done enough comprehension.
>>
>> Thanks for any help
>>
>> Mike
>>
>>
>> _______________________________________________
>> melbourne-pug mailing list
>> melbourne-pug at python.org
>> https://mail.python.org/mailman/listinfo/melbourne-pug
>
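For completeness in the archive: the TypeError arises because delayed() 
was handed the evaluated result of scrape_db(...) wrapped in a generator, 
rather than the function itself. joblib's pattern is delayed(func)(args...) 
for each work item, inside the generator passed to Parallel. A minimal 
sketch of that pattern with stand-in versions of the helpers (the real 
ones take the Substance instance and Link class):

```python
from joblib import Parallel, delayed

def create_useful_link(name, db):
    # Stand-in for the real helper: fill the db's url gaps with the substance name
    return f"https://{db}/search?q={name}"

def scrape_db(name, link, db):
    # Stand-in scraper: return a value here so the parallel results are visible
    return (db, link)

databases = ["db-one.example", "db-two.example"]
name = "caffeine"

# delayed(f) wraps f so that calling the wrapper records (f, args, kwargs)
# instead of running f; joblib pickles that and executes it in a worker.
# create_useful_link() still runs eagerly in the parent while building args.
results = Parallel(n_jobs=2)(
    delayed(scrape_db)(name, create_useful_link(name, db), db)
    for db in databases
)
print(results)
```

Whether worker processes can pickle a Django-model self instance is a 
separate question; threads (prefer="threads") may be the easier fit for 
I/O-bound scraping if joblib is kept at all.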


