Finding sentinel text when using a thread pool...

Christopher Reimer christopher_reimer at yahoo.com
Sat May 20 13:20:05 EDT 2017


On 5/20/2017 1:19 AM, dieter wrote:

> If your (590) pages are linked together (such that you must fetch
> a page to get the following one) and page fetching is the limiting
> factor, then this would limit the parallelizability.

The pages are not linked together. The URL takes a page number. If I 
request 1000 pages in sequence, the first 60% have comments and the 
remaining 40% have the sentinel text. As more comments are added, the 
dividing line between the last page with the oldest comments and the 
first page with the sentinel text shifts over time. Since I changed the 
code to fetch 16 pages at a time, the run time dropped by nine minutes.
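
For illustration, here is a minimal sketch of that approach: a thread 
pool fetches 16 numbered pages at a time and stops at the first page 
containing the sentinel text. The URL pattern and sentinel string are 
placeholders, not the actual site:

import concurrent.futures
import urllib.request

URL_TEMPLATE = "https://example.com/comments?page={}"  # hypothetical URL pattern
SENTINEL = "No comments found"                         # hypothetical sentinel text

def fetch(page_number):
    """Fetch one numbered page and return (page_number, body)."""
    with urllib.request.urlopen(URL_TEMPLATE.format(page_number)) as resp:
        return page_number, resp.read().decode("utf-8")

def fetch_pages(batch_size=16):
    """Yield pages in order, a batch at a time, until the sentinel appears."""
    first = 1
    with concurrent.futures.ThreadPoolExecutor(max_workers=batch_size) as pool:
        while True:
            numbers = range(first, first + batch_size)
            # pool.map preserves order, so the first sentinel hit really is
            # the first page past the newest comments
            for number, body in pool.map(fetch, numbers):
                if SENTINEL in body:
                    return
                yield number, body
            first += batch_size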

> If processing a fetched page takes a significant amount of time
> (compared to the fetching), then you could use a work queue as follows:
> a page is fetched and the following page determined; if a following
> page exists, parsing the fetched page is put as a job into the work
> queue and fetching continues. Idle workers pull jobs from the work
> queue and process them.

I'm looking into that now. The requester class yields one page at a 
time. If I change the code to yield a list of 16 pages, I could parse 16 
pages at a time. That change would require a bit more work, but it would 
fix some problems that have been nagging me for a while about the parser 
class. A rough sketch of what that might look like is below.
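
In this sketch the requester's page stream and the parser entry point 
are hypothetical stand-ins: pages is any iterable of fetched pages, and 
parse is whatever method the parser class exposes.

import concurrent.futures

def batched(pages, batch_size=16):
    """Group the requester's one-page-at-a-time stream into lists of pages."""
    batch = []
    for page in pages:
        batch.append(page)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly short, batch

def parse_in_batches(pages, parse, batch_size=16):
    """Parse each batch of 16 pages in parallel, yielding results in order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=batch_size) as pool:
        for batch in batched(pages, batch_size):
            yield from pool.map(parse, batch)

The executor's internal queue plays the role of the work queue dieter 
describes: idle workers pull parse jobs as fetched pages come in.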

Thank you,

Chris Reimer


