Finding sentinel text when using a thread pool...

dieter dieter at handshake.de
Sat May 20 04:19:07 EDT 2017


Christopher Reimer <christopher_reimer at icloud.com> writes:
> I'm developing a web scraper script. It takes 25 minutes to process
> 590 pages and ~9,000 comments. I've been told that the script is
> taking too long.
>
> The way the script currently works is that the page requester is a
> generator function that requests a page, checks if the page contains
> the sentinel text (i.e., "Sorry, no more comments."), and either
> yields the page and requests the next page or exits the function. Every
> yielded page is parsed by Beautiful Soup and saved to disk.
>
> Timing the page requester separately from the rest of the script, with
> the end value set to 590, each page request takes 1.5 seconds.

That's very slow to fetch a page.

> If I use a thread pool of 16 threads, each request takes 0.1
> seconds. (Higher thread numbers will result in the server forcibly
> closing the connection.)
>
> I'm trying to figure out how I would find the sentinel text by using a
> thread pool. Seems like I need to request an arbitrary number of pages
> (perhaps one page per thread), evaluate the contents of each page for
> the sentinel text, and either request another set of pages or exit the
> function.

If your (590) pages are linked together (such that you must fetch
a page to learn where the following one is) and page fetching is the
limiting factor, then this limits how much the fetching can be
parallelized.
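If, on the other hand, the pages turn out to be addressable by number
(a 1..590 range suggests they might be), the sentinel check can be
combined with the thread pool by fetching a batch of pages per round and
stopping as soon as a page in the batch contains the sentinel. A minimal
sketch, with the fetch function left injectable (the real one would
issue the HTTP request):

```python
import concurrent.futures

SENTINEL = "Sorry, no more comments."  # sentinel text from the original post
BATCH = 16  # matches the 16-thread limit the server tolerates

def fetch_until_sentinel(fetch, batch=BATCH):
    """Yield (page_no, html) in order until a page contains the sentinel.

    `fetch` is any callable mapping a page number to that page's HTML;
    in the real script it would perform the HTTP request.
    """
    page_no = 1
    with concurrent.futures.ThreadPoolExecutor(max_workers=batch) as pool:
        while True:
            numbers = range(page_no, page_no + batch)
            # pool.map returns results in submission order, so pages
            # come back sorted even though they are fetched in parallel.
            for no, html in zip(numbers, pool.map(fetch, numbers)):
                if SENTINEL in html:
                    return  # fetched past the end; discard the extras
                yield no, html
            page_no += batch
```

The last round over-fetches up to batch-1 pages past the end, which is
the price of speculating in parallel instead of following links one by
one.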

If processing a fetched page takes a significant amount of time
(compared to the fetching itself), then you could use a work queue as
follows: a page is fetched and the following page determined; if a
following page exists, processing of the current page is put as a job
into the work queue and fetching continues with the next page. Idle
worker threads take jobs from the work queue and process them.
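A rough sketch of that arrangement, with the fetch / next-page /
processing steps left as injectable callables (the names here are
placeholders, not the original script's functions; processing would be
the Beautiful Soup parsing and saving to disk):

```python
import queue
import threading

def run_pipeline(fetch_page, next_page, process_page, first_page, workers=4):
    """Fetch pages sequentially, process them in parallel worker threads.

    fetch_page(ref)    -> page contents
    next_page(page)    -> reference to the following page, or None
    process_page(page) -> parse and save the page
    """
    jobs = queue.Queue()

    def worker():
        while True:
            page = jobs.get()
            if page is None:          # poison pill: no more work
                break
            process_page(page)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    ref = first_page
    while ref is not None:
        page = fetch_page(ref)
        ref = next_page(page)         # determine the following page first...
        jobs.put(page)                # ...then hand processing to a worker

    for _ in threads:                 # one poison pill per worker
        jobs.put(None)
    for t in threads:
        t.join()
```

Fetching stays sequential (respecting the page-to-page links), but the
expensive processing overlaps with it and with other processing jobs.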


More information about the Python-list mailing list