Finding sentinel text when using a thread pool...

Christopher Reimer christopher_reimer at icloud.com
Fri May 19 01:51:50 EDT 2017


Greetings,

I'm developing a web scraper script. It takes 25 minutes to process 590 
pages and ~9,000 comments. I've been told that the script is taking too 
long.

The way the script currently works is that the page requester is a 
generator function that requests a page, checks if the page contains the 
sentinel text (i.e., "Sorry, no more comments."), and either yields the 
page and requests the next page, or exits the function. Every yielded page 
is parsed by Beautiful Soup and saved to disk.
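
Stripped down to the essentials (names and URL pattern simplified here, not 
my actual code), the logic is roughly:

    import requests
    from bs4 import BeautifulSoup

    SENTINEL = "Sorry, no more comments."

    def get_pages(base_url):
        page = 1
        while True:
            html = requests.get(base_url.format(page)).text
            if SENTINEL in html:
                return                  # no more comments, stop requesting
            yield html                  # hand the page on, then fetch the next
            page += 1

    def scrape(base_url):
        for html in get_pages(base_url):
            soup = BeautifulSoup(html, "html.parser")
            # ... pull the comments out of soup and save them to disk ...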

Timing the page requester separately from the rest of the script, with the 
end value set to 590, each page request takes 1.5 seconds.

If I use a thread pool of 16 threads, each request takes 0.1 seconds. 
(Higher thread numbers will result in the server forcibly closing the 
connection.)
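
The threaded timing test is essentially the pool.map() pattern from the 
article linked below, something like this (the URL pattern is made up):

    from multiprocessing.dummy import Pool as ThreadPool
    import requests

    urls = ["http://example.com/comments?page={}".format(n)
            for n in range(1, 591)]

    pool = ThreadPool(16)               # 16 worker threads
    pages = pool.map(lambda u: requests.get(u).text, urls)
    pool.close()
    pool.join()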

I'm trying to figure out how I would find the sentinel text by using a 
thread pool. Seems like I need to request an arbitrary number of pages 
(perhaps one page per thread), evaluate the contents of each page for 
the sentinel text, and either request another set of pages or exit the 
function.
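
Something along these lines (untested sketch; the batch size, URL pattern 
and names are just guesses):

    from multiprocessing.dummy import Pool as ThreadPool
    import requests

    SENTINEL = "Sorry, no more comments."

    def get_pages_threaded(base_url, batch_size=16):
        pool = ThreadPool(batch_size)
        page = 1
        while True:
            # Request one batch of pages, one page per worker thread.
            urls = [base_url.format(n)
                    for n in range(page, page + batch_size)]
            batch = pool.map(lambda u: requests.get(u).text, urls)
            for html in batch:
                if SENTINEL in html:    # reached the end of the comments
                    pool.close()
                    pool.join()
                    return
                yield html              # real page, hand it on for parsing
            page += batch_size

The obvious downside is that the final batch wastes up to 15 requests past 
the last real page, but at roughly 0.1 seconds per request that seems like 
an acceptable cost.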

Is that the most efficient approach for using a thread pool?

I'm basing my thread pool code on the example in this article:

http://chriskiehl.com/article/parallelism-in-one-line/

Thank you,

Chris Reimer




