BeautifulSoup doesn't work with a threaded input queue?

Peter Otten __peter__ at web.de
Sun Aug 27 17:45:27 EDT 2017


Christopher Reimer via Python-list wrote:

> On 8/27/2017 1:31 PM, Peter Otten wrote:
> 
>> Here's a simple example that extracts titles from generated html. It
>> seems to work. Does it resemble what you do?
> Your example is similar to my code when I'm using a list for the input
> to the parser. You have soup_threads and write_threads, but no
> read_threads.
> 
> The particular website I'm scraping requires checking each page for the
> sentinel value (i.e., "Sorry, no more comments") in order to determine
> when to stop requesting pages. 

Where's that check happening? If it's in the soup thread you need some kind 
of back channel to the read threads to tell them that no more pages are 
needed.
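
Something along these lines, perhaps. This is only a minimal sketch of such a
back channel using a threading.Event; fetch_page() and the page numbering are
made up for illustration, not taken from your code:

import queue
import threading

# Hypothetical back channel: the soup thread sets this event when it sees
# the sentinel page, and the read threads stop requesting further pages.
stop_fetching = threading.Event()
page_queue = queue.Queue()

def fetch_page(number):
    # Stand-in for the real HTTP request; here page 5 and up carry the sentinel.
    if number >= 5:
        return "<p>Sorry, no more comments</p>"
    return "<p>comments from page {}</p>".format(number)

def read_worker(numbers):
    # Producer: give up as soon as the back channel is set.
    for number in numbers:
        if stop_fetching.is_set():
            break
        page_queue.put(fetch_page(number))

def soup_worker():
    # Consumer: check every page for the sentinel before parsing it.
    while not (stop_fetching.is_set() and page_queue.empty()):
        try:
            html = page_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        if "Sorry, no more comments" in html:
            stop_fetching.set()  # tell the read threads they can stop
        else:
            print("parsed:", html)  # real code would feed html to BeautifulSoup

readers = [threading.Thread(target=read_worker, args=(range(i, 20, 4),))
           for i in range(4)]
souper = threading.Thread(target=soup_worker)
for t in readers + [souper]:
    t.start()
for t in readers:
    t.join()
souper.join()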
 
> For my comment history that's ~750 pages
> to parse ~11,000 comments.
> 
> I have 20 read_threads requesting and putting pages into the output
> queue that is the input_queue for the parser. My soup_threads can get
> items from the queue, but BeautifulSoup doesn't do anything after that.
> 
> Chris R.
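
With 20 read threads feeding the parser queue the soup threads also need to
know when the readers are finished, otherwise they block forever in get().
One common way to wire that up is to join the readers first and then put one
marker per soup thread on the queue. Again, just a rough sketch with invented
names (read_worker, soup_worker, NUM_READERS), not your actual code:

import queue
import threading

from bs4 import BeautifulSoup

page_queue = queue.Queue()
result_queue = queue.Queue()

NUM_READERS = 4   # 20 in your setup
NUM_SOUPERS = 2

def read_worker(numbers):
    # Stand-in for the real page fetching.
    for number in numbers:
        page_queue.put("<html><title>page {}</title></html>".format(number))

def soup_worker():
    while True:
        html = page_queue.get()
        if html is None:          # shutdown marker from the main thread
            break
        soup = BeautifulSoup(html, "html.parser")
        result_queue.put(soup.title.string)

readers = [threading.Thread(target=read_worker,
                            args=(range(i, 20, NUM_READERS),))
           for i in range(NUM_READERS)]
soupers = [threading.Thread(target=soup_worker) for _ in range(NUM_SOUPERS)]
for t in readers + soupers:
    t.start()
for t in readers:
    t.join()                      # all pages are now on the queue
for _ in soupers:
    page_queue.put(None)          # one marker per soup thread
for t in soupers:
    t.join()
print(result_queue.qsize(), "titles parsed")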




