BeautifulSoup doesn't work with a threaded input queue?
Christopher Reimer
christopher_reimer at yahoo.com
Sun Aug 27 17:14:27 EDT 2017
On 8/27/2017 1:31 PM, Peter Otten wrote:
> Here's a simple example that extracts titles from generated html. It seems
> to work. Does it resemble what you do?
Your example is similar to my code when I'm using a list for the input
to the parser. You have soup_threads and write_threads, but no read_threads.
The particular website I'm scraping requires checking each page for the
sentinel value (i.e., "Sorry, no more comments") in order to determine
when to stop requesting pages. For my comment history, that's ~750 pages
to parse for ~11,000 comments.
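The stop-on-sentinel check described above can be sketched roughly as
follows. This is not the original code: fetch_page() and its last_page
parameter are hypothetical stand-ins for the real HTTP request, and the
HTML it returns is made up for illustration. Only the sentinel text
comes from the site.

```python
SENTINEL = "Sorry, no more comments"


def fetch_page(page_number, last_page=3):
    """Hypothetical stand-in for the real HTTP request."""
    if page_number > last_page:
        return "<html><body><p>%s</p></body></html>" % SENTINEL
    return ('<html><body><p class="comment">comment %d</p></body></html>'
            % page_number)


def fetch_all_pages():
    """Request pages in order until one contains the sentinel."""
    pages = []
    page_number = 1
    while True:
        html = fetch_page(page_number)
        if SENTINEL in html:
            break  # the sentinel page itself is not kept
        pages.append(html)
        page_number += 1
    return pages
```

Because the last page number is not known in advance, every page has to
be inspected before deciding whether to request the next one.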
I have 20 read_threads requesting pages and putting them into an output
queue, which serves as the input_queue for the parser. My soup_threads
can get items from the queue, but BeautifulSoup doesn't do anything
after that.
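A minimal, self-contained sketch of that reader/parser arrangement is
below. It is an illustration under stated assumptions, not the original
code: fetch_page(), LAST_PAGE, read_worker(), parse_worker() and run()
are hypothetical names, and the standard library's html.parser stands in
for BeautifulSoup so the example runs without third-party packages.

```python
import itertools
import queue
import threading
from html.parser import HTMLParser

SENTINEL = "Sorry, no more comments"
LAST_PAGE = 25  # hypothetical; the real site only reveals this via the sentinel


def fetch_page(page_number):
    """Hypothetical stand-in for the real HTTP request."""
    if page_number > LAST_PAGE:
        return "<html><body><p>%s</p></body></html>" % SENTINEL
    return ('<html><body><p class="comment">comment %d</p></body></html>'
            % page_number)


class CommentParser(HTMLParser):
    """Collect the text of every <p class="comment"> element."""

    def __init__(self):
        super().__init__()
        self.in_comment = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "comment") in attrs:
            self.in_comment = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_comment = False

    def handle_data(self, data):
        if self.in_comment:
            self.comments.append(data)


def read_worker(page_numbers, numbers_lock, stop, page_queue):
    """Fetch pages until any reader has seen the sentinel."""
    while not stop.is_set():
        with numbers_lock:  # hand out each page number exactly once
            page_number = next(page_numbers)
        html = fetch_page(page_number)
        if SENTINEL in html:
            stop.set()
        else:
            page_queue.put(html)


def parse_worker(page_queue, results, results_lock):
    """Parse queued pages until a None end marker arrives."""
    while True:
        html = page_queue.get()
        if html is None:
            break
        parser = CommentParser()
        parser.feed(html)
        parser.close()
        with results_lock:
            results.extend(parser.comments)


def run(num_readers=20, num_parsers=4):
    page_queue = queue.Queue()
    page_numbers = itertools.count(1)
    numbers_lock = threading.Lock()
    results_lock = threading.Lock()
    stop = threading.Event()
    results = []
    readers = [threading.Thread(target=read_worker,
                                args=(page_numbers, numbers_lock, stop, page_queue))
               for _ in range(num_readers)]
    parsers = [threading.Thread(target=parse_worker,
                                args=(page_queue, results, results_lock))
               for _ in range(num_parsers)]
    for thread in readers + parsers:
        thread.start()
    for thread in readers:
        thread.join()
    for _ in parsers:  # one end marker per parser thread
        page_queue.put(None)
    for thread in parsers:
        thread.join()
    return results
```

Calling run() returns the comment texts in no particular order. A few
readers may request pages past the last one before the stop event is
set; those pages contain the sentinel and are discarded.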
Chris R.