BeautifulSoup doesn't work with a threaded input queue?

Peter Otten __peter__ at web.de
Sun Aug 27 16:31:35 EDT 2017


Christopher Reimer via Python-list wrote:

> On 8/27/2017 11:54 AM, Peter Otten wrote:
> 
>> The documentation
>>
>> https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
>>
>> says you can make the BeautifulSoup object from a string or file.
>> Can you give a few more details where the queue comes into play? A small
>> code sample would be ideal.
> 
> A worker thread uses a request object to get the page and puts it into
> queue as page.content (HTML).  Another worker thread gets the
> page.content from the queue to apply BeautifulSoup and nothing happens.
> 
> soup = BeautifulSoup(page_content, 'lxml')
> print(soup)
> 
> No output whatsoever. If I remove 'lxml', I get the UserWarning that no
> parser was explicitly set, along with a reference to threading.py at
> line 80.
> 
> I verified that page.content that goes into and out of the queue is the
> same page.content that goes into and out of a list.
> 
> I read somewhere that BeautifulSoup may not be thread-safe. I've never
> had a problem with threads storing their output into a queue. But using
> a queue (random order) instead of a list (sequential order) to feed
> pages in as input is making it wonky.
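As an aside, the UserWarning just means bs4 picked a parser for you; passing one explicitly makes it go away. A minimal sketch (using the stdlib "html.parser" so it runs even without lxml installed):

```python
import bs4

html = "<html><head><title>demo</title></head></html>"
# Naming the parser explicitly silences the "no parser was
# explicitly specified" UserWarning.
soup = bs4.BeautifulSoup(html, "html.parser")
print(soup.title.text)  # demo
```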

Here's a simple example that extracts titles from generated html. It seems 
to work. Does it resemble what you do?

import csv
import threading
import time
from queue import Queue

import bs4


def process_html(source, dest, index):
    while True:
        html = source.get()
        if html is DONE:
            dest.put(DONE)
            break
        soup = bs4.BeautifulSoup(html, "lxml")
        dest.put(soup.find("title").text)


def write_csv(source, filename, to_go):
    # newline="" is what the csv module expects; without it you get
    # spurious blank rows on Windows.
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        while True:
            title = source.get()
            if title is DONE:
                to_go -= 1
                if not to_go:
                    return
            else:
                writer.writerow([title])

NUM_SOUP_THREADS = 10
DONE = object()

web_to_soup = Queue()
soup_to_file = Queue()

soup_threads = [
    threading.Thread(target=process_html, args=(web_to_soup, soup_to_file, i))
    for i in range(NUM_SOUP_THREADS)
]

write_thread = threading.Thread(
    target=write_csv, args=(soup_to_file, "tmp.csv", NUM_SOUP_THREADS),
)

write_thread.start()

for thread in soup_threads:
    thread.start()

for i in range(100):
    web_to_soup.put("<html><head><title>#{}</title></head></html>".format(i))
for i in range(NUM_SOUP_THREADS):
    web_to_soup.put(DONE)

for t in soup_threads:
    t.join()
write_thread.join()
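For what it's worth, a string comes out of a Queue exactly as it went in, and BeautifulSoup parses it the same either way, so the queue itself shouldn't be the culprit. A quick check (again with the stdlib "html.parser" to avoid the lxml dependency):

```python
from queue import Queue

import bs4

html = "<html><head><title>demo</title></head></html>"

# Parse the string directly...
direct = bs4.BeautifulSoup(html, "html.parser").title.text

# ...and after a round trip through a Queue.
q = Queue()
q.put(html)
via_queue = bs4.BeautifulSoup(q.get(), "html.parser").title.text

assert direct == via_queue == "demo"
```

If your version of this check passes but the threaded parse still produces nothing, I'd look at what the consumer thread actually receives from the queue (type and length) rather than at BeautifulSoup.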