Is there a more efficient threading lock?

Skip Montanaro skip.montanaro at gmail.com
Sat Feb 25 16:41:52 EST 2023


Thanks for the responses.

Peter wrote:

> Which OS is this?

MacOS Ventura 13.1, M1 MacBook Pro (eight cores).

Thomas wrote:

> I'm no expert on locks, but you don't usually want to keep a lock while
> some long-running computation goes on.  You want the computation to be
> done by a separate thread, put its results somewhere, and then notify
> the choreographing thread that the result is ready.

In this case I'm extracting the noun phrases from the body of an email
message (returned as a list). I have a collection of email messages
organized by month (typically 1000 to 3000 messages per month). I'm using
concurrent.futures.ThreadPoolExecutor() with the default number of workers
(min(32, os.cpu_count() + 4), or 12 threads on my system) to process each
month, so 12 active threads at a time. Given that the process is pretty much
CPU-bound, maybe reducing the number of workers to the CPU count would make
sense. Processing of each email message enters that with block once, which
is about as minimal as I can make it. I thought for a bit about pushing the
textblob work into a separate worker thread, but it wasn't obvious how to
set up queues to handle the communication between the threads created by
ThreadPoolExecutor() and that worker thread. Maybe I'll think about it
harder. (I have a related problem with SQLite, since a sqlite3 connection
can't be used from multiple threads by default. That makes much of the
program's end-of-run processing single-threaded.)
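
If I do push the textblob work into a dedicated worker thread, I imagine
it would look something like the sketch below (untested, names invented,
not my actual code): a single worker thread owns the ConllExtractor, and
the pool threads hand it text through a queue.Queue and wait on a Future
for the result.

import queue
import threading
from concurrent.futures import Future

from textblob import TextBlob
from textblob.np_extractors import ConllExtractor

work_q = queue.Queue()
_DONE = object()          # sentinel to shut the worker down

def phrase_worker():
    # Only this thread touches the extractor, so no lock is needed.
    ext = ConllExtractor()
    while True:
        item = work_q.get()
        if item is _DONE:
            break
        text, fut = item
        try:
            fut.set_result(TextBlob(text, np_extractor=ext).noun_phrases)
        except Exception as exc:
            fut.set_exception(exc)

def noun_phrases(text):
    # Called from the pool threads; blocks until the worker thread has
    # produced the phrases for this text.
    fut = Future()
    work_q.put((text, fut))
    return fut.result()

threading.Thread(target=phrase_worker, daemon=True).start()

# The ThreadPoolExecutor threads then call noun_phrases(body) and carry on
# with the rest of their per-message work, e.g.:
#
#     with concurrent.futures.ThreadPoolExecutor() as executor:
#         executor.map(process_message, messages)
#     work_q.put(_DONE)

Of course that still funnels all the extraction through one thread, so it's
not obviously any better than the lock; the main attraction is that the
same worker could own the SQLite connection and solve that problem too.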

> This link may be helpful -
>
> https://anandology.com/blog/using-iterators-and-generators/

I don't think that's where my problem is. The lock protects the generation
of the noun phrases; the loop which does the yielding operates outside the
lock's control. The code I posted is my latest version, in which I tossed
out a bunch of phrase-processing code (effectively dead-end ideas for
processing the phrases). Replacing the for loop with a simple return seems
to have no effect. In any case, the caller does a fair amount of extra work
with the phrases, populating a SQLite database, so I don't think the time
it takes to process a single email message is dominated by the phrase
generation.
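
To be concrete, the shape of that code is roughly this (a simplified
sketch, not a verbatim paste of what I'm running):

import threading

from textblob import TextBlob
from textblob.np_extractors import ConllExtractor

LOCK = threading.Lock()
EXTRACTOR = ConllExtractor()

def noun_phrases(text):
    # The lock is held only while TextBlob generates the phrases ...
    with LOCK:
        phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
    # ... the yielding happens after the lock has been released.
    for phrase in phrases:
        yield phrase
    # Replacing the loop with a plain "return phrases" (and dropping the
    # generator) makes no measurable difference.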

Here's timeit output for the noun_phrases code:

% python -m timeit \
    -s 'text = """`python -m timeit --help`"""; from textblob import TextBlob; from textblob.np_extractors import ConllExtractor; ext = ConllExtractor(); phrases = TextBlob(text, np_extractor=ext).noun_phrases' \
    'phrases = TextBlob(text, np_extractor=ext).noun_phrases'
5000 loops, best of 5: 98.7 usec per loop

I use timeit's help message as the input text; it looks to be about the
same length as a typical email message, certainly the same order of
magnitude. Also note that I call noun_phrases once in the setup to
eliminate the initial training of the ConllExtractor instance. I don't
know whether ~100us qualifies as long-running or not.
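
Here's the same measurement as a small script, which is easier to read
than the wrapped shell command above (the subprocess call is just my way
of grabbing the help text as a stand-in message body):

import subprocess
import sys
import timeit

from textblob import TextBlob
from textblob.np_extractors import ConllExtractor

# Stand-in text: timeit's help message, roughly the size of a message body.
text = subprocess.run([sys.executable, "-m", "timeit", "--help"],
                      capture_output=True, text=True).stdout

ext = ConllExtractor()
TextBlob(text, np_extractor=ext).noun_phrases   # warm-up: trains the extractor once

best = min(timeit.repeat("TextBlob(text, np_extractor=ext).noun_phrases",
                         globals=globals(), repeat=5, number=5000))
print(f"{best / 5000 * 1e6:.1f} usec per loop")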

I'll keep messing with it.

Skip

