Threading question .. am I doing this right?

Chris Angelico rosuav at gmail.com
Thu Feb 24 15:03:54 EST 2022


On Fri, 25 Feb 2022 at 06:54, Robert Latest via Python-list
<python-list at python.org> wrote:
>
> I have a multi-threaded application (a web service) where several threads need
> data from an external database. That data is quite a lot, but it is almost
> always the same. Between incoming requests, timestamped records get added to
> the DB.
>
> So I decided to keep an in-memory cache of the DB records that gets only
> "topped up" with the most recent records on each request:

Depending on your database, this might be counter-productive. A
PostgreSQL database running on localhost, for instance, has its own
caching, and data transfers between two apps running on the same
computer can be pretty fast. The complexity you add in order to do
your own caching might be giving you negligible benefit, or even a
penalty. I would strongly recommend benchmarking the naive "keep going
back to the database" approach first, as a baseline, and only testing
these alternatives when you've confirmed that the database really is a
bottleneck.
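
A quick way to get that baseline is to time the naive approach against a cached read. The sketch below uses an in-memory SQLite table as a stand-in for the external database (the schema, row counts, and `query_all` helper are hypothetical, purely for illustration); the same timing structure applies to any real backend:

```python
import sqlite3
import time

# Stand-in for the external database: an in-memory SQLite table.
# Schema and data here are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (ts REAL, payload TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(10_000)])
conn.commit()

def query_all():
    return conn.execute("SELECT ts, payload FROM records").fetchall()

n_requests = 100

# Baseline: hit the database on every "request".
start = time.perf_counter()
for _ in range(n_requests):
    rows = query_all()
naive = time.perf_counter() - start

# Cached: fetch once, then serve every request from memory.
start = time.perf_counter()
cache = query_all()
for _ in range(n_requests):
    rows = cache
cached = time.perf_counter() - start

print(f"naive:  {naive:.4f}s for {n_requests} requests")
print(f"cached: {cached:.4f}s for {n_requests} requests")
```

If the naive numbers are already well within your latency budget, the extra caching machinery may not pay for itself.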

> Since it is better to quickly serve the client with slightly outdated data than
> not at all, I came up with the "impatient" solution below. The idea is that an
> incoming request triggers an update query in another thread, waits for a short
> timeout for that thread to finish and then returns either updated or old data.
>
> from threading import Lock, Thread
>
> class MyCache():
>     def __init__(self):
>         self.cache = None
>         self.thread_lock = Lock()
>         self.update_thread = None
>
>     def _update(self):
>         new_records = query_external_database()
>         if self.cache is None:
>             self.cache = new_records
>         else:
>             self.cache.extend(new_records)
>
>     def get_data(self):
>         if self.cache is None:
>             timeout = 10  # allow more time to get initial batch of data
>         else:
>             timeout = 0.5
>         with self.thread_lock:
>             if self.update_thread is None or not self.update_thread.is_alive():
>                 self.update_thread = Thread(target=self._update)
>                 self.update_thread.start()
>                 self.update_thread.join(timeout)
>
>         return self.cache
>
> my_cache = MyCache()
>
> My question is: Is this a solid approach? Am I forgetting something? For
> instance, I believe that I don't need another lock to guard self.cache.extend()
> because _update() can only ever run in one thread at a time. But maybe I'm
> overlooking something.

Hmm, it's complicated. There is another approach, and that's to
completely invert your thinking: instead of "request wants data, so
let's get data", have a thread that periodically updates your cache
from the database, and then all requests return from the cache,
without ever blocking the requester. Downside: It'll be querying the
database fairly frequently. Upside: Very simple, very easy, no
difficulties debugging.

How many requests per second does your service process? (By
"requests", I mean things that require this particular database
lookup.) What's average throughput, what's peak throughput? And
importantly, what sorts of idle times do you have? For instance, if
you might have to handle 100 requests/second, but there could be
hours-long periods with no requests at all (eg if your clients are all
in the same timezone and don't operate at night), that's a very
different workload from 10 r/s constantly throughout the day.

ChrisA
