Thread safety issue (I think) with defaultdict

Israel Brewster israel at ravnalaska.net
Thu Nov 2 12:27:41 EDT 2017


> On Nov 1, 2017, at 4:53 PM, Steve D'Aprano <steve+python at pearwood.info> wrote:
> 
> On Thu, 2 Nov 2017 05:53 am, Israel Brewster wrote:
> 
> [...]
>> So the end result is that the thread that "updates" the dictionary, and the
>> thread that initially *populates* the dictionary are actually running in
>> different processes.
> 
> If they are in different processes, that would explain why the second
> (non)thread sees an empty dict even after the first thread has populated it:
> 
> 
> # from your previous post
>> Length at get AC:  54 ID: 4524152200  Time: 2017-11-01 09:41:24.474788
>> Length At update:  1 ID: 4524152200  Time: 2017-11-01 09:41:24.784399
>> Length At update:  2 ID: 4524152200  Time: 2017-11-01 09:41:25.228853
> 
> 
> You cannot rely on IDs being unique across different processes. It's an
> unfortunate coincidence(!) that they end up with the same ID.

I think it's more than a coincidence, given that it is 100% reproducible. Plus, in an earlier debug test I was calling print() on the defaultdict object, which gives output like "<defaultdict object at 0x1066467f0>", where presumably the 0x1066467f0 is a memory address (correct me if I am wrong in that). In every case, that address was the same. So still a bit puzzling.
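For what it's worth, matching ids across fork()ed processes are actually expected in CPython, where id() is the object's memory address: a forked child gets a copy-on-write duplicate of the parent's address space, so the copy of the dict sits at the same virtual address. A minimal sketch (Unix only; uWSGI forks its workers the same way):

```python
import os
from collections import defaultdict

d = defaultdict(list)
r, w = os.pipe()

pid = os.fork()                  # Unix only
if pid == 0:
    # Child: its address space is a copy-on-write duplicate of the parent's,
    # so id(d) -- the object's memory address in CPython -- is unchanged.
    os.write(w, str(id(d)).encode())
    os._exit(0)

os.waitpid(pid, 0)
child_id = int(os.read(r, 64))
print(child_id == id(d))         # True: same "address", but a different process
```

Same id, two different objects in two different processes -- which would fit the symptoms exactly.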

> 
> Or possibly there's some sort of weird side-effect or bug in Flask that, when
> it shares the dict between two processes (how?) it clears the dict.

Well, it's UWSGI that is creating the processes, not Flask, but that's semantics :-) The real question is "how does Python handle such situations?" because, really, I wouldn't expect any difference between what is happening here and what would happen if you created a new process using the multiprocessing library and referenced a variable created outside that process.

In fact, I may have to try exactly that, just to see what happens.
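Something like the following sketch (assuming fork-based process creation, as on Linux) should show it: the child starts with a snapshot of the dict, and its own writes never make it back to the parent.

```python
import multiprocessing as mp
from collections import defaultdict

data = defaultdict(list)
data["init"].append("from parent")   # populated before the child starts

def update():
    # Runs in the child process: this mutates the child's private copy only.
    data["update"].append("from child")

if __name__ == "__main__":
    ctx = mp.get_context("fork")     # match uWSGI's fork-based workers (Unix)
    p = ctx.Process(target=update)
    p.start()
    p.join()
    # The parent still sees only its own data; the child's write is invisible.
    print(dict(data))                # {'init': ['from parent']}
```

Each process ends up with an independent copy, no matter how "shared" the module-level variable looks.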

> 
> Or... have you considered the simplest option, that your update thread clears
> the dict when it is first called? Since you haven't shared your code with us,
> I cannot rule out a simple logic error like this:
> 
> def launch_update_thread(dict):
>    dict.clear()
>    # code to start update thread

Actually, I did share my code. It's towards the end of my original message. I cut stuff out for readability/length, but nothing having to do with the dictionary in question. So no, clear is never called, nor any other operation that could clear the dict.

> 
> 
>> In fact, any given request could be in yet another 
>> process, which would seem to indicate that all bets are off as to what data
>> is seen.
>> 
>> Now that I've thought through what is really happening, I think I need to
>> re-architect things a bit here. 
> 
> Indeed. I've been wondering why you are using threads at all, since there
> doesn't seem to be any benefit to initialising the dict and updating it in
> different thread. Now I learn that your architecture is even more complex. I
> guess some of that is unavailable, due to it being a web app, but still.

What it boils down to is this: I need to update this dictionary in real time as data flows in. Doing that update in a separate thread lets it happen without interfering with the operation of the web app, and offloads to the OS the decision of when to switch between them. There *are* other ways to do this, such as gevent greenlets or asyncio, but simply spinning off a separate thread is the easiest/simplest option, and since it is a single long-running thread, the overhead of starting it (as opposed to gevent-style interleaving) is of no consequence.
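As a sketch of the pattern (the feed function and the key are hypothetical stand-ins for the real data source):

```python
import threading
import time
from collections import defaultdict

positions = defaultdict(list)
lock = threading.Lock()          # guards compound read-modify-write updates

def feed():
    # Hypothetical stand-in for the real-time data feed.
    for i in range(3):
        with lock:
            positions["N12345"].append(i)
        time.sleep(0.01)

updater = threading.Thread(target=feed, daemon=True)  # dies with the main process
updater.start()
updater.join()                   # in the web app this thread would run forever
print(positions["N12345"])       # [0, 1, 2]
```

Within one process this works exactly as intended; the trouble only starts when the "other thread" turns out to live in a different process.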

As far as the initialization goes, that happens in response to a user request, at which point I am querying the data anyway (since the user asked for it). The idea is: since I already have the data, why not save it in this dict rather than waiting for new data to come in? I could, of course, do a separate request for the data in the same thread that updates the dict, but there doesn't seem to be any point in that, since until someone requests the data, I don't need it for anything.
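The pattern is basically populate-on-demand (query_db here is a hypothetical stand-in for the real database query):

```python
cache = {}

def query_db(tail):
    # Hypothetical stand-in for the real database query.
    return {"tail": tail, "positions": []}

def get_aircraft(tail):
    # The first request for a tail number pays the query cost; after that,
    # the background update thread keeps the cached entry current.
    if tail not in cache:
        cache[tail] = query_db(tail)
    return cache[tail]

print(get_aircraft("N12345")["tail"])    # N12345
```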

> 
> 
>> For one thing, the update thread should be 
>> launched from the main process, not an arbitrary UWSGI worker. I had
>> launched it from the client connection because there is no point in having
>> it running if there is no one connected, but I may need to launch it from
>> the __init__.py file instead. For another thing, since this dictionary will
>> need to be accessed from arbitrary worker processes, I'm thinking I may need 
>> to move it to some sort of external storage, such as a redis database
> 
> That sounds awful. What if the arbitrary worker decides to remove a bunch of
> planes from your simulation, or insert them? There should be one and only one
> way to insert or remove planes from the simulation (I **really** hope it is a
> simulation).

UWSGI uses worker processes to respond to requests from web clients. What can and can't be done from the web interface is, of course, completely up to me as the developer, and may well include modifying basic data structures. HOW the requests are handled, however, is completely up to UWSGI.

> 
> Surely the right solution is to have the worker process request whatever
> information it needs, like "the next plane", and have the main process
> provide the data. Having worker processes have the ability to reach deep into
> the data structures used by the main program and mess with them seems like a
> good way to have mind-boggling bugs.

Except the worker processes *are* the main program. That's how UWSGI works - it launches a number of worker processes to handle incoming web requests. It's not like I have a main process that is doing something, and *additionally* a bunch of worker processes. While I'm sure UWSGI does have a "master" process it uses to control the workers, that's all an internal implementation detail of UWSGI, not something I deal with directly. I just have the flask code, which doesn't deal with or know about separate processes at all. The only exception is the one *thread* I launch (not process, thread) to handle the background updating.

> 
> 
> 
>> Oy, I made my life complicated :-)
> 
> "Some people, when confronted with a problem, think, 'I know, I'll use
> threads. Nothew y htwo probave lems."
> 
> :-)

Actually, that saying is about regular expressions, not threads :-) . In the end, threads are as good a way of handling concurrency as any other, and simpler than many. They have their drawbacks, of course, mainly in the area of overhead, and of course only multiprocessing can *really* take advantage of multiple cores/CPUs on a machine, but unlike regular expressions, threads aren't ugly or complicated. Only the details of dealing with concurrency make things complicated, and you have to deal with those in *any* concurrency model.

> 
> 
> 
> -- 
> Steve
> “Cheer up,” they said, “things could be worse.” So I cheered up, and sure
> enough, things got worse.
> 
> -- 
> https://mail.python.org/mailman/listinfo/python-list



