Thread safety issue (I think) with defaultdict

Israel Brewster israel at ravnalaska.net
Wed Nov 1 13:04:58 EDT 2017


Let me rephrase the question to see if I can simplify it. I need to access a defaultdict from two different threads: one thread that populates the dictionary in response to user requests, and a second thread that keeps the dictionary updated as new data comes in. Each value in the dictionary is a timestamp, with the default value being datetime.min, provided by a lambda:

lambda: datetime.min
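
For reference, a minimal sketch of how that default behaves. The name last_seen and the aircraft identifiers are made up for illustration. The point worth noting is that merely reading a missing key stores the default, which changes len():

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical stand-in for the shared dictionary described above.
last_seen = defaultdict(lambda: datetime.min)

# Reading a missing key calls the factory and stores the result.
first_read = last_seen["N123AB"]        # hypothetical aircraft identifier
size_after_read = len(last_seen)        # the read itself created the entry

# .get() and "in" do not trigger the factory, so they add nothing.
peek = last_seen.get("N456CD")
present = "N456CD" in last_seen
size_after_peek = len(last_seen)

print(first_read, size_after_read, peek, present, size_after_peek)
```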

At the moment my code is behaving as though each thread has a *separate* defaultdict, even though debugging shows the same address in both places: the background update thread never sees the data populated into the defaultdict by the main thread. I was thinking a race condition or the like might cause one particular loop of the background thread to run before the main thread populates the dictionary, but even so, subsequent loops should pick up the changes made by the main thread.
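
One check that might help narrow this down, offered as a sketch rather than a diagnosis: in CPython, id() is only meaningful within a single process, so matching addresses alone do not prove both code paths see one object. Logging the process ID and thread name next to the address and length shows whether the two print statements really run in the same process. The helper describe() and the identifier "N123AB" are hypothetical.

```python
import os
import threading
from collections import defaultdict
from datetime import datetime

last_points = defaultdict(lambda: datetime.min)

def describe(label, obj):
    """Return one line saying which process and thread is looking at obj."""
    return "{}: pid={} thread={} id={:#x} len={}".format(
        label, os.getpid(), threading.current_thread().name, id(obj), len(obj))

def background(results):
    # Runs in the second thread and records what that thread sees.
    results.append(describe("background", last_points))

# Populate from the main thread, then look again from a second thread.
last_points["N123AB"] = datetime(2017, 11, 1, 13, 4, 58)
results = [describe("main", last_points)]
worker = threading.Thread(target=background, args=(results,), name="watcher")
worker.start()
worker.join()

for line in results:
    print(line)
```

If the pid values differ in the real program, the two sides are separate processes with separate dictionaries, and no amount of locking inside one process would make them share data.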

How can I *properly* share a dictionary-like object between two threads, with both threads seeing the updates made by the other?
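
A minimal sketch of one common approach, assuming both sides really are threads in the same process: keep a single dictionary behind a threading.Lock and do the compare-and-update while holding the lock. The class LastSeen, its method names, and the aircraft identifiers are hypothetical and not part of the actual code. Unlike the initial load described in the quoted message, this sketch applies the "only if newer" rule to every write.

```python
import threading
from collections import defaultdict
from datetime import datetime

class LastSeen:
    """Hypothetical wrapper: one dict and one lock, shared by every thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._points = defaultdict(lambda: datetime.min)

    def update_if_newer(self, aircraft, when):
        """Store `when` only if it is newer than the stored value."""
        with self._lock:
            if self._points[aircraft] < when:
                self._points[aircraft] = when
                return True
            return False

    def snapshot(self):
        """Return a plain-dict copy taken while holding the lock."""
        with self._lock:
            return dict(self._points)

last_seen = LastSeen()

def watcher(points):
    # Background thread: apply incoming data points in arrival order.
    for aircraft, when in points:
        last_seen.update_if_newer(aircraft, when)

# Main thread: initial population, e.g. from disk storage.
last_seen.update_if_newer("N123AB", datetime(2017, 10, 31, 9, 0))
last_seen.update_if_newer("N456CD", datetime(2017, 10, 31, 9, 5))

incoming = [
    ("N123AB", datetime(2017, 10, 31, 8, 0)),   # older than stored, ignored
    ("N123AB", datetime(2017, 11, 1, 13, 0)),   # newer, replaces stored value
    ("N789EF", datetime(2017, 11, 1, 13, 1)),   # first sighting
]
t = threading.Thread(target=watcher, args=(incoming,))
t.start()
t.join()

result = last_seen.snapshot()
print(len(result))
```

The lock makes the read-compare-write sequence atomic, so an out-of-order data point cannot overwrite a newer timestamp between the comparison and the assignment.
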
-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------




> On Oct 31, 2017, at 9:38 AM, Israel Brewster <israel at ravnalaska.net> wrote:
> 
> A question that has arisen before (for example, here: https://mail.python.org/pipermail/python-list/2010-January/565497.html) is "is defaultdict thread safe", with the answer generally being a conditional "yes". The condition is what is used as the default value: apparently default values of Python types, such as list, are thread safe, whereas more complicated constructs, such as lambdas, make it not thread safe. In my situation, I'm using a lambda, specifically:
> 
> lambda: datetime.min
> 
> So presumably *not* thread safe.
> 
> My goal is to have a dictionary of aircraft and when they were last "seen", with datetime.min being effectively "never". When a data point comes in for a given aircraft, its timestamp will be compared with the value in the defaultdict for that aircraft, and if the data point's timestamp is newer, the defaultdict will be updated with that timestamp (the value from the data point, not necessarily the current time). Note that data points do not necessarily arrive in chronological order (for various reasons not applicable here, it's just the way it is), thus the need for the comparison.
> 
> When the program first starts up, two things happen:
> 
> 1) a thread is started that watches for incoming data points and updates the dictionary as per above, and
> 2) the dictionary should get an initial population (in the main thread) from hard storage.
> 
> The behavior I'm seeing, however, is that when step 2 happens (which generally happens before the thread gets any updates), the dictionary gets populated with 56 entries, as expected. However, none of those entries are visible when the thread runs. It's as though the thread is getting a separate copy of the dictionary, although debugging says that is not the case - printing the variable from each location shows the same address for the object.
> 
> So my questions are:
> 
> 1) Is this what it means to NOT be thread safe? I was thinking of race conditions where individual values may get updated wrong, but this apparently is overwriting the entire dictionary.
> 2) How can I fix this?
> 
> Note: I really don't care if the "initial" update happens after the thread receives a data point or two, and therefore overwrites one or two values. I just need the dictionary to be fully populated at some point early in execution. In usage, the dictionary is used to see if an aircraft has been seen "recently", so if the most recent data point gets overwritten with a slightly older one from disk storage, that's fine. The problem is only when an entry still shows datetime.min because we haven't received any data point since we launched the program, even though we have "recent" data in disk storage. So I don't care about the obvious race condition between the two operations, just that the end result is a populated dictionary. Note also that as data points come in, they are being written to disk, so the disk storage doesn't lag significantly anyway.
> 
> The framework of my code is below:
> 
> File: watcher.py
> 
> from collections import defaultdict
> from datetime import datetime
> 
> # Shared module-level dictionary: aircraft identifier -> newest timestamp seen
> last_points = defaultdict(lambda: datetime.min)
> 
> # This function is launched as a thread using the threading module when the first client connects
> def watch():
> 	while True:
> 		<wait for datapoint>
> 		pointtime = <extract/parse timestamp from datapoint>
> 		if last_points[<aircraft_identifier>] < pointtime:
> 			<do stuff>
> 			last_points[<aircraft_identifier>] = pointtime
> 			#DEBUGGING
> 			print("At update:", len(last_points))
> 
> 
> File: main.py
> 
> from .watcher import last_points
> 
> # This function will be triggered by a web call from a client, so could happen at any time
> # Client will call this function immediately after connecting, as well as in response to various user actions.
> def getac():
> 	<load list of aircraft and times from disk>
> 	<do stuff to send the list to the client>
> 	for record in aclist:
> 		last_points[<aircraft_identifier>]=record_timestamp
> 	#DEBUGGING
> 	print("At get AC:", len(last_points))
> 
> 



