collect data using threads

Kent Johnson kent37 at tds.net
Tue Jun 14 14:11:09 EDT 2005


Qiangning Hong wrote:
> I actually had considered Queue and pop() before I wrote the above code.
>  However, because there is a lot of data to get every time I call
> get_data(), I want a more CPU friendly way to avoid the while-loop and
> empty checking, and then the above code comes out.  But I am not very
> sure whether it will cause serious problem or not, so I ask here.  If
> anyone can prove it is correct, I'll use it in my program, else I'll go
> back to the Queue solution.

OK, here is a real failure mode. Here is the code and the disassembly:
 >>> class Collector(object):
 ...     def __init__(self):
 ...         self.data = []
 ...     def on_received(self, a_piece_of_data):
 ...         """This callback is executed in work bee threads!"""
 ...         self.data.append(a_piece_of_data)
 ...     def get_data(self):
 ...         x = self.data
 ...         self.data = []
 ...         return x
 ...
 >>> import dis
 >>> dis.dis(Collector.on_received)
  6           0 LOAD_FAST                0 (self)
              3 LOAD_ATTR                1 (data)
              6 LOAD_ATTR                2 (append)
              9 LOAD_FAST                1 (a_piece_of_data)
             12 CALL_FUNCTION            1
             15 POP_TOP
             16 LOAD_CONST               1 (None)
             19 RETURN_VALUE
 >>> dis.dis(Collector.get_data)
  8           0 LOAD_FAST                0 (self)
              3 LOAD_ATTR                1 (data)
              6 STORE_FAST               1 (x)

  9           9 BUILD_LIST               0
             12 LOAD_FAST                0 (self)
             15 STORE_ATTR               1 (data)

 10          18 LOAD_FAST                1 (x)
             21 RETURN_VALUE

Imagine the thread calling on_received() gets as far as LOAD_ATTR (data), LOAD_ATTR (append) or LOAD_FAST (a_piece_of_data), so it already holds a reference to the current self.data list; then it is suspended and the get_data() thread runs. That thread could call get_data() and *finish processing the returned list* before the on_received() thread runs again and actually appends to the list. The append then lands in the old list, which nobody will look at again, so the appended value is never processed.
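
The interleaving can be replayed in a single thread, which makes the lost value easy to see. This is only a sketch: the names pending_append, batch, processed and lost are made up for the illustration, and taking the bound method by hand stands in for a worker thread that was suspended just before CALL_FUNCTION.

```python
class Collector(object):
    def __init__(self):
        self.data = []

    def on_received(self, a_piece_of_data):
        self.data.append(a_piece_of_data)

    def get_data(self):
        x = self.data
        self.data = []
        return x


c = Collector()
c.on_received("first")

# "Worker thread": LOAD_ATTR (data) and LOAD_ATTR (append) have run,
# but the call itself has not happened yet.
pending_append = c.data.append

# "Consumer thread": swaps the list out and finishes processing it.
batch = c.get_data()
processed = list(batch)

# "Worker thread" resumes: the append goes to the old list object.
pending_append("second")

# The value is neither in what was processed nor in the new self.data.
lost = "second" not in processed and "second" not in c.data
print(processed, c.data, lost)
```

The value does end up in the old list object (batch), but the consumer has already finished with that list, so it is never seen.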

If you want to avoid the overhead of a Queue.get() for each data element, you could just put your own mutex (for example a threading.Lock) around the bodies of on_received() and get_data().
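
A minimal sketch of that idea, assuming a plain threading.Lock is acceptable and that your Python has the with statement (otherwise use acquire() and release() in a try/finally). LockedCollector, produce() and the thread counts are hypothetical names and numbers chosen for the example, not part of the original code.

```python
import threading


class LockedCollector(object):
    def __init__(self):
        self.data = []
        self._lock = threading.Lock()

    def on_received(self, a_piece_of_data):
        """This callback is executed in worker threads."""
        # The attribute lookup and the append happen under the lock,
        # so get_data() cannot swap the list in between.
        with self._lock:
            self.data.append(a_piece_of_data)

    def get_data(self):
        # Swap the list under the same lock; the caller then owns x.
        with self._lock:
            x = self.data
            self.data = []
        return x


def produce(collector, n):
    for i in range(n):
        collector.on_received(i)


collector = LockedCollector()
workers = [threading.Thread(target=produce, args=(collector, 1000))
           for _ in range(4)]
for w in workers:
    w.start()

received = []
while any(w.is_alive() for w in workers):
    received.extend(collector.get_data())
for w in workers:
    w.join()
received.extend(collector.get_data())  # pick up anything left over

print(len(received))
```

Each get_data() call still returns a whole batch, so the lock is taken once per batch on the consumer side rather than once per element.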

Kent


