threading

Thu Apr 10 08:56:33 EDT 2014

In article <87wqexmmuc.fsf at elektro.pacujo.net>,
 Marko Rauhamaa <marko at pacujo.net> wrote:

>  * When you wake up from select() (or poll(), epoll()), you should treat
>    it as a hint. The I/O call (accept()) could still raise
>    socket.error(EAGAIN).

People often misunderstand what select() does.  The common misconception 
is that a select()ed descriptor has data waiting to be read.  What the 
man page says is, "A file descriptor is considered ready if it is 
possible to perform the corresponding I/O operation (e.g., read(2)) 
without blocking."  Not blocking includes failing immediately.

And, once you introduce threading, things get even more complicated.  
Imagine two threads, both waiting in a select() call on the same socket.  
Data comes in on that socket.  Both select() calls return.  If both 
threads then do reads on the socket, you've got a race condition.  One 
of them will read the data.  The other will block in the read call, 
because the data has already been read by the other thread!

So, yes, as Marko says, use select() as a hint, but then also do your 
reads in non-blocking mode, and be prepared for them to fail, regardless 
of whether select() said the descriptor was ready.
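
For what it's worth, here is a minimal sketch of that pattern.  It 
assumes Python 3, where a non-blocking recv() with nothing to read 
raises BlockingIOError (older Pythons raise socket.error with EAGAIN 
instead).  The names try_read() and read_when_ready() are made up for 
illustration.

```python
import select
import socket


def try_read(sock, bufsize=4096):
    """Non-blocking read: bytes, b"" on EOF, or None if nothing was there."""
    sock.setblocking(False)
    try:
        return sock.recv(bufsize)
    except BlockingIOError:
        # select() said "ready", but somebody else got there first
        # (or the readiness was spurious).  Not an error; try again later.
        return None


def read_when_ready(sock, timeout=1.0, bufsize=4096):
    """Wait for select()'s hint, then attempt the read without blocking."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None  # timed out; no hint at all
    return try_read(sock, bufsize)
```

The caller has to treat None as "nothing to do right now" and go back 
to its select() loop, rather than assuming a wakeup always yields data.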

>    Note that modern software has to tolerate suspension (laptop lid,
>    virtual machines). Time is a tricky concept when your server wakes up
>    from a coma.

Virtual machines deserve a special mention.  Time is an equally tricky 
concept when your hardware clock is really some other piece of software 
playing smoke and mirrors.  I once worked on a time-sensitive system 
which was running in a VM.  The idiots who had configured the thing were 
running ntpd in the VM, to keep its clock in sync.  Normally, this is a 
good thing, but they were ALSO using the hypervisor's clock management 
gizmo (vmtools?) to adjust the VM clock.  The two mechanisms were 
fighting with each other, which did really weird stuff to time.

It took me forever to figure out what was going on.  How does one even 
observe that time is moving around randomly?  I eventually ended up 
writing a trivial NTP client in Python (it's only a few lines of code) 
and periodically logging the difference between the local system clock 
and what my NTP reference was telling me.  Of course, figuring out what 
was going on was the easy part.  Convincing the IT drones to fix the 
problem was considerably more difficult.
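
This is not the code I used back then, just a sketch of the idea: a 
bare-bones SNTP query that estimates the offset between the local clock 
and a reference server.  The server name pool.ntp.org and the function 
names are assumptions for illustration, and it does no validation of 
the reply (stratum, kiss-of-death, etc.).

```python
import socket
import struct
import time

NTP_DELTA = 2208988800  # seconds between 1900-01-01 (NTP) and 1970-01-01 (Unix)


def ntp_to_unix(seconds, fraction):
    """Convert a 32.32 fixed-point NTP timestamp to Unix seconds."""
    return seconds - NTP_DELTA + fraction / 2**32


def clock_offset(reply, t1, t4):
    """Estimate (server - local) in seconds from a 48-byte SNTP reply.

    t1 is the local send time and t4 the local receive time.
    """
    fields = struct.unpack("!12I", reply[:48])
    t2 = ntp_to_unix(fields[8], fields[9])    # server receive timestamp
    t3 = ntp_to_unix(fields[10], fields[11])  # server transmit timestamp
    return ((t2 - t1) + (t3 - t4)) / 2


def query_offset(server="pool.ntp.org", timeout=2.0):
    """Send one SNTP request (version 3, client mode) and return the offset."""
    request = b"\x1b" + 47 * b"\0"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        t1 = time.time()
        s.sendto(request, (server, 123))
        reply, _ = s.recvfrom(512)
        t4 = time.time()
    return clock_offset(reply, t1, t4)
```

Call query_offset() from a loop every minute or so and log the result 
with a timestamp.  A clock that is being yanked back and forth shows up 
as an offset that jumps around instead of staying near a steady value.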

>  * In each state, check that you handle all possible events and
>    timeouts. The state/transition matrix will be quite sizable even for
>    seemingly simple tasks.

And those boxes in the state transition matrix which are blank because 
those transitions are impossible?  Guess what, they happen, and you'd 
better have a plan for when they do :-)
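
One way to have that plan, sketched under the assumption of a 
table-driven state machine: any (state, event) pair missing from the 
table is logged and forces a reset instead of blowing up.  The states, 
events, and the reset-to-idle policy here are all invented for the 
example; the right recovery depends on the protocol.

```python
import logging

log = logging.getLogger("fsm")

# Only the transitions we believe can happen.  Everything else is a
# "blank box" in the matrix.
TRANSITIONS = {
    ("idle", "connect"): "connecting",
    ("connecting", "connected"): "open",
    ("connecting", "timeout"): "idle",
    ("open", "close"): "idle",
    ("open", "timeout"): "idle",
}


def next_state(state, event):
    """Return the new state, handling "impossible" events explicitly."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        # The blank box got hit.  Record it and fall back to a known state.
        log.warning("unexpected event %r in state %r; resetting", event, state)
        return "idle"
```

For instance, next_state("open", "connected") is not in the table, so 
it logs a warning and returns "idle" rather than raising KeyError in 
the middle of the event loop.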