Using time.sleep() in 2 threads causes lockup when hyper-threading is enabled

Tim Peters tim.peters at gmail.com
Fri May 5 00:24:43 EDT 2006


[Tim Peters]
>> I didn't run it for hours ;-)

[OlafMeding at gmail.com]
> Please try.

OK, I let the first test program run for over 24 hours by now.  It
never hung.  Overnight, the box did go into sleep mode, but the test
woke itself up after sleep mode ended, and the threads reported they
were sleeping for about 8 hours then.

I did this under a debug build of Python, so that if it did hang I'd
have _some_ chance of getting useful info from the VC 7.1 debugger.
It's possible (but unlikely) that using a debug-build Python prevented
a hang that would have occurred had I used a release-build Python.
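
The test program itself isn't reproduced in this message, so here is
only a rough sketch of the kind of test under discussion: two threads
that each call time.sleep() in a loop and record how long every sleep
really took.  The names sleeper() and run_test(), the interval, and the
join timeout are all made up for illustration; a real soak test would
loop for hours rather than a handful of rounds.

```python
import threading
import time

def sleeper(name, interval, rounds, report):
    """Sleep repeatedly and record how long each sleep really took."""
    for _ in range(rounds):
        start = time.monotonic()
        time.sleep(interval)
        elapsed = time.monotonic() - start
        report.append((name, elapsed))  # list.append is thread-safe in CPython

def run_test(interval=0.01, rounds=5, nthreads=2, join_timeout=5.0):
    """Run `nthreads` sleepers; return their timings and any stuck threads."""
    report = []
    threads = [
        threading.Thread(target=sleeper,
                         args=("t%d" % i, interval, rounds, report),
                         daemon=True)  # daemon: a stuck thread won't block exit
        for i in range(nthreads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(join_timeout)
    # A thread still alive after the join timeout never came back from sleep.
    hung = [t.name for t in threads if t.is_alive()]
    return report, hung

if __name__ == "__main__":
    report, hung = run_test()
    print("sleeps completed:", len(report), "hung threads:", hung)
```

A thread that is still alive after the join timeout is the symptom
described in the report: a sleep that never returned.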

> The sleep statement does not return!  And this should not happen.  The
> code above does nothing special or unusual.  The problem only occurs if
> 2 threads use the sleep statement and hyper-threading is enabled.
>
> We discovered this bug perhaps a year ago.  The only solution was to
> tell our customers to disable hyper-threading (you can imagine they did
> not like our "solution" very much).  It then took many days of hard
> work to isolate the problem down to the code I posted above.

As before, since Python merely calls the Win32 API Sleep() function,
it's extremely unlikely that the problem is due to Python.

It's quite possible that the problem is due to a tiny timing hole in
MS's implementation of Sleep().  Since I don't have the source code
for that, and disassembling is prohibited by my license ;-), I can't
pursue that.

I've seen software with honest-to-God thread-race bugs that were never
reported across years of heavy production use, until a user tried the
code on a hyper-threaded or multi-core box.  Tiny timing holes can be
_extremely_ shy, and running on a box with true, or even just
finer-grained (like HT), concurrency can make them much more likely to
appear.  I've never seen a software failure with (eventually) known
cause occur on an HT box that could not have happened on a non-HT box.
The statistics of extremely unlikely events aren't a natural fit to
the unabused human mind ;-)
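
To put rough numbers on that: if a timing hole is hit with some tiny
probability p on each call, the chance of seeing it at least once in n
calls is 1 - (1 - p)**n.  The values of p and n below are invented
purely for illustration (nobody has measured them for Sleep()), and
p_at_least_once() is a hypothetical helper name.

```python
import math

def p_at_least_once(p_per_call, calls):
    """Chance that an event of per-call probability p occurs in n calls.

    Computes 1 - (1 - p)**n via log1p/expm1 so it stays accurate
    when p is extremely small.
    """
    return -math.expm1(calls * math.log1p(-p_per_call))

if __name__ == "__main__":
    # Hypothetical per-call probabilities, NOT measured values.
    for p in (1e-12, 1e-9):
        for n in (10**6, 10**9):
            print("p=%g n=%g -> %.6f" % (p, n, p_at_least_once(p, n)))
```

The point of the sketch is only that a modest increase in the per-call
probability, or in the number of calls, can turn "never seen in years"
into "seen routinely".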

> ...
> Once the application locks up (getting stuck in a sleep statement) all
> comes back to life if I pull the network cable out.  So perhaps the
> socket thread returns from the sleep statement and other threads return
> to life because they were waiting for the socket thread.

That's peculiar.  time.sleep() called from a thread other than the
main thread on Windows is non-interruptible (because the Windows
Sleep() function is non-interruptible).  time.sleep() called from the
main thread on Windows _is_ interruptible:  the main thread uses the
Win32 API WaitForSingleObject() instead, passing a handle to a custom
interrupt event; the _intent_ is so that a time.sleep() in the main
thread can be aborted by a keyboard interrupt.  But it shouldn't be
possible for anything other than a keyboard interrupt or a normal
(timeout) return to knock that loose.

So if unplugging the cable knocks things loose, that just points even
more strongly at a bug in the Windows kernel.
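
As a pure-Python analogy for the interruptible main-thread sleep
described above (a timed wait on an event, rather than a plain sleep),
here is a small sketch using threading.Event.  This is not CPython's
internal code; interruptible_sleep() is a hypothetical helper, and the
Timer stands in for whatever would signal the interrupt.

```python
import threading
import time

def interruptible_sleep(seconds, interrupt):
    """Wait up to `seconds`; return True if `interrupt` was set first.

    Event.wait() returns True when the event is set and False when
    the timeout expires, so only those two things can end the wait.
    """
    return interrupt.wait(timeout=seconds)

def demo():
    interrupt = threading.Event()
    # Stand-in for an interrupt: set the event after 50 ms.
    threading.Timer(0.05, interrupt.set).start()
    start = time.monotonic()
    was_interrupted = interruptible_sleep(5.0, interrupt)
    return was_interrupted, time.monotonic() - start

if __name__ == "__main__":
    print(demo())
```

Only setting the event or running out the timeout ends the wait, which
mirrors the claim that nothing else should be able to knock such a
sleep loose.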

> Our software runs on both Windows and Linux.  We are not sure if the
> problem also happens on Linux.

Well, I ran your test for a day on my Windows box -- you try running
it for a week on your Linux box ;-)

> ...
> We have searched the Internet far and wide and were not able to find
> any information that indicates that someone else has reported a similar
> problem (neither Python nor Windows Sleep related).

I'm pretty sure I would have known about it if anyone reported such a
Python bug in the last 15 years.  But this is the first time I've
heard it.  I don't keep up with Microsoft bug reports at all.


