[Security-sig] PEP 522: Allow BlockingIOError in security sensitive APIs on Linux

Fri Jun 24 20:07:37 EDT 2016

On 24 June 2016 at 16:21, Victor Stinner <victor.stinner at gmail.com> wrote:
> 2016-06-24 22:05 GMT+02:00 Nick Coghlan <ncoghlan at gmail.com>:
>> As such, the idioms I currently have in PEP 522 are wrong - the "wait
>> for the system RNG or not" decision wouldn't be one to be made on a
>> per-call basis, but rather on a per-__main__ execution basis, with
>> developers choosing which user experience they want to support on
>> systems with a non-blocking /dev/urandom:
>>
>> * this application will fail if you run it before the system RNG is
>> ready (so you may need to add "ExecStartPre=python3 -c 'import
>> secrets; secrets.wait_for_system_rng()'" in your systemd unit file)
>
> In short, if an application is not run using systemd but directly on
> the command line, it *can* fail with a fatal BlockingIOError?

From the command line, the answer is equally simple: just run "python3
-c 'import secrets; secrets.wait_for_system_rng()'" before the command
you actually care about.

As an added bonus, that will work even if the command you care about
isn't written in Python 3, and even if it reads from /dev/urandom
rather than using the new syscall.

> Wait, I don't think that it is an acceptable behaviour from the user
> point of view.
>
> Compared to Python 2.7, Python 3.4 and Python 3.5.2 where os.urandom()
> never blocks nor raises an exception on Linux, such behaviour change
> can be seen as a major regression.

The *only* way to get it to block (your PEP) or raise an exception
(PEP 522) is to call os.urandom() (directly or indirectly) when the
kernel RNG isn't ready - I consider the relevant analogy to be to PEP
476, where we turned the silent security failure of accepting an
invalid or untrusted certificate (or one that didn't cover the named
host) into the noisy error of failing to make the connection.

>> * this application implicitly calls "secrets.wait_for_system_rng()"
>> and hence may block waiting for the system RNG if you run it before
>> the system RNG is ready
>
> It's hard to guess if os.urandom() is used in a third-party library.
> Maybe it's not. What if a new library version starts to use
> os.urandom()? Should you start to call secrets.wait_for_system_rng()?
>
> To be safe, I expect that *all* applications should start with
> secrets.wait_for_system_rng()... It doesn't make sense to have to put
> such code in *all* applications.

Application developers porting to Python 3.6 can wait and see what
their own testing reports and what their users report - they don't
need to guess.

> The main advantage of the PEP 522 is to control how the "system
> urandom not initialized yet" case is handled. But you are more and
> more saying that secrets.wait_for_system_rng() should be used to not
> get BlockingIOError in most cases. Am I wrong?

I'm saying I think it's an application level decision, not a library
level decision.

> I expect that some libraries will start to use
> secrets.wait_for_system_rng() in their own code.
>
> ... At the end, it looks you basically reimplemented a blocking
> os.urandom(), no?

Potentially, but one of the important aspects of PEP 522 is that we're
not imposing that outcome by fiat - we're letting developers choose
the behaviour they want on a case by case basis, and seeing what the
emergent consensus on correct behaviour turns out to be.

It's equally possible that the outcome will be that both Python and
Linux developers conclude that this is an operating system integration
issue, so systemd ends up adding a standard "kernelrng" target that
components can wait for, and that then gets included as a requirement
for getting to the singleuser state on most distros.

If we *do* reach a point where "always call
secrets.wait_for_system_rng() before using secrets,
random.SystemRandom or os.urandom" is the idiomatic advice for
Pythonistas, *then* we can make os.urandom() blocking, and
secrets.wait_for_system_rng() would be reduced to:

    import os

    def wait_for_system_rng():
        os.urandom(1)  # would block until the system RNG is ready

> --
>
> Why do we have to bother *all* users with
> secrets.wait_for_system_rng(), while only a very few will really care
> of the exceptional case?

We don't - only the ones that actually get the exception, since
they're necessarily the ones the problem is relevant to. Runtime
system configuration related exceptions aren't something to be avoided
at all costs - if they were, we'd never have made the changes we did
to the way Unicode handling works.

A good example of this at the library level is Armin Ronacher's click
command line helper - when you run that in the C locale under Python
3, it just fails immediately, since the actual problem is that
something has gone wrong and your system locale isn't configured
properly. The right answer is almost always to fix the locale
configuration settings, not to change anything in the Python code.

> Why not adding something for users who want to handle the exceptional
> case, but make os.urandom() blocking?

The main problem I have with the blocking solution is that if someone
hits it unexpectedly, they're left staring at a blinking cursor (at
best), with no helpful hints to get them started on debugging the problem.
If it's a component they didn't write, they also can't really give a
good bug report beyond "It hangs when I try to run it".

By contrast, PEP 522 gives them an immediate exception and error
message: "BlockingIOError: system random number generator is not
ready".
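
A minimal sketch of that fail-fast behaviour, assuming os.getrandom()
and os.GRND_NONBLOCK are available on the platform (the function name
urandom_or_raise is hypothetical, and this is not the actual PEP 522
implementation):

```python
import errno
import os


def urandom_or_raise(size):
    """Return `size` random bytes, or raise if the RNG is not ready."""
    getrandom = getattr(os, "getrandom", None)
    if getrandom is None:
        # Assumption: platforms without getrandom() just use os.urandom().
        return os.urandom(size)
    try:
        # GRND_NONBLOCK makes the call fail with EAGAIN instead of
        # waiting when the kernel RNG is not yet initialised.
        # Note: this sketch assumes small requests (up to 256 bytes),
        # which are returned in full once the RNG is ready.
        return getrandom(size, os.GRND_NONBLOCK)
    except BlockingIOError:
        raise BlockingIOError(
            errno.EAGAIN, "system random number generator is not ready"
        ) from None
```

An application that prefers to wait could catch the exception at the
top level of __main__ and retry after waiting for the RNG, rather than
handling it at every call site.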

If they're a developer themselves, they can plug that into Google and
hopefully find a relevant answer (which we can virtually guarantee by
preseeding Stack Overflow with a suitable response).

If they're *not* the application developer, they can paste the
traceback into a bug report or support ticket and say "Hey, what's
going on here?". At which point, the developer or support tech
handling the ticket can do the appropriate Google search and respond
accordingly.

Now, we could gain most of those debuggability benefits for a blocking
solution by trying in non-blocking mode first, then falling back to
blocking only if we get EAGAIN - that would let us print a
Google-friendly warning message before we implicitly block.
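
Sketched in Python rather than C, and assuming os.getrandom() and
os.GRND_NONBLOCK are available, that fallback might look something
like the following (urandom_with_warning is a hypothetical name used
purely for illustration):

```python
import os
import sys


def urandom_with_warning(size):
    """Try a non-blocking read first; warn, then block, on EAGAIN."""
    getrandom = getattr(os, "getrandom", None)
    if getrandom is None:
        # Assumption: platforms without getrandom() just use os.urandom().
        return os.urandom(size)
    try:
        # Non-blocking attempt: raises BlockingIOError (EAGAIN) if the
        # kernel RNG has not been initialised yet.
        return getrandom(size, os.GRND_NONBLOCK)
    except BlockingIOError:
        # Emit a searchable message before implicitly blocking.
        print(
            "warning: system random number generator is not ready; "
            "waiting for it to be initialised",
            file=sys.stderr,
        )
        # With flags=0 the call waits until the RNG is ready.
        # Note: this sketch assumes small requests (up to 256 bytes).
        return getrandom(size)
```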

That's where the argument of adopting a consistent approach of "try
non-blocking first, then maybe fall back to something else if it
doesn't work" comes into play - if os.urandom() (and hence indirectly
the secrets module) is trying in non-blocking mode and falling back to
an alternative, *and* SipHash initialisation is doing that, *and*
importing the random module is doing that, it sends a strong message
to me that the base primitive here is actually "try to read the system
RNG, and maybe fail to do so", rather than "read the system RNG and
only return when the requested data is available".

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia