[Security-sig] PEP 522: Allow BlockingIOError in security sensitive APIs on Linux

Thu Jun 23 20:33:22 EDT 2016

On 23 June 2016 at 15:54, Victor Stinner <victor.stinner at gmail.com> wrote:
>> The new exception would potentially be encountered in the following situations:
>>
>> * Python code calling these APIs during Linux system initialization
>
> I'm not sure that there is such use case in practice.
>
> Can you please try to describe a use case where you would need
> blocking system urandom *during the Python initialization*?
>
> It looks like my use case 1, but I consider that os.urandom() is *not*
> called on such use case:
> https://haypo-notes.readthedocs.io/pep_random.html#use-case-1-init-script

My preference for an exception comes from the fact that we can never
prove the non-existence of proprietary software that does certain
things, but we *can* ensure that such code gets an easy to debug
exception rather than a potential deadlock if it does exist.

The argument chain runs:

- if such software doesn't exist, it doesn't matter which behaviour we choose
- if we're wrong and it does exist, we can choose how it fails:
  - blocking (with associated potential for init system deadlock)
  - throwing an exception

Given the choice between debugging an apparent system hang and an
unexpected exception when testing against a new version of a platform,
I'll choose the exception every time.

>> * Python code running on improperly initialized Linux systems (e.g. embedded
>>   hardware without adequate sources of entropy to seed the system random number
>>   generator, or Linux VMs that aren't configured to accept entropy from the
>>   VM host)
>
> If the program doesn't use os.urandom(), well, we don't care, there is
> no issue :-)
>
> IMO the interesting use case is when the application really requires
> secure secret. That's my use case 2, a web server:
> https://haypo-notes.readthedocs.io/pep_random.html#use-case-2-web-server
>
> I chose to not give the choice to the developer and block on such
> case. IMO it's acceptable because the application should not have to
> wait forever for urandom.

Should not, but actually can, depending on the characteristics of the
underlying system and its runtime environment.

>> Changing ``os.urandom()`` on Linux
>> ----------------------------------
>>
>> This PEP proposes that in Python 3.6+, ``os.urandom()`` be updated to call
>> the new Linux ``getrandom()`` syscall in non-blocking mode if available and
>> raise ``BlockingIOError: system random number generator is not ready`` if
>> the kernel reports that the call would block.
>
> To be clear, the behaviour is unchanged on other platforms, right?

Cory Benfield pointed out that the proposal as currently written isn't
clear as to whether or not it applies to recent versions of Solaris
and Illumos, as they also provide a getrandom() syscall.

> I'm just trying to understand the scope of the PEP. It looks like as
> mine, it is written for Linux. (Even if other platforms may implement
> the same behaviour later, if needed.)
>
> If it's deliberate to restrict to Linux, you may be more explicit at
> least in the abstract.

It's in the PEP title: "Allow BlockingIOError in security sensitive
APIs on Linux"

However, I need to update it to indicate it applies to any system that
provides a non-blocking getrandom() syscall.
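
As a rough sketch of what "any system that provides a non-blocking
getrandom() syscall" could look like from Python code: the helper below
assumes an interpreter that exposes os.getrandom() and os.GRND_NONBLOCK
(later CPython releases do on Linux; nothing in this thread guarantees
it elsewhere). The helper name try_nonblocking_random is hypothetical,
chosen only for illustration.

```python
import os


def try_nonblocking_random(num_bytes):
    """Return random bytes, or None if they can't be had without blocking.

    Hypothetical helper: relies on os.getrandom() and os.GRND_NONBLOCK
    being exposed by the running interpreter and platform.
    """
    getrandom = getattr(os, "getrandom", None)
    nonblock = getattr(os, "GRND_NONBLOCK", None)
    if getrandom is None or nonblock is None:
        # This platform/interpreter doesn't expose the syscall
        return None
    try:
        return getrandom(num_bytes, nonblock)
    except OSError:
        # Covers BlockingIOError (kernel RNG not seeded yet) as well as
        # kernels that lack the syscall despite the attribute existing
        return None
```

Returning None in both the "not available" and "not ready" cases keeps
the sketch short; real code would likely want to tell them apart.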

> --
>
> By the way, are you aware of other programming languages or
> applications using an exception when random would block? (It's not a
> requirement, I'm just curious.)

No, but I haven't really gone looking either. It's also worth keeping
in mind that it's only in the last 12 months folks have even had the
*option* of doing better than just reading from /dev/urandom and
hoping it's been initialised properly.

>> By contrast, if ``BlockingIOError`` is raised in those situations, then
>> developers using Python 3.6+ can easily choose their desired behaviour:
>>
>> 1. Loop until the call succeeds (security sensitive)
>
> Is this case different from a blocking os.urandom()?

Yes, as it's up to the application to decide when it wants to check
for the system RNG being ready, and how it wants to report that to the
user. For example, it may decide to emit a runtime warning before it
enters the busy loop (I'm actually having a discussion with Donald in
another thread regarding a possible design for a
"secrets.wait_for_system_rng()" API that meshes well with the other
changes proposed in PEP 522).
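
A minimal sketch of that kind of helper, purely as an illustration and
not the design under discussion: it assumes a callable (read_random, a
hypothetical parameter) that behaves like the os.urandom() proposed in
PEP 522, i.e. it raises BlockingIOError while the system RNG is not
ready. The poll_interval and timeout parameters are likewise
assumptions.

```python
import time
import warnings


def wait_for_system_rng(read_random, poll_interval=0.1, timeout=None):
    """Poll until read_random(1) stops raising BlockingIOError.

    read_random is a hypothetical stand-in for a non-blocking
    os.urandom(). Returns the number of attempts that would have
    blocked; re-raises BlockingIOError once the timeout expires.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    failures = 0
    while True:
        try:
            read_random(1)
            return failures
        except BlockingIOError:
            if failures == 0:
                # Tell the user once, before we start waiting
                warnings.warn("system RNG not ready; waiting",
                              RuntimeWarning)
            failures += 1
            if deadline is not None and time.monotonic() >= deadline:
                raise
            # Sleep between attempts instead of spinning the CPU
            time.sleep(poll_interval)
```

The point is only that the application, not the interpreter, decides
whether to warn, how often to poll, and when to give up.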

>> 2. Switch to using the random module (non-security sensitive)
>
> Hum, I disagree on this point. I don't think that you should start
> with os.urandom() to fall back on random.
>
> In fact, I only know *one* use case for this: create the random.Random
> instance when the random module is imported.
>
> In my PEP, I proposed to have a special case for random.Random
> constructor, implemented in C (to not have to expose anything at the
> Python level).

We have two use cases for a fallback just in the standard library
(SipHash initialisation and random module initialisation). Rather than
assuming no other use cases for the feature exist, we can expose the
fallback mechanism we use ourselves and let people decide for
themselves whether or not they want to do something similar.

>> 3. Switch to reading ``/dev/urandom`` directly (non-security sensitive)
>
> It is what I propose for the random.Random constructor when the random
> module is imported.
>
> Again, the question is if there is a real use case for it. And if yes,
> is the use case common enough to justify the change?
>
> The extreme case is that all applications using os.urandom() would
> need to be modified to add a try/except BlockingIOError. I only
> exaggerate to try to understand the impact of your PEP. I think that
> only a few applications will use such try/except in practice.

That's where the idea of also adding secrets.wait_for_system_rng()
comes in, rather than having to wrap every library call in a try/except
block (or risk having those APIs become blocking ones such that async
developers feel obliged to call them in a separate thread).

> As I tried to explain in my PEP, with Python 3.5.2, "the bug" (block
> on random) became very unlikely.

Aye, I agree with that (hence the references to this being an obscure,
Linux-specific problem in PEP 522). However, I think it makes sense to
stipulate that someone porting to Python 3.6 *has* unexpectedly
encountered the new behaviour, and is trying to debug what has gone
wrong with their application/system when comparing the two designs for
usability.

>> Issuing a warning for potentially predictable internal hash initialization
>
> I don't recall Python logging warnings for similar issues. But I don't
> recall similar issues either :-)

It's a pretty unique problem, and not one we've been able to detect
in the past.

>> The challenge for internal hash initialization is that it might be very
>> important to initialize SipHash with a reliably unpredictable random seed
>> (for processes that are exposed to potentially hostile input) or it might be
>> totally unimportant (for processes that never have to deal with untrusted data).
>
> From what I read, /dev/urandom is good even before it is considered as
> initialized, because the kernel collects various data, but doesn't
> increase the entropy estimator.
>
> I'm not completely convinced that a warning is needed. I'm not against
> it either. I am doubtful. :-)
>
> Well, let's say that we have a warning. What should the user do in
> such case? Is it an advice to dig the urandom issue and try to get
> more entropy?
>
> The warning is for users, no? I imagine that an application can work
> perfectly for the developer, but only emit the warning for some users
> depending on how they deploy their application.

It's a warning primarily for system integrators (i.e. the folks
developing a distro, designing an embedded device or configuring a VM)
that they need to either:

- reconfigure the application to start later in the boot process (e.g.
after the network comes up)
- write a systemd ExecStartPre snippet that waits for the system RNG to be
initialised (that will be particularly easy if it can be written as
"python3 -c 'import secrets; secrets.wait_for_system_rng()'")
- add a better entropy source to their system

The kind of wording I'm thinking of is along the lines of:

"Python hash initialization: using potentially predictable fallback
hash seed; avoid handling untrusted potentially hostile data in this
process"

>> However, at the same time, since Python has no way to know whether any given
>> invocation needs to handle untrusted data, when the default SipHash
>> initialization fails this *might* indicate a genuine security problem, which
>> should not be allowed to pass silently.
>
> An alternative would be to provide a read-only flag which would
> indicate if the hash secret is considered as "secure" or not.
>
> Applications concerned by security would check the flag and decide
> themselves to emit a warning or not.

I really don't want to add any more knobs and dials that need to be
documented and learned if we can possibly avoid it (and I think we
can).

In this case, turning off hash randomisation entirely will suppress
the warning along with hash randomisation itself.
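
For reference, the existing switch here is the PYTHONHASHSEED
environment variable: setting it to 0 disables hash randomisation. The
sketch below only demonstrates that existing behaviour (two fresh
interpreters agree on a string hash); whether the proposed warning is
also suppressed depends on how the PEP ends up being implemented. The
helper name str_hash_in_child is made up for the example.

```python
import os
import subprocess
import sys


def str_hash_in_child(hash_seed):
    """Return hash('abc') as computed by a fresh interpreter."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('abc'))"], env=env
    )
    return int(out)


# With randomisation disabled, every process computes the same hash
first = str_hash_in_child("0")
second = str_hash_in_child("0")
print(first == second)
```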

>> Accordingly, if internal hash initialization needs to fall back to a potentially
>> predictable seed due to the system random number generator not being ready, it
>> will also emit a warning message on ``stderr`` to say that the system random
>> number generator is not available and that processing potentially hostile
>> untrusted data should be avoided.
>
> I know that many of you disagree with me, but I'm not sure that the
> hash DoS is an important issue.
>
> We should not overestimate the importance of this vulnerability.

It was never particularly important (the payload multiplier on the
Denial-of-Service isn't that big), but it was high profile and
splashy, and it's relatively cheap to take into account (since folks
that know it doesn't apply to them can still turn randomization off
entirely)

>> Affected security sensitive applications
>> ----------------------------------------
>>
>> Security sensitive applications would need to either change their system
>> configuration so the application is only started after the operating system
>> random number generator is ready for security sensitive operations, or else
>> change their code to busy loop until the operating system is ready::
>>
>>     def blocking_urandom(num_bytes):
>>         while True:
>>             try:
>>                 return os.urandom(num_bytes)
>>             except BlockingIOError:
>>                 pass
>
> Such busy-loop may use a lot of CPU :-/ You need a time.sleep() or
> something like that, no?

Maybe - we can work out the exact details once I've added the
secrets.wait_for_system_rng() proposal to the PEP.

> A blocking os.urandom() doesn't have such issue ;-)

It also doesn't let an app fail gracefully if it opts not to support
running without a pre-initialised system RNG :)
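
To make "fail gracefully" concrete, here is a small sketch under the
same assumption as the PEP: a non-blocking os.urandom() that raises
BlockingIOError. Since that behaviour is only proposed, the function
takes the reader as a parameter (read_random, a hypothetical name), and
require_system_rng is likewise invented for this example.

```python
import sys


def require_system_rng(read_random):
    """Exit with a clear message if the system RNG is not ready yet.

    read_random is a hypothetical stand-in for the proposed
    non-blocking os.urandom(); today's os.urandom() may block or
    return data instead of raising.
    """
    try:
        read_random(1)
    except BlockingIOError as exc:
        # Stop at startup with an easy-to-debug message rather than
        # hanging or silently continuing with weak randomness
        sys.exit("refusing to start: system RNG not ready ({})".format(exc))
```

An application would call this once at startup, so the operator sees an
explicit error instead of an apparent hang.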

> Is it possible that os.urandom() works, but the following os.urandom()
> call raises a BlockingIOError? If yes, there is an issue with "partial
> read", we should use a dedicated exception to return partial data.

No, it's not possible with os.urandom(). (It *can* happen with
/dev/random and with getentropy() on OpenBSD and Solaris, which is why
folks say "don't use those for anything")

> Hopefully, I understood that the issue doesn't occur in practice.
> os.urandom() starts with BlockingIOError. But once it "works", it will
> work forever. Well, at least on Linux.
>
> I don't know how Solaris behaves. I hope that it behaves as Linux
> (once it works, it always works). At least, I see that Solaris
> getrandom() can also fail with EAGAIN.

It's the same logic as Linux (once a CSPRNG is properly seeded it can
never run out of entropy, but seeding it in the first place does
require entropy collection)

>> Affected non-security sensitive applications
>> --------------------------------------------
>>
>> Non-security sensitive applications that don't want to assume access to
>> ``/dev/urandom`` (or assume a non-blocking implementation of that device)
>> can be updated to use the ``random`` module as a fallback option::
>>
>>     def pseudorandom_fallback(num_bytes):
>>         try:
>>             return os.urandom(num_bytes)
>>         except BlockingIOError:
>>             return random.getrandbits(num_bytes*8).to_bytes(num_bytes, "little")
>>
>> Depending on the application, it may also be appropriate to skip accessing
>> ``os.urandom`` at all, and instead rely solely on the ``random`` module.
>
> Hum, I dislike such change. It overcomplicates applications for a corner-case.
>
> If you use os.urandom(), you already expect security. I prefer to
> simplify use cases to two cases: (1) you really need security (2) you
> really don't care of security. If you don't care, use directly the
> random module. Don't bother with os.urandom() nor having to add
> try/except BlockingIOError. No?
>
> I *hope* that a regular application will never see BlockingIOError on
> os.urandom() in the wild.

Yeah, hence why I'm shifting more in favour of the
secrets.wait_for_system_rng() idea (which folks can then use as
inspiration to write their own "wait for the system RNG" helpers for
earlier Python and operating system versions)

>> Affected Linux specific non-security sensitive applications
>> -----------------------------------------------------------
>>
>> Non-security sensitive applications that don't need to worry about cross
>> platform compatibility and are willing to assume that ``/dev/urandom`` on
>> Linux will always retain its current behaviour can be updated to access
>> ``/dev/urandom`` directly::
>>
>>     def dev_urandom(num_bytes):
>>         with open("/dev/urandom", "rb") as f:
>>             return f.read(num_bytes)
>
> Again, I'm against adding such complexity for a corner case. Just use
> os.urandom().

All of this would be triggered by *application* developers actually
hitting the BlockingIOError and deciding which course of action is
appropriate for *their* application. The point of this part of the
PEP is to highlight that there are some really simple 3-5 line functions
that let developers get a wide variety of behaviours in ways that are
compatible with single-source Python 2/3 code.

>> For additional background details beyond those captured in this PEP, also see
>> Victor Stinner's summary at http://haypo-notes.readthedocs.io/pep_random.html
>
> Oh, I didn't expect to have references to my document :-) I moved it to:
> https://haypo-notes.readthedocs.io/summary_python_random_issue.html
>
> http://haypo-notes.readthedocs.io/pep_random.html is now really a PEP ;-)

Cool, I'll update the first reference and also add a reference to your
draft PEP.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia