[Python-Dev] Python-3.0, unicode, and os.environ

Adam Olsen rhamph at gmail.com
Mon Dec 8 08:04:15 CET 2008


On Sun, Dec 7, 2008 at 11:04 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 12/7/2008 9:11 PM, came the following characters from the
> keyboard of Adam Olsen:
>> On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman <v+python at g.nevcal.com>
>> wrote:
>
> Once upon a time I did write an unvalidated UTF-8 encoder/decoder in C;
> I wonder if I could find that code.  Can you supply a validated decoder?
> Then we could run some benchmarks, eh?

There is no point for me, as the behaviour of a real UTF-8 codec is
clear.  It is you who needs to justify a second non-standard UTF-8-ish
codec.  See below.


>>> You didn't address the issue that if the decoding to a canonical form is
>>> done first, many of the insecurities just go away, so why throw errors?
>>
>> Unicode is intended to allow interaction between various bits of
>> software.  It may be that a library checked it in UTF-8, then passed
>> it to python.  It would be nice if the library validated too, but a
>> major advantage of UTF-8 is older libraries (or protocols!) intended
>> for ASCII need only be 8-bit clean to be repurposed for UTF-8.  Their
>> security checks continue to work, so long as nobody down stream
>> introduces problems with a non-validating decoder.
>
>
> So I don't understand how this is responsive to the "decoding removes many
> insecurities" issue?
>
> Yes, you might use libraries.  Either they have insecurities, or not. Either
> they validate, or not.  Either they decode, or not.  They may be immune to
> certain attacks, because of their structure and code, or not.
>
> So when you examine a library for potential use, you have documentation or
> code to help you set your expectations about what it does, and whether or
> not it may have vulnerabilities, and whether or not those vulnerabilities
> are likely or unlikely, whether you can reduce the likelihood or prevent the
> vulnerabilities by wrapping the API, etc.  And so you choose to use the
> library, or not.
>
> This whole discussion about libraries seems somewhat irrelevant to the
> question at hand, although it is certainly true that understanding how a
> library handles Unicode is an important issue for the potential user of a
> library.
>
> So how does a non-validating decoder introduce problems?  I can see that it
> might not solve all problems, but how does it introduce them?  Wouldn't
> the problems be introduced by something else, so that a non-validating
> decoder merely fails to catch the problem, rather than causing it?
>
> And then, if you would like to address the original issue, that would be
> fine too.

Your non-validating decoder is translating an invalid sequence into a
valid one; thus you are introducing the problem.  A completely naive
environment (8-bit-clean ASCII) would leave it as an invalid sequence
throughout.

This is not a theoretical problem.  See
http://tools.ietf.org/html/rfc3629#section-10 .  We MUST reject
invalid sequences, or else we are not using UTF-8.  There is no wiggle
room, no debate.

(The absoluteness is why the standard behaviour doesn't need a
benchmark.  You are essentially arguing that, when logging in as root
over the internet, it's a lot faster if you use telnet rather than
ssh.  One is simply not an option.)


-- 
Adam Olsen, aka Rhamphoryncus
