[Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen
rhamph at gmail.com
Mon Dec 8 08:04:15 CET 2008
On Sun, Dec 7, 2008 at 11:04 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 12/7/2008 9:11 PM, came the following characters from the
> keyboard of Adam Olsen:
>> On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman <v+python at g.nevcal.com>
>> wrote:
>
> Once upon a time I did write an unvalidated UTF-8 encoder/decoder in C; I
> wonder whether I could find that code. Can you supply a validated decoder? Then
> we could run some benchmarks, eh?
There is no point in benchmarking from my side, as the behaviour of a
real UTF-8 codec is clear. It is you who need to justify a second,
non-standard UTF-8-ish codec. See below.
>>> You didn't address the issue that if the decoding to a canonical form is
>>> done first, many of the insecurities just go away, so why throw errors?
>>
>> Unicode is intended to allow interaction between various bits of
>> software. It may be that a library checked it in UTF-8, then passed
>> it to python. It would be nice if the library validated too, but a
>> major advantage of UTF-8 is older libraries (or protocols!) intended
>> for ASCII need only be 8-bit clean to be repurposed for UTF-8. Their
>> security checks continue to work, so long as nobody downstream
>> introduces problems with a non-validating decoder.
>
>
> So I don't understand how this is responsive to the "decoding removes many
> insecurities" issue?
>
> Yes, you might use libraries. Either they have insecurities, or not. Either
> they validate, or not. Either they decode, or not. They may be immune to
> certain attacks, because of their structure and code, or not.
>
> So when you examine a library for potential use, you have documentation or
> code to help you set your expectations about what it does, and whether or
> not it may have vulnerabilities, and whether or not those vulnerabilities
> are likely or unlikely, whether you can reduce the likelihood or prevent the
> vulnerabilities by wrapping the API, etc. And so you choose to use the
> library, or not.
>
> This whole discussion about libraries seems somewhat irrelevant to the
> question at hand, although it is certainly true that understanding how a
> library handles Unicode is an important issue for the potential user of a
> library.
>
> So how does a non-validating decoder introduce problems? I can see that it
> might not solve all problems, but how does it introduce problems? Wouldn't
> the problems be introduced by something else, and the use of a
> non-validating decoder may not catch the problem... but not be the cause of
> the problem?
>
> And then, if you would like to address the original issue, that would be
> fine too.
Your non-validating decoder translates an invalid sequence into a
valid one; thus you are the one introducing the problem. A completely
naive environment (8-bit-clean ASCII) would leave it as an invalid
sequence throughout.
This is not a theoretical problem. See
http://tools.ietf.org/html/rfc3629#section-10. We MUST reject
invalid sequences, or else we are not using UTF-8. There is no wiggle
room, no debate.
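The danger can be sketched concretely. The decoder below is a deliberately
non-validating, hypothetical one (the names `naive_decode` and `payload` are
mine, not from this thread); it shows the overlong-"/" attack that RFC 3629
section 10 describes, and how Python's real strict codec blocks it:

```python
def naive_decode(data: bytes) -> str:
    """Assemble code points without checking for overlong forms,
    surrogates, truncation, or stray continuation bytes."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                  # 1-byte (ASCII)
            cp, n = b, 1
        elif b < 0xE0:                # 2-byte lead
            cp, n = b & 0x1F, 2
        elif b < 0xF0:                # 3-byte lead
            cp, n = b & 0x0F, 3
        else:                         # 4-byte lead
            cp, n = b & 0x07, 4
        for j in range(1, n):
            cp = (cp << 6) | (data[i + j] & 0x3F)
        out.append(chr(cp))
        i += n
    return "".join(out)

payload = b"..\xc0\xaf.."   # 0xC0 0xAF: overlong encoding of "/" (U+002F)

# The ASCII-era, byte-level path check passes: no 0x2F byte present...
assert b"/" not in payload

# ...but the lax decoder resurrects the slash:
print(naive_decode(payload))          # ../..

# Python's real UTF-8 codec refuses the sequence outright:
try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict codec rejects it:", exc)
```

So any byte-level security check done upstream of a lax decoder can be
bypassed, which is exactly why the RFC leaves no choice about rejection.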
(The absoluteness is why the standard behaviour doesn't need a
benchmark. You are essentially arguing that, when logging in as root
over the internet, it's a lot faster if you use telnet rather than
ssh. One is simply not an option.)
--
Adam Olsen, aka Rhamphoryncus