[I18n-sig] Re: Unicode 3.1 and contradictions.

Guido van Rossum guido@digicool.com
Thu, 28 Jun 2001 10:51:30 -0400


[Markus]
> > > The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not
> > > allowed in a UTF-8 stream and a secure UTF-8 decoder must never output
> > > any of these characters.

[Guido]
> > Can you explain a bit more about the security issues?

[Markus]
> There are two ways of processing UTF-8 encoded UCS text:
> 
>   a) as a UTF-8 bytestream
>   b) as a stream of decoded integer code values (32-bit wchar_t, etc.)
> 
> Problems arise if security-relevant checks are done in one
> representation and interpretation of the data is done in the other.
> 
> Imagine you have an application with the following processing steps:
> 
>   - read a UTF-8 string
>   - apply a substring test to convince yourself that certain characters
>     are not present in the string
>   - decode UTF-8
>   - use the decoded string in an application where presence of the
>     tested characters could be security critical

I'd say that the security implementation of such an application is
broken -- the check should have been done on the final data.  It
seems you are trying to patch up a legacy system the wrong way.  Or am
I missing something?  How can this be a common pattern?
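
A minimal sketch of "check the final data", assuming a present-day
Python 3 rather than the 2001 interpreter under discussion.  The helper
name is_safe_path is hypothetical; the point is only that the substring
test runs on the decoded text that the application will actually use.

```python
def is_safe_path(raw: bytes) -> bool:
    """Decode first, then screen the decoded text (hypothetical helper)."""
    try:
        # Strict decoding: malformed UTF-8 raises instead of passing through.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # The check runs on the same representation the application consumes.
    return "/../" not in text
```

Because the test and the later use see the same string, an alternative
byte-level spelling of '/../' cannot slip between them.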

> The classical example is a Win32 web server, where a UTF-8 URL is fed
> in, tested by a script in UTF-8 to be free of the byte sequence '/../',
> and then UTF-8 decoded and fed into a UTF-16 API for file system access.
> Even though the presence of '/../' encoded in ASCII was filtered out,
> the same character sequence can still be passed past the filter by a
> clever attacker using alternative encodings that an unsafe UTF-8 decoder
> might accept, for instance an overlong sequence for any of the
> characters.

Here you are assuming an unsafe UTF-8 decoder.  I agree that a UTF-8
decoder that accepts overlong sequences is broken.
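
To illustrate what rejecting overlong sequences looks like, here is a
small sketch assuming a current CPython 3 (not the codec as it stood in
2001).  The helper name decodes is made up for the example; the byte
strings are the two-byte and three-byte overlong spellings of "/".

```python
overlong_slash_2 = b"\xc0\xaf"      # overlong two-byte form of U+002F
overlong_slash_3 = b"\xe0\x80\xaf"  # overlong three-byte form of U+002F

def decodes(raw: bytes) -> bool:
    """Return True if raw is accepted by the strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

With a strict decoder, only the one-byte form b"/" is accepted, so a
byte-level filter for '/' and the decoded result cannot disagree.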

But we were talking about isolated surrogates.  How can passing
through *isolated* surrogates cause a security violation?  It's not an
overlong sequence!  (Assuming the decoder does the right thing for
surrogate *pairs*.)
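
For reference, a sketch of how the two cases can be told apart, assuming
a current CPython 3 (whose behavior differs from the 2001 codec being
discussed).  The helper try_decode is hypothetical; "surrogatepass" is a
standard error handler of the utf-8 codec in Python 3.

```python
lone_surrogate = b"\xed\xa0\x80"   # UTF-8-style byte sequence for U+D800
non_bmp = b"\xf0\x90\x80\x80"      # well-formed UTF-8 for U+10000

def try_decode(raw: bytes, errors: str = "strict"):
    """Return the decoded string, or None if the decoder rejects raw."""
    try:
        return raw.decode("utf-8", errors)
    except UnicodeDecodeError:
        return None
```

In this setup the strict decoder rejects the lone surrogate, while the
"surrogatepass" handler lets it through as a single code point.  Neither
case involves an overlong sequence.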

> This problem is most severe with non-ASCII representations of ASCII
> characters by overlong UTF-8 sequences, because ASCII characters often
> have many special functions associated with them, but it also occurs
> with other tests. For example, it should be perfectly legitimate to
> test a UTF-8 string to be free of non-BMP characters by simply testing
> that no byte >= 0xF0 is present, without the far less efficient use of
> a UTF-8 decoder.

Why is testing for non-BMP characters part of a security screening?
Maybe you are worried that an application will over-index some table
prepared for the BMP only.  But Python already protects against
over-indexing with an exception.

Why would you want a security screening of the UTF-8 stream when
you're going to decode it eventually?  If you *have* to check that no
decoded character is >= 2**16, it would be faster to fold the security
screening into the UTF-8 codec than to run a separate scan.
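
A rough sketch of the two approaches, assuming present-day Python 3.
Both function names are hypothetical.  The first only approximates
"folding the check into the codec" by testing right after decoding; a
real implementation would sit inside the decoder loop.  The second is
the byte-level scan; it relies on the fact that only four-byte
sequences, whose lead byte is 0xF0 or higher, encode code points above
U+FFFF.

```python
def decode_bmp_only(raw: bytes) -> str:
    """Decode UTF-8 and refuse any code point >= 2**16 (hypothetical)."""
    text = raw.decode("utf-8")
    for ch in text:
        if ord(ch) >= 2**16:
            raise ValueError("non-BMP character U+%06X" % ord(ch))
    return text

def has_non_bmp_lead_byte(raw: bytes) -> bool:
    """Byte-level scan: True if any byte could start a 4-byte sequence."""
    return any(b >= 0xF0 for b in raw)
```

The byte-level scan is only meaningful if the decoder used later is
strict, which is the assumption behind the argument quoted above.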

> Other risks are people smuggling a UTF-8 encoded U+FFFE or U+FFFF into a
> system, which when decoded into UTF-16 might be interpreted as an
> instruction to swap the byte sex (anti-BOM) or as some generic
> escape-or-end-of-string/file character (U+FFFF).

These aren't isolated surrogates, so they would fall under a different
rule (currently they pass through Python's UTF-8 codec just fine).  I
have the feeling that you want the UTF-8 decoder to make up for all
the sloppy coding practices that might be used in the application.

> The golden rule that there must be exactly one single UTF-8 byte
> sequence that can result in the output of a certain Unicode character
> and that Unicode code positions reserved for special non-character use
> such as U+D800..U+DFFF, U+FFFE, and U+FFFF should never be generated by
> a UTF-8 decoder eliminates all these potential pitfalls.

Sorry, you haven't convinced me that these tests should be applied by
Python's standard UTF-8 codec.  Also, your use of "such as" suggests
that the collection of dangerous code points is open-ended, but I find
that hard to believe (since legacy codecs won't be updated).

> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

--Guido van Rossum (home page: http://www.python.org/~guido/)