[I18n-sig] Re: Unicode 3.1 and contradictions.

Markus Kuhn Markus.Kuhn@cl.cam.ac.uk
Thu, 28 Jun 2001 09:20:32 +0100


> It is a bug to encode a non-BMP character with six
> bytes by pretending that the (surrogate) values used in the UTF-16
> representation are BMP characters and encoding the character as
> though it was a string consisting of that character.  It is also a
> bug to interpret such a six-byte sequence as a single character.
> This was clarified in Unicode 3.1.

Fully agreed. Regardless of the letter of the standard, it is essential
for numerous practical security reasons that a UTF-8 decoder accept
exactly one UTF-8 sequence as the encoding of any Unicode character.
ISO 10646 is also very clear that surrogates must not appear in a UTF-8
stream and that their encodings are malformed UTF-8 sequences. Unicode 3.0
was badly flawed in that respect, and that has led to numerous security
problems in fielded implementations.
As I understand it, Unicode 3.1 fixed that, but in any case, no matter what
the standard says, you should definitely follow the advice given in the
UTF-8 decoder robustness test file

  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

and accept only a single representation for every Unicode character;
otherwise you create convenient loopholes that let attackers slip critical
characters past filters that inspect the raw bytes without decoding them.
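
To make the loophole concrete, here is a minimal sketch in Python. Both
lax_decode_2byte and filter_passes are hypothetical names invented for this
illustration; the lax decoder is deliberately unsafe and handles only
1- and 2-byte sequences. It shows how the overlong form 0xC0 0xAF carries
a "/" past a byte-level filter.

```python
def lax_decode_2byte(data: bytes) -> str:
    """Deliberately UNSAFE sketch: no shortest-form check is performed."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            # 2-byte sequence; accepts overlong forms such as C0 AF.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported or malformed sequence")
    return "".join(out)


def filter_passes(data: bytes) -> bool:
    """Hypothetical filter that forbids '/' by looking at raw bytes only."""
    return b"/" not in data


# Overlong 2-byte encoding of U+002F ('/') is C0 AF.
payload = b"..\xc0\xaf..\xc0\xafetc"

# The filter sees no 0x2F byte, yet the lax decoder yields "../../etc".
# Python 3's built-in codec, by contrast, raises UnicodeDecodeError for
# payload.decode("utf-8").
```

The fix is not a smarter filter but a decoder that refuses the overlong
sequence in the first place.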

The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not
allowed in a UTF-8 stream and a secure UTF-8 decoder must never output
any of these characters.
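
The checks described above can be sketched as a small strict decoder in
Python. This is only an illustration under stated assumptions, not a
reference implementation: strict_utf8_decode is a hypothetical name, and
the sketch limits itself to sequences of at most four bytes (code points
up to U+10FFFF). Note that Python 3's built-in "utf-8" codec already
rejects overlong forms and surrogates, but it does accept U+FFFE and
U+FFFF, so the last check here goes beyond it.

```python
def strict_utf8_decode(data: bytes) -> str:
    """Decode UTF-8, accepting exactly one byte sequence per character."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Pick payload bits, continuation count, and the smallest code
        # point that legitimately needs a sequence of this length.
        if b0 < 0x80:
            cp, need, min_cp = b0, 0, 0x0
        elif 0xC0 <= b0 <= 0xDF:
            cp, need, min_cp = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, need, min_cp = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF7:
            cp, need, min_cp = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")
        if i + need >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, need + 1):
            c = data[i + k]
            if not 0x80 <= c <= 0xBF:
                raise ValueError(f"bad continuation byte at offset {i + k}")
            cp = (cp << 6) | (c & 0x3F)
        if cp < min_cp:
            raise ValueError(f"overlong encoding at offset {i}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate U+{cp:04X} at offset {i}")
        if cp in (0xFFFE, 0xFFFF):
            raise ValueError(f"forbidden U+{cp:04X} at offset {i}")
        out.append(chr(cp))
        i += need + 1
    return "".join(out)
```

With this sketch, the six-byte sequence ED A0 80 ED B0 80 (the surrogate
pair for U+10000 encoded as two three-byte sequences) raises ValueError,
while the four-byte form F0 90 80 80 decodes to the single character
U+10000.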

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>