[Python-Dev] utf8 issue

Guido van Rossum guido@python.org
Fri, 06 Sep 2002 10:06:21 -0400


[MAL, on UTF-8 for unicode]
> Marshal uses it since 1.6. The point is that the fix to the
> lone surrogate problem resulted in a change of the UTF codec
> output. PYCs from unpatched and patched versions wouldn't
> interop if they use lone surrogates in Unicode literals. We
> usually bump the PYC magic in such a case, to avoid these
> issues. Since it's not possible for a patch level release,
> we have two choices:
> 
> 1. leave things as they are
> 
> 2. apply the fix and live with the consequences of having
>     to regenerate PYCs by hand

[but then later]

> One possibility would be to:
> 
> 1. change the UTF-8 encoder in Python 2.2 to produce correct
>     output
> 
> 2. let the UTF-8 decoder in Python 2.2 accept the correct
>     output *and* the malformed output

This sounds like the right solution.  I hope you can produce a patch
against the release22-maint branch.
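
A minimal sketch of the "write one form, accept both" idea, in current
Python 3 rather than the 2.2 codec under discussion. The helper names
encode_literal and decode_literal are hypothetical, and the
"surrogatepass" error handler only stands in for the proposed decoder
work-around; the exact malformed bytes that 2.2 emitted are not given in
this thread and are not reproduced here.

```python
LONE = "\ud800"  # a high surrogate with no low surrogate after it


def encode_literal(text):
    """Writer side: always emit one well-defined byte form.

    Assumption: "surrogatepass" plays the role of the fixed encoder,
    writing a lone surrogate as an ordinary three-byte sequence.
    """
    return text.encode("utf-8", "surrogatepass")


def decode_literal(data):
    """Reader side: accept strict UTF-8, then fall back to the lenient form."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback for data that strict UTF-8 rejects (here: lone surrogates).
        return data.decode("utf-8", "surrogatepass")


# Ordinary text and lone surrogates both survive a round trip.
assert decode_literal(encode_literal("abc")) == "abc"
assert decode_literal(encode_literal("abc" + LONE)) == "abc" + LONE
```

The point of the split is that only the reader carries the compatibility
burden: files written after the fix all look the same, while files
written before it can still be loaded.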

> I am not sure whether 2. would introduce a security problem.
> Perhaps there is a way to restrict the work-around so that
> we don't run into UTF-8 encoding attack problems.

I don't see what this vulnerability (if it is one) adds to the already
laughable security of marshal and .pyc files.  If someone you don't
trust can write your .pyc files, they can cause your interpreter to
crash by inserting bogus bytecode.  So I'd say this is a non-issue.

--Guido van Rossum (home page: http://www.python.org/~guido/)