[Python-Dev] Python-3.0, unicode, and os.environ

Sun Dec 7 19:56:35 CET 2008

On Sun, Dec 7, 2008 at 11:18 AM, Michael Urman <murman at gmail.com> wrote:
> On Sun, Dec 7, 2008 at 11:35, Adam Olsen <rhamph at gmail.com> wrote:
>>>> http://bugs.python.org/issue3672
>>>> http://bugs.python.org/issue3297
>>
>> No.  Unicode *requires* them to be treated as errors.  If you want to
>> pass them through then you're creating a custom encoding... which you
>> might argue for in this case, but it needs to be clearly separate from
>> the real UTF-8.
>
> I suspect it is a common and convenient but (according to what you
> say) misconceived expectation that using UTF-8 to encode any Unicode
> string will not raise an exception. This behavior is not something
> which should be discarded lightly.

It is *not* a valid Unicode string in the first place.  Therein lies
the problem.

> I see little reason that this couldn't be a new codec or error handler
> that allowed people to choose between correct pure UTF-8 behavior or
> the technically incorrect but very practical behavior it currently
> has.

Note that many of the restrictions were added for security reasons.
You might receive a UTF-8 encoded file name from a malicious user,
check if it contains something dangerous (like
"../../../../../etc/passwd"), then decode it.  If your decoder isn't
compliant (i.e. doesn't reject overlong sequences) then a
b'\xC0\xAF' gets translated into u'/', bypassing your previous check.
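
To make the overlong case concrete, here is a rough Python 3 sketch.
lenient_decode() is a hypothetical, deliberately non-compliant decoder
written only for illustration (it handles just 1- and 2-byte forms);
it is not any real codec.  The byte string and the naive slash check
are likewise made up for the example.

```python
def lenient_decode(data: bytes) -> str:
    """Hypothetical non-compliant decoder: accepts overlong 2-byte forms."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            # 2-byte form, with no check that the result is >= 0x80.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("byte not handled by this sketch")
    return "".join(out)


def strict_rejects(data: bytes) -> bool:
    """Return True if Python's built-in UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Each b'\xc0\xaf' is an overlong encoding of '/'.
raw = b"..\xc0\xaf..\xc0\xafetc\xc0\xafpasswd"

# A byte-level check done before decoding sees no slash at all.
check_passed = b"/" not in raw

# The lenient decoder then produces the path the check was meant to stop.
decoded = lenient_decode(raw)
```

With the built-in strict decoder the same input is rejected outright,
so the check and the decoder cannot disagree about what the bytes mean.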

However, in this context we only need to allow lone surrogates.
CESU-8 comes to mind.  (It is a perverse world we live in.)
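
For illustration only, a sketch of what "allow lone surrogates" could
look like in Python 3.  It leans on the "surrogatepass" error handler,
which as far as I know only appeared in Python 3.1 (so after this
thread), and cesu8_like_encode() is a hypothetical helper that
approximates CESU-8, not a registered codec.

```python
def strict_encode_fails(text: str) -> bool:
    """Return True if the standard UTF-8 codec refuses to encode text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def cesu8_like_encode(text: str) -> bytes:
    """Hypothetical CESU-8 style encoder (an approximation for illustration).

    Characters above U+FFFF are split into their UTF-16 surrogate pair,
    and each surrogate is written as its own 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        if ord(ch) > 0xFFFF:
            units = ch.encode("utf-16-be")  # two 16-bit code units
            for k in (0, 2):
                unit = int.from_bytes(units[k:k + 2], "big")
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # Lone surrogates are passed through instead of rejected.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


lone = "\ud800"

# Real UTF-8 treats the lone surrogate as an error.
rejected = strict_encode_fails(lone)

# The relaxed handler writes it as a 3-byte sequence and reads it back.
relaxed = lone.encode("utf-8", "surrogatepass")
round_trip = relaxed.decode("utf-8", "surrogatepass")

# U+10000 takes 4 bytes in UTF-8 but 6 bytes in the CESU-8 style form.
cesu = cesu8_like_encode("\U00010000")
```

The point of keeping this as a separate handler or codec is that the
default "utf-8" stays strict and only callers who ask for the relaxed
behaviour get it.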

-- 
Adam Olsen, aka Rhamphoryncus