[Python-3000] Unicode and OS strings

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Mon Sep 17 21:12:00 CEST 2007


On Sat, 15-09-2007 at 09:13 +0900, Stephen J. Turnbull wrote:

>  > Well, if any scheme which attempts to modify UTF-8 by accepting
>  > arbitrary byte strings is used, *something* must be interpreted
>  > differently than in real UTF-8.
> 
> Wrong.  In my scheme everything ends up in the PUA, on which real
> UTF-8 imposes no interpretation by definition.

This is wrong: UTF-8 is specified for the PUA. The PUA is not special from the
point of view of UTF-8. UTF-8 is defined for all Unicode scalar values,
i.e. all code points in the ranges U+0000..U+D7FF and U+E000..U+10FFFF,
i.e. all code points excluding surrogates. This includes PUA.
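
As a rough illustration of the point above, the following sketch
encodes two Private Use code points and one lone surrogate. It assumes
a current Python 3 interpreter with its strict "utf-8" codec, which is
not necessarily how things behaved when this message was written; the
sample code points and variable names are chosen only for illustration.

```python
# Sample code points picked for illustration (not from the original message).
pua_bmp = "\ue000"          # first code point of the BMP Private Use Area
pua_plane16 = "\U0010fffd"  # last Private Use code point in plane 16
lone_surrogate = "\ud800"   # a surrogate code point, not a scalar value

# Private Use code points are ordinary scalar values, so they encode
# with the normal UTF-8 bit layout, like any other character.
encoded_bmp = pua_bmp.encode("utf-8")
encoded_plane16 = pua_plane16.encode("utf-8")

# Surrogates are outside the set of scalar values, so the strict
# codec refuses to encode them.
try:
    lone_surrogate.encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True

print(encoded_bmp, encoded_plane16, surrogate_rejected)
```

The Private Use characters also round-trip through decode() unchanged,
which is why a scheme that maps invalid bytes into the PUA would collide
with legitimate PUA text.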

> I haven't gone back to check yet, but it's possible that a "real UTF-8
> conforming process" is required to stop processing and issue an error
> or something like that in the cases we're trying to handle.

"C10. When a process interprets a code unit sequence which purports to
be in a Unicode character encoding form, it shall treat ill-formed code
unit sequences as an error condition and shall not interpret such
sequences as characters."
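
A minimal sketch of what treating ill-formed sequences as an error
condition can look like in practice, again assuming a current Python 3
interpreter and its strict "utf-8" codec. The byte strings and the
helper name is_well_formed_utf8 are hypothetical examples, not taken
from the standard or from this thread.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical inputs chosen for illustration.
samples = {
    "plain ASCII": b"abc",
    "PUA U+E000": b"\xee\x80\x80",
    "stray 0xFF byte": b"abc\xff",
    "overlong NUL": b"\xc0\x80",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
}

results = {name: is_well_formed_utf8(raw) for name, raw in samples.items()}

for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'ill-formed, rejected'}")
```

Only the first two inputs are accepted; the decoder raises rather than
mapping the remaining byte sequences to any character.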

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


