[Python-Dev] PEP 393 Summer of Code Project
Terry Reedy
tjreedy at udel.edu
Wed Aug 24 21:55:09 CEST 2011
On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:
> Terry Reedy writes:
>
> > Excuse me for believing the fine 3.2 manual that says
> > "Strings contain Unicode characters."
>
> The manual is wrong, then, subject to a pronouncement to the contrary,
Please suggest a re-wording then, as it is a bug for doc and behavior to
disagree.
> > For the purpose of my sentence, the same thing in that code points
> > correspond to characters,
>
> Not in Unicode, they do not. By definition, a small number of code
> points (eg, U+FFFF) *never* did and *never* will correspond to
> characters.
On computers, characters are represented by code points. What about the
other way around? http://www.unicode.org/glossary/#C says
code point:
1) i in range(0x110000) <broad definition>
2) "A value, or position, for a character" <narrow definition>
(To muddy the waters more, 'character' has multiple definitions also.)
You are using 1), I am using 2) ;-(.
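To make definition 1) concrete, here is a small illustrative sketch (using chr() and unicodedata purely for demonstration, not part of any proposal):

```python
import unicodedata

# Broad definition: every int in range(0x110000) names a code point,
# whether or not a character is assigned to it.
assert chr(0x10FFFF) == '\U0010FFFF'   # last code point in the codespace
try:
    chr(0x110000)                      # one past the codespace
except ValueError:
    pass                               # not a code point at all

# U+FFFF is a code point under 1) but will never be a character:
assert unicodedata.category('\uffff') == 'Cn'   # "Other, not assigned"
```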
> > Any narrow build string with even 1 non-BMP char violates the
> > standard.
>
> Yup. That's by design.
[...]
> Sure. Nevertheless, practicality beat purity long ago, and that
> decision has never been rescinded AFAIK.
I think you have it backwards. I see the current situation as the purity
of the C code beating the practicality, for the user, of getting right
answers.
> The thing is, that 90% of applications are not really going to care
> about full conformance to the Unicode standard.
I remember when Intel argued that 99% of applications were not going to
be affected when the math coprocessor in its then-new chips occasionally
gave 'non-standard' answers with certain divisors.
> > Currently, the meaning of Python code differs on narrow versus wide
> > build, and in a way that few users would expect or want.
>
> Let them become developers, then, and show us how to do it better.
I posted a proposal with a link to a prototype implementation in Python.
It pretty well solves the problem of narrow builds acting differently
from wide builds with respect to the basic operations of len(),
iteration, indexing, and slicing.
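For illustration, the discrepancy can be modeled with explicit UTF-16 code units, which is what a narrow build stored internally (a sketch, not the prototype itself):

```python
s = '\U0001F600'                 # one non-BMP character

# On a wide build, len() gives the character (code point) count:
assert len(s) == 1

# A narrow build stored this as a surrogate pair, so len() reported 2.
# The pair is visible in the UTF-16 encoding:
units = s.encode('utf-16-le')
assert len(units) // 2 == 2      # two 16-bit code units, one character
```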
> No, I do like the PEP. However, it is only a step, a rather
> conservative one in some ways, toward conformance to the Unicode
> character model. In particular, it does nothing to resolve the fact
> that len() will give different answers for character count depending
> on normalization, and that slicing and indexing will allow you to cut
> characters in half (even in NFC, since not all composed characters
> have fully composed forms).
I believe my scheme could be extended to solve that also. It would
require more pre-processing and more knowledge than I currently have of
normalization. I have the impression that the grapheme problem goes
further than just normalization.
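A small sketch of the normalization point, using the standard library's unicodedata module:

```python
import unicodedata

# The same user-perceived character in two normalization forms:
nfc = unicodedata.normalize('NFC', 'e\u0301')   # composes to U+00E9
nfd = unicodedata.normalize('NFD', '\u00e9')    # decomposes to e + accent
assert len(nfc) == 1
assert len(nfd) == 2

# Slicing the NFD form cuts the character in half:
assert nfd[:1] == 'e'            # dangling base letter, accent lost

# Even NFC cannot always help: q + combining dot below has no
# precomposed form, so the grapheme stays two code points:
assert len(unicodedata.normalize('NFC', 'q\u0323')) == 2
```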
--
Terry Jan Reedy