[Python-Dev] PEP 393 Summer of Code Project
Terry Reedy
tjreedy at udel.edu
Wed Aug 24 21:55:09 CEST 2011
On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:
> Terry Reedy writes:
>
> > Excuse me for believing the fine 3.2 manual that says
> > "Strings contain Unicode characters."
>
> The manual is wrong, then, subject to a pronouncement to the contrary,
Please suggest a re-wording then, as it is a bug for doc and behavior to
disagree.
> > For the purpose of my sentence, the same thing in that code points
> > correspond to characters,
>
> Not in Unicode, they do not. By definition, a small number of code
> points (eg, U+FFFF) *never* did and *never* will correspond to
> characters.
On computers, characters are represented by code points. What about the
other way around? http://www.unicode.org/glossary/#C says
code point:
1) i in range(0x110000) <broad definition>
2) "A value, or position, for a character" <narrow definition>
(To muddy the waters more, 'character' has multiple definitions also.)
You are using 1), I am using 2) ;-(.
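To make definition 1) concrete, here is a small illustrative sketch (using chr() and unicodedata purely for demonstration, not part of any proposal):

```python
import unicodedata

# Broad definition: every int in range(0x110000) names a code point,
# whether or not a character is assigned to it.
assert chr(0x10FFFF) == '\U0010FFFF'   # last code point in the codespace
try:
    chr(0x110000)                      # one past the codespace
except ValueError:
    pass                               # not a code point at all

# U+FFFF is a code point under 1) but will never be a character:
assert unicodedata.category('\uffff') == 'Cn'   # "Other, not assigned"
```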
> > Any narrow build string with even 1 non-BMP char violates the
> > standard.
>
> Yup. That's by design.
[...]
> Sure. Nevertheless, practicality beat purity long ago, and that
> decision has never been rescinded AFAIK.
I think you have it backwards. I see the current situation as the purity
of the C code beating the practicality, for the user, of getting right
answers.
> The thing is, that 90% of applications are not really going to care
> about full conformance to the Unicode standard.
I remember when Intel argued that 99% of applications were not going to
be affected when the math coprocessor in its then-new chips occasionally
gave 'non-standard' answers with certain divisors.
> > Currently, the meaning of Python code differs on narrow versus wide
> > build, and in a way that few users would expect or want.
>
> Let them become developers, then, and show us how to do it better.
I posted a proposal with a link to a prototype implementation in Python.
It pretty well solves the problem of narrow builds acting differently
from wide builds with respect to the basic operations of len(),
iteration, indexing, and slicing.
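For illustration, the discrepancy can be modeled with explicit UTF-16 code units, which is what a narrow build stored internally (a sketch, not the prototype itself):

```python
s = '\U0001F600'                 # one non-BMP character

# On a wide build, len() gives the character (code point) count:
assert len(s) == 1

# A narrow build stored this as a surrogate pair, so len() reported 2.
# The pair is visible in the UTF-16 encoding:
units = s.encode('utf-16-le')
assert len(units) // 2 == 2      # two 16-bit code units, one character
```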
> No, I do like the PEP. However, it is only a step, a rather
> conservative one in some ways, toward conformance to the Unicode
> character model. In particular, it does nothing to resolve the fact
> that len() will give different answers for character count depending
> on normalization, and that slicing and indexing will allow you to cut
> characters in half (even in NFC, since not all composed characters
> have fully composed forms).
I believe my scheme could be extended to solve that also. It would
require more pre-processing and more knowledge than I currently have of
normalization. I have the impression that the grapheme problem goes
further than just normalization.
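A small sketch of the normalization point, using the standard library's unicodedata module:

```python
import unicodedata

# The same user-perceived character in two normalization forms:
nfc = unicodedata.normalize('NFC', 'e\u0301')   # composes to U+00E9
nfd = unicodedata.normalize('NFD', '\u00e9')    # decomposes to e + accent
assert len(nfc) == 1
assert len(nfd) == 2

# Slicing the NFD form cuts the character in half:
assert nfd[:1] == 'e'            # dangling base letter, accent lost

# Even NFC cannot always help: q + combining dot below has no
# precomposed form, so the grapheme stays two code points:
assert len(unicodedata.normalize('NFC', 'q\u0323')) == 2
```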
--
Terry Jan Reedy