[Python-Dev] Python and the Unicode Character Database

Wed Dec 1 00:19:30 CET 2010

On 11/30/2010 10:05 AM, Alexander Belopolsky wrote:

My general answers to the questions you have raised are as follows:

1. Each new feature release should use the latest version of the UCD as 
of the first beta release (or perhaps a week or so before). New chars 
are new features and the beta period can be used to (hopefully) iron out 
any bugs introduced by a new UCD version.

2. The language specification should not be UCD version specific. Martin
pointed out that the definition of identifiers was intentionally written
to not be, by referring to 'current version' or some such. On the other
hand, the UCD version used should be programmatically discoverable,
perhaps as an attribute of sys or str.
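
As a rough illustration of what "programmatically discoverable" can look
like: the standard unicodedata module already exposes the UCD version as
the string unidata_version. This is not the sys or str attribute suggested
above, and the helper has_ucd_at_least is a hypothetical name used only
for this sketch.

```python
import sys
import unicodedata

# unidata_version is a dotted version string for the bundled UCD.
ucd_version = unicodedata.unidata_version
ucd_tuple = tuple(int(part) for part in ucd_version.split("."))

print("Python", sys.version_info[:3], "uses UCD", ucd_version)


def has_ucd_at_least(major, minor=0):
    """Hypothetical helper: True if the bundled UCD is at least major.minor."""
    return ucd_tuple[:2] >= (major, minor)
```

Code that depends on recently added characters could call something like
has_ucd_at_least(6, 0) before relying on them.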

3. The UCD should not change in bugfix releases. New chars are new
features. Adding them in bugfix releases will introduce gratuitous
incompatibilities between releases. People who want the latest Unicode
should either upgrade to the latest Python version or patch an older
version (but not expect core support for any problems that creates).

> Given that 2.7 will be maintained for 5 years and arguably Unicode
> Consortium takes backward compatibility very seriously, wouldn't it
> make sense to consider a backport at some point?
>
> I am sure we will soon see a bug report that the following does not
> work in 2.7: :-)
> >>> ord('\N{CAT FACE WITH WRY SMILE}')
> 128572

3 (cont). 2.7 is no different in that regard. It is feature frozen just 
like all other x.y releases. And that is the answer to any such report. 
If that code became valid in 2.7.2, for instance, it would still not 
work in 2.7 and 2.7.1. Not working is not a bug; working is a new 
feature introduced after 2.7 was released.
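
For code that has to run on releases with different UCD versions, one
possible approach is to look the name up at run time instead of writing
the \N{...} escape, which is rejected at compile time when the name is
unknown. A minimal sketch, assuming Python 3.3 or later (so that a
non-BMP character is always a single code point); lookup_or_none is a
hypothetical helper name.

```python
import unicodedata


def lookup_or_none(name):
    """Return the character for a UCD name, or None if this UCD lacks it."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # unicodedata.lookup raises KeyError for names it does not know.
        return None


ch = lookup_or_none("CAT FACE WITH WRY SMILE")
if ch is None:
    print("name not in UCD", unicodedata.unidata_version)
else:
    print(ord(ch))
```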

>>> - How specific should library reference manual be in defining methods
>>> affected by UCD such as str.upper()?
>>
>> It should specify what this actually does in Unicode terminology
>> (probably in addition to a layman's rephrase of that)
>>
>
> I opened an issue for this:
>
> http://bugs.python.org/issue10587

1,2 (cont). Good idea in general.

> I was more concerned about wide and narrow unicode CPython builds. Is
> it a bug that '\UXXXXXXXX'.isalpha() may disagree even when the two
> implementations are based on the same version of UCD?

4. While the difference between narrow/wide builds of (CPython) x.y
(which should have one constant UCD) cannot be completely masked, I
appreciate and generally agree with your efforts to minimize them. In
some cases, there will be a conflict/tradeoff between eliminating one
difference and eliminating another.
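
A small sketch of how the narrow/wide difference can be observed from
Python code. It assumes that sys.maxunicode equal to 0xFFFF identifies a
narrow build, and uses U+10400 (DESERET CAPITAL LETTER LONG I) as an
arbitrary non-BMP letter; is_narrow_build is a hypothetical helper name.
On Python 3.3 and later the distinction no longer exists, so the narrow
branch is never taken there.

```python
import sys


def is_narrow_build():
    """True when str uses UTF-16 code units, as on a narrow build."""
    return sys.maxunicode == 0xFFFF


ch = "\U00010400"  # a letter outside the Basic Multilingual Plane

print("narrow build:", is_narrow_build())
# On a narrow build the character is stored as a surrogate pair, so its
# length is 2 and per-character predicates see the surrogates instead.
print("len:", len(ch))
print("isalpha:", ch.isalpha())
```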

-- 
Terry Jan Reedy