[ python-Bugs-1054943 ] Python may contain NFC/NFKC bug per Unicode PRI #29

SourceForge.net noreply at sourceforge.net
Tue Mar 15 09:59:59 CET 2005


Bugs item #1054943, was opened at 2004-10-27 01:58
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1054943&group_id=5470

Category: Unicode
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Rick McGowan (rick_mcgowan)
Assigned to: Martin v. Löwis (loewis)
Summary: Python may contain NFC/NFKC bug per Unicode PRI #29

Initial Comment:
The Unicode Technical Committee posted Public Review
Issue #29, describing a bug in the documentation of NFC
and NFKC in the text of UAX #15 Unicode Normalization
Forms. I have examined unicodedata.c in the Python
implementation (2.3.4) and it appears the
implementation of normalization in Python 2.3.4 may
have the bug therein described. Please see the
description of the bug and the textual fix that is
being made to UAX #15, at the URL:
http://www.unicode.org/review/pr-29.html
The bug is in the definition of rule D2, affecting the
characters "blocked" during re-composition.

You may contact me by e-mail, or fill out the
Unicode.org error reporting form if you have any
questions or concerns.

Since Python uses Unicode internally, it may also be
wise to have someone from the Python development
community on the Unicode Consortium's notification list
to receive immediate notifications of public review
issues, bugs, and other announcements affecting
implementation of the standard.


----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2005-03-15 09:59

Message:
Logged In: YES 
user_id=21627

Is it true that the most recent interpretation of this PR
suggests that the correction should only apply to Unicode
4.1? If so, I think Python should abstain from adopting the
change right now, and should defer that to the point when
the Unicode 4.1 database is incorporated.

----------------------------------------------------------------------

Comment By: Rick McGowan (rick_mcgowan)
Date: 2004-10-27 22:11

Message:
Logged In: YES 
user_id=1146994

Thanks, all, for the quick replies. My initial thoughts on a
fix are below. The relevant piece of code seems to be in the
function "nfc_nfkc()" in the file unicodedata.c:

>           if (comb1 && comb == comb1) { 
>               /* Character is blocked. */ 
>               i1++; 
>               continue; 
>           } 

That should possibly be changed to: 

>           if (comb1 && (comb <= comb1)) { 
>               /* Character is blocked. */ 
>               i1++; 
>               continue; 
>           } 

because the new spec says "either B is a starter or it has
the same or higher combining class as C".
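
To make the quoted wording concrete, here is a minimal Python 3 sketch of
the "blocked" test as that sentence states it. It is an illustration only,
not the code in unicodedata.c: the names is_blocked and between are
hypothetical, and the sketch works on whole strings rather than on the
comb/comb1 loop variables used in nfc_nfkc().

```python
import unicodedata


def is_blocked(between, c):
    """Return True if c is blocked from the starter that precedes `between`.

    `between` holds the characters sitting between the starter A and the
    candidate C (hypothetical helper, for illustration only). Following the
    corrected wording, C is blocked if some B in between is a starter
    (combining class 0) or has the same or a higher combining class than C.
    """
    ccc_c = unicodedata.combining(c)
    return any(
        unicodedata.combining(b) == 0 or unicodedata.combining(b) >= ccc_c
        for b in between
    )


# U+0300 (combining class 230) sits between the starter and U+0B3E (class 0).
print(is_blocked("\u0300", "\u0B3E"))
# U+0323 (class 220) does not block U+0301 (class 230).
print(is_blocked("\u0323", "\u0301"))
```

Note that the candidate C may itself be a starter, as U+0B3E is here, so a
test that first requires C to have a non-zero combining class would not
cover this case.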


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-10-27 20:11

Message:
Logged In: YES 
user_id=38388

Thanks for submitting a bug report. The problem does indeed
occur in the Python normalization code:

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0B47\u0300\u0B3E')
u'\u0b4b\u0300'

I think the following lines in unicodedata.c need to be changed:

          if (comb1 && comb == comb1) {
              /* Character is blocked. */
              i1++;
              continue;
          }

to

          if (comb && (comb1 == 0 || comb == comb1)) {
              /* Character is blocked. */
              i1++;
              continue;
          }

Martin, what do you think?
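
The reproduction above uses Python 2 syntax. A minimal Python 3 sketch of
the same check follows, assuming an interpreter whose unicodedata module
already follows the corrected definition. Under that definition U+0B3E is
blocked by the intervening U+0300, so the sequence is expected to come back
unchanged instead of being recomposed to U+0B4B U+0300.

```python
import unicodedata

# Sequence from the report: U+0B47, then U+0300, then U+0B3E.
seq = "\u0B47\u0300\u0B3E"
result = unicodedata.normalize("NFC", seq)

# Show the code points of the normalized string.
print(["U+%04X" % ord(ch) for ch in result])

# Recomposition across U+0300 would indicate the behaviour described in the report.
print("unchanged" if result == seq else "recomposed across U+0300")
```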

----------------------------------------------------------------------
