Unicode regular expressions -- buggy?

Christopher Subich spam.csubich+block at block.subich.spam.com
Thu Aug 11 02:52:46 EDT 2005


I don't think Python's regular expression module handles combining marks 
correctly; the same pattern gives inconsistent results on canonically 
equivalent forms of the same string:

 >>> import re, sys, unicodedata
 >>> sys.version
'2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]'
 >>> re.match('\w', unicodedata.normalize('NFD', u'\xf1'), re.UNICODE).group(0)
u'n'
 >>> re.match('\w', unicodedata.normalize('NFC', u'\xf1'), re.UNICODE).group(0)
u'\xf1'

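In the above example, u'\xf1' is n-with-tilde (ñ).  NFC happens to be a 
no-op because the character is already precomposed, and NFD decomposes it 
into u'n\u0303', which splits out the tilde as a combining mark.  In the 
NFD case, \w matches only the base letter and leaves the tilde behind.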
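
For comparison, here is a rough sketch of one possible workaround, written 
for Python 3 rather than the 2.4 interpreter shown above (that choice is an 
assumption on my part).  The helper name leading_word_char is hypothetical 
and not part of the re module.  It takes whatever \w matches and extends it 
over any following characters that unicodedata.combining() reports as having 
a nonzero combining class, so both normalization forms yield the whole 
letter.  Combining marks with class 0 would not be picked up, so treat this 
as an illustration rather than a complete grapheme matcher.

```python
import re
import unicodedata


def leading_word_char(text):
    # Hypothetical helper: return the first \w match at the start of
    # `text`, extended over any combining marks that follow it.
    match = re.match(r'\w', text)
    if match is None:
        return None
    end = match.end()
    # unicodedata.combining() returns the canonical combining class,
    # which is nonzero for marks such as U+0303 COMBINING TILDE.
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    return text[:end]


nfc = unicodedata.normalize('NFC', '\xf1')  # precomposed, one code point
nfd = unicodedata.normalize('NFD', '\xf1')  # 'n' followed by U+0303

# Plain \w stops after the base letter in the decomposed form.
plain_nfd = re.match(r'\w', nfd).group(0)

# The helper returns the base letter together with its tilde.
full_nfd = leading_word_char(nfd)
full_nfc = leading_word_char(nfc)
```

Normalizing the input to NFC before matching would also work for this 
particular character, but only because a precomposed form of it exists.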

Is this a limitation-by-design, or a bug?  If the latter, is it already 
known/to-be-fixed?


