Unicode regular expressions -- buggy?
Christopher Subich
spam.csubich+block at block.subich.spam.com
Thu Aug 11 02:52:46 EDT 2005
I don't think the Python regular expression module (re) handles
combining marks correctly; the same pattern gives inconsistent results
for canonically equivalent forms of the same input string:
>>> import re, sys, unicodedata
>>> sys.version
'2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]'
>>> re.match(r'\w', unicodedata.normalize('NFD', u'\xf1'), re.UNICODE).group(0)
u'n'
>>> re.match(r'\w', unicodedata.normalize('NFC', u'\xf1'), re.UNICODE).group(0)
u'\xf1'
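In the above example, u'\xf1' is n-with-tilde (ñ). NFC is a no-op here
because the character is already precomposed, while NFD decomposes it
into u'n\u0303', splitting out the tilde as a combining mark (U+0303).
As a result, \w matches only the bare u'n' in the NFD case.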
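
To make the decomposition concrete, here is a minimal sketch that lists the
code points in each normalized form. It assumes a modern Python 3 interpreter
(not the 2.4.1 shown above), so the u'' prefixes are dropped; the variable
names are only illustrative.

```python
import unicodedata

composed = "\xf1"  # n-with-tilde as a single precomposed code point
nfc = unicodedata.normalize("NFC", composed)
nfd = unicodedata.normalize("NFD", composed)

for label, text in (("NFC", nfc), ("NFD", nfd)):
    print(label, "has", len(text), "code point(s):")
    for ch in text:
        # category "Mn" marks a nonspacing (combining) mark
        print("  U+%04X %s (%s)" % (ord(ch), unicodedata.name(ch),
                                    unicodedata.category(ch)))
```

The NFD form should show LATIN SMALL LETTER N followed by COMBINING TILDE,
the latter in category Mn rather than one of the letter categories.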
Is this a limitation-by-design, or a bug? If the latter, is it already
known/to-be-fixed?
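
Whatever the answer, one possible workaround is to extend a \w match over
any combining marks that immediately follow it. The sketch below assumes
Python 3, and first_word_char is a hypothetical helper name, not part of
the re module. Normalizing the input to NFC before matching would be
another option, though it only helps where a precomposed form exists.

```python
import re
import unicodedata

def first_word_char(text):
    """Return the leading \\w character plus any combining marks after it.

    Hypothetical helper: returns None if text does not start with \\w.
    """
    m = re.match(r"\w", text)
    if m is None:
        return None
    end = m.end()
    # General categories Mn, Mc and Me all denote combining marks.
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return text[:end]

decomposed = unicodedata.normalize("NFD", "\xf1")
print(ascii(first_word_char(decomposed)))
print(ascii(first_word_char("\xf1")))
```

With this helper, both the NFD and NFC inputs yield the full
n-with-tilde, in whichever form the input used.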