[XML-SIG] Potential issue with re too (Was: Issues with Unicode type)

Eric van der Vlist vdv@dyomedea.com
23 Sep 2002 22:27:17 +0200


Still in the context of W3C XML Schema (WXS) datatypes and their facets,
there is a potential issue with regular expressions (needed for the
pattern facet):

>>> import re
>>> c = u'\u10800'  # assumed definition; not shown in the original session
>>> print c.__repr__()
u'\u10800'
>>> print re.findall(".", c)
[u'\u1080', u'0']
>>> print re.findall(c, c)
[u'\u10800']
>>> print re.findall(u'\u1080', c)
[u'\u1080']
>>> print re.findall(u'0', c)
[u'0']

The re module handles surrogates according to their dual nature: it
counts a pair as two characters (which is not what a pattern such as "."
or ".{2}" expects), yet it still matches the whole sequence u'\u10800'.
That does not seem like a safe basis for building a compliant type
library.
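
For comparison, here is a minimal sketch of the same checks on a current
Python 3 interpreter (3.3 or later is assumed, where str is indexed by
code point; the session above is Python 2). The variable names are
illustrative only. It separates two cases that are easy to conflate: the
escape \u reads exactly four hex digits, so '\u10800' is U+1080 followed
by '0', whereas '\U00010800' is the single supplementary character
U+10800, which needs a surrogate pair only in UTF-16.

```python
import re

# '\u' consumes exactly four hex digits: this is U+1080 followed by '0'.
bmp_plus_zero = '\u10800'
# '\U' consumes eight hex digits: this is the single code point U+10800.
astral = '\U00010800'

print(len(bmp_plus_zero), [hex(ord(ch)) for ch in bmp_plus_zero])
# 2 ['0x1080', '0x30']
print(len(astral), hex(ord(astral)))
# 1 0x10800

# In UTF-16 the supplementary character becomes the surrogate pair D802 DC00.
utf16 = astral.encode('utf-16-be')
print(utf16.hex())
# d802dc00

# "." matches one code point, so the two-character string yields two matches.
print(re.findall('.', bmp_plus_zero) == ['\u1080', '0'])
# True

# The supplementary character matches "." as a whole and does not match ".{2}".
print(re.fullmatch('.', astral) is not None)
# True
print(re.fullmatch('.{2}', astral) is None)
# True
```

On an interpreter whose strings are sequences of UTF-16 code units (such
as a narrow Python 2 build), the last two checks would be expected to
come out the other way round, which is the behaviour a pattern facet
implementation would have to work around.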

Eric
-- 
Rendez-vous à Paris.
                          http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------