[XML-SIG] Re: Issues with Unicode type

Martin v. Loewis martin@v.loewis.de
26 Sep 2002 14:17:26 +0200


Eric van der Vlist <vdv@dyomedea.com> writes:

> OTOH, working on implementations of standards (or recs) without aiming
> for complete conformance is something which I consider dangerous, and
> I am reaching a point where Python doesn't look like an adequate platform
> to implement W3C XML Schema datatypes (and hardly an adequate platform
> to implement Relax NG) because of the lack of support for non-BMP code
> points.

Please understand that Python is free software. So if it does not fit
your needs, you can:
a) adjust your needs, or
b) adjust Python, or
c) not use Python.

Only with non-free software is b) not an option.

> The two issues which I am currently aware of are the length of the
> strings which can be solved by implementing an application level length
> algorithm and, more serious, the support of the regular expressions
> required for the "pattern" facet for which I don't see how we could rely
> on the Python regexp features which are buggy when compiled as ucs4 and
> will not produce the expected result when compiled as ucs2. 
> 
> Unless we rely on external C extensions such as the ones developed by
> Daniel for libxml, I just see no way to be "natively conform"!

I think this is a simplification: You can certainly implement the len
algorithm without regular expressions at all:

import sys

if sys.maxunicode == 65535:
  # narrow (UCS-2) build: non-BMP characters are stored as surrogate pairs
  def smart_len(s):
    l = 0
    for c in s:
      # skip high surrogates - only count the low surrogates
      if not 0xd800 <= ord(c) < 0xdc00:
        l += 1
    return l
else:
  smart_len = len
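
The same surrogate handling extends from counting to iterating. What
follows is only a sketch, not part of any library: code_points is a
hypothetical helper name, and passing unpaired surrogates through
unchanged is an assumption about the desired behaviour. It joins each
high/low surrogate pair into one integer code point, so the result is
the same whether the string stores a non-BMP character as one element
or as two.

```python
def code_points(s):
    """Yield integer code points, joining UTF-16 surrogate pairs."""
    pending = None  # high surrogate waiting for its low half
    for ch in s:
        cp = ord(ch)
        if pending is not None:
            if 0xDC00 <= cp < 0xE000:
                # valid pair: combine both halves into one code point
                yield 0x10000 + ((pending - 0xD800) << 10) + (cp - 0xDC00)
                pending = None
                continue
            # ASSUMPTION: an unpaired high surrogate is passed through as-is
            yield pending
            pending = None
        if 0xD800 <= cp < 0xDC00:
            pending = cp
        else:
            yield cp
    if pending is not None:
        yield pending
```

For well-formed strings, counting the values this yields should agree
with smart_len above; the two differ only on unpaired high surrogates,
which smart_len skips.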

The same applies to NCName: You do not *have* to use regular
expressions. Instead, build a dictionary:

NCName = {}
for char in all_ncname_chars:
  NCName[char] = 1

With that, you can test whether a character is allowed with
NCName.has_key(char), or with "char in NCName" on Python 2.2 and later.
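
A lookup table like this could be turned into a whole-name check along
the following lines. This is a rough sketch under loud assumptions: the
two character sets below are ASCII-only stand-ins, NOT the full XML
Letter/Digit/CombiningChar/Extender classes, and is_ncname,
NCNAME_START and NCNAME_CHARS are names made up for illustration. A
frozenset plays the role of the dictionary.

```python
import string

# ASSUMPTION: ASCII-only stand-ins for the real XML character classes.
# A conforming table would list every character the XML spec allows.
NCNAME_START = frozenset(string.ascii_letters + "_")
NCNAME_CHARS = NCNAME_START | frozenset(string.digits + ".-")

def is_ncname(s):
    """Return True if s looks like an NCName under the tables above."""
    # the first character has a stricter rule than the rest
    if not s or s[0] not in NCNAME_START:
        return False
    for ch in s[1:]:
        if ch not in NCNAME_CHARS:
            return False
    return True
```

The first character is checked against the smaller table because an
NCName may not start with a digit, "." or "-"; a colon is rejected
everywhere, which is what distinguishes an NCName from a Name.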

> Again, we can say that it won't matter for "real life applications" and
> that we don't care about conformance but that's a dangerous path.

My code shows that there is a fourth option, in addition to fixing
Python: 

d) work around the bugs and limitations

Python is Turing-complete, so there is no algorithmic problem that
cannot be solved in Python. So, saying that you cannot "natively
conform" is an oversimplification.

Regards,
Martin