[XML-SIG] Re: Issues with Unicode type

23 Sep 2002 22:50:34 +0200

Hi Daniel,

On Mon, 2002-09-23 at 22:35, Daniel Veillard wrote:
> On Mon, Sep 23, 2002 at 07:21:41PM +0200, Eric van der Vlist wrote:
> > Yep, and that's what James Clark is doing in his Java implementation:
> >
> >   public int getLength(Object obj) {
> >     String str = (String)obj;
> >     int len = str.length();
> >     int nSurrogatePairs = 0;
> >     for (int i = 0; i < len; i++)
> >       if (Utf16.isSurrogate1(str.charAt(i)))
> >         nSurrogatePairs++;
> >     return len - nSurrogatePairs;
> >   }
> >
> > And I need to do the same in Python...
>
>   yep, that simple,

Except that it's not the only place where it's broken, and that fix won't
work with regular expressions. If I define a pattern such as ".{5}", I
want to check that this is 5 Unicode characters, not 5 words of 16
bits...
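
To make the two problems concrete, here is a rough Python sketch, not a
definitive fix. It assumes the string is seen as UTF-16 code units (the
explicit encoding step is only there to expose them on any build), and the
names utf16_units, code_point_length, ONE_CHAR and is_five_chars are
hypothetical helpers of mine, not part of any library. The length part
follows the Java code quoted above as I read it; the pattern part spells
out "." by hand so that a surrogate pair counts as one character.

```python
import re
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def code_point_length(text):
    """Length in characters: every surrogate pair counts once."""
    units = utf16_units(text)
    # Count leading (high) surrogates, U+D800..U+DBFF, and subtract them.
    pairs = sum(1 for unit in units if 0xD800 <= unit <= 0xDBFF)
    return len(units) - pairs

# One character: a surrogate pair, or any single non-newline unit.
ONE_CHAR = u"(?:[\ud800-\udbff][\udc00-\udfff]|.)"

def is_five_chars(text):
    """Surrogate-aware stand-in for the pattern ".{5}" on a whole string."""
    return re.match(ONE_CHAR + r"{5}\Z", text) is not None
```

With u"ab\U00010800cd" the first helper sees six 16-bit units while the
second reports five characters, and is_five_chars accepts the string.
Rewriting every "." in every pattern this way is exactly the kind of
workaround I would rather avoid.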

I am starting to think that compiling Python with 32-bit Unicode
characters might be the safest way to solve this issue.
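
A quick way to see which kind of build is running is sys.maxunicode. The
sketch below is only an illustration; is_wide_build and
length_of_test_char are hypothetical names, and U+10800 is simply the
value from the test case discussed in this thread.

```python
import sys

def is_wide_build():
    """True when strings are indexed by code point, not by 16-bit unit."""
    # sys.maxunicode is 0xFFFF on a 16-bit build and 0x10FFFF otherwise.
    return sys.maxunicode > 0xFFFF

def length_of_test_char():
    """len() of U+10800: 1 on a wide build, 2 (a surrogate pair) otherwise."""
    return len(u"\U00010800")
```

On a wide build len() and "." already work per character, so neither the
length workaround nor pattern rewriting should be needed there.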

Can you confirm that this is what Red Hat does by default, as Uche
mentioned, and do you know the motivations (and possible downsides) for
this decision?
>
> > > Notice also that U+10800 is unassigned even in Unicode 3.2.
> >
> > I wonder why he has picked this value!
>
>   Because he knew this was well formed and that was in a range where
> this could give troubles to Java (and now Python) implementations
> I bet :-)

Yes, the values in his test cases are usually chosen with care and I was
expecting something like that!

Thanks

Eric

-- 
Rendez-vous à Paris.
                          http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------