xsl and unicode surrogate characters

"Martin v. Löwis" martin at v.loewis.de
Wed Jan 4 21:09:02 EST 2006


Sakcee wrote:
> Hi
> 
> In one of the data files that I have , I am seeing these characters
> \xed\xa0\xa0 .  They seem to break the xsl.
[...]
> is this a unicode utf-16 surrogate pair ?

Yes and no. This is the UTF-8 encoding of U+D820, which is a high
surrogate code point. So yes. It's not yet a pair; there would have to
be a second such code point. So no.
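The arithmetic behind that can be checked by hand. A minimal sketch, assuming Python 3; the variable names are purely illustrative:

```python
# The three bytes from the data file.
data = b"\xed\xa0\xa0"

# A 3-byte UTF-8 sequence has the bit layout 1110xxxx 10yyyyyy 10zzzzzz.
b0, b1, b2 = data
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# High surrogates occupy U+D800..U+DBFF, low surrogates U+DC00..U+DFFF.
is_high_surrogate = 0xD800 <= code_point <= 0xDBFF
is_low_surrogate = 0xDC00 <= code_point <= 0xDFFF

print(hex(code_point), is_high_surrogate, is_low_surrogate)
```

So the bytes carry the bit pattern of a high surrogate, with no low surrogate following it.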

Furthermore, in UTF-8, you should never ever have encoded surrogate
code points; instead, whoever generated the UTF-8 should have combined
the two surrogate code points into a single coded character, and should
have encoded *that* character. So no - this byte sequence isn't
even valid UTF-8.
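To illustrate what "combining" means, here is a rough sketch in Python 3. The low surrogate U+DC00 is an assumption made up for the example, since the data shown contains only the high half; `combine_surrogates` is a hypothetical helper, not a library function:

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+D820 is from the data; U+DC00 is an assumed partner for illustration.
code_point = combine_surrogates(0xD820, 0xDC00)

# Encoding the combined character gives one 4-byte UTF-8 sequence.
encoded = chr(code_point).encode("utf-8")
print(hex(code_point), encoded)
```

A correct encoder emits that single 4-byte sequence, never two 3-byte sequences starting with \xed.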

> for displaying it on xml/xsl, should I extract only \xa0?

You should tell your parser to reject the file as ill-formed.
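A strict UTF-8 decoder does this on its own. As a sketch, assuming Python 3 (whose built-in UTF-8 codec refuses encoded surrogates); `is_valid_utf8` is a hypothetical helper name:

```python
def is_valid_utf8(data):
    """Return True if the bytes decode under the strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True

bad = b"<doc>\xed\xa0\xa0</doc>"   # contains the encoded surrogate
good = b"<doc>\xc3\xb6</doc>"      # contains a properly encoded U+00F6

print(is_valid_utf8(bad), is_valid_utf8(good))
```

Running such a check before handing the file to the XSL processor makes the failure show up where the bad data enters, rather than somewhere inside the transformation.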

> since this is higher than 00-7f range can i just strip it?

Depending on what you want to achieve: sure! It will modify
the meaning of the bytes, of course.
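To make the trade-off concrete, here is a small sketch assuming Python 3. The sample data is invented for the example. It compares dropping every byte above 0x7f with dropping only the undecodable bytes:

```python
# Invented sample: valid non-ASCII text, then the bad bytes, then ASCII.
data = "L\u00f6wis".encode("utf-8") + b"\xed\xa0\xa0" + b"!"

# Option 1: strip every byte outside 00-7f. This also destroys
# legitimate non-ASCII characters such as U+00F6.
ascii_only = bytes(b for b in data if b < 0x80)

# Option 2: decode leniently, discarding only bytes that are not
# part of a valid UTF-8 sequence.
lenient = data.decode("utf-8", errors="ignore")

print(ascii_only, lenient)
```

Either way, information is silently lost, so this only makes sense if the character the surrogate was meant to represent does not matter to you.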

> under what condition the encoding software put this string in?

If it has a bug.
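One plausible form of such a bug (an assumption on my part, not something the data proves) is an encoder that converts UTF-16 code units to UTF-8 one at a time instead of combining pairs first. A sketch in Python 3, where `buggy_encode` is a hypothetical function and the `surrogatepass` error handler is used only to reproduce the broken output:

```python
import struct

def buggy_encode(text):
    """Hypothetical buggy encoder: treats each UTF-16 code unit as a
    code point of its own, so surrogates are encoded individually."""
    utf16 = text.encode("utf-16-be")
    units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

ch = chr(0x18000)              # a character outside the BMP
broken = buggy_encode(ch)      # two 3-byte sequences, invalid UTF-8
correct = ch.encode("utf-8")   # one 4-byte sequence

print(broken, correct)
```

The broken output begins with exactly the \xed\xa0\xa0 seen in the data file.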

Regards,
Martin


