xsl and unicode surrogate characters
"Martin v. Löwis"
martin at v.loewis.de
Wed Jan 4 21:09:02 EST 2006
Sakcee wrote:
> Hi
>
> In one of the data files that I have , I am seeing these characters
> \xed\xa0\xa0 . They seem to break the xsl.
[...]
> is this a unicode utf-16 surrogate pair ?
Yes and no. This is the UTF-8 encoding of U+D820, which is a high
surrogate code point. So yes. It's not yet a pair; there would have to
be a second such code point. So no.
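The arithmetic can be checked by hand. A minimal sketch in Python 3, where the helper names decode_three_byte, is_high_surrogate and is_low_surrogate are made up for illustration:

```python
def decode_three_byte(seq: bytes) -> int:
    """Extract the code point from a three-byte UTF-8-style sequence.

    Only the bit layout 1110xxxx 10xxxxxx 10xxxxxx is checked; the
    function deliberately does not reject surrogates, so the raw value
    can be inspected.
    """
    b0, b1, b2 = seq
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a three-byte sequence")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_high_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDBFF


def is_low_surrogate(cp: int) -> bool:
    return 0xDC00 <= cp <= 0xDFFF


cp = decode_three_byte(b"\xed\xa0\xa0")
print(f"U+{cp:04X}", is_high_surrogate(cp), is_low_surrogate(cp))
```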
Furthermore, in UTF-8, you should never ever have encoded surrogate
codes; instead, whoever generated the UTF-8 should have combined the
two surrogate code points into a single coded character, and should
have encoded *that* character. So no - this byte sequence isn't
even valid UTF-8.
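As a sketch of what the generator should have done: combine the high and low surrogate into one code point, then encode that. The data in question only shows the high surrogate U+D820, so the low surrogate U+DC00 used here is an assumption picked purely for illustration, as is the helper name combine_surrogates.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+DC00 is a made-up partner for the U+D820 seen in the data.
code_point = combine_surrogates(0xD820, 0xDC00)
encoded = chr(code_point).encode("utf-8")
print(f"U+{code_point:X}", encoded)  # a single four-byte sequence
```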
> for displaying it on xml/xsl, should I extract only \xa0?
You should tell your parser to reject the file as ill-formed.
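If the parser in use cannot be relied on to do this, the input can be checked before it gets there. A minimal sketch, assuming Python 3, whose strict UTF-8 decoder refuses encoded surrogates; require_valid_utf8 is a hypothetical helper, not part of any XML library:

```python
def require_valid_utf8(data: bytes) -> str:
    """Return the decoded text, or raise ValueError for ill-formed UTF-8."""
    try:
        return data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"ill-formed UTF-8 at byte offset {exc.start}"
        ) from exc


try:
    require_valid_utf8(b"<a>\xed\xa0\xa0</a>")
except ValueError as err:
    print("rejected:", err)
```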
> since this is hingher than 00-7f range can i just strip it?
Depending on what you want to achieve: sure! It will modify
the meaning of the bytes, of course.
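If stripping is what you want, one possible sketch is to remove only the three-byte sequences that encode surrogates (ED A0..BF 80..BF) and leave every other byte alone. The function name strip_encoded_surrogates is made up, and the character that was meant to be there is lost.

```python
import re

# ED followed by A0..BF and one continuation byte encodes U+D800..U+DFFF.
ENCODED_SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def strip_encoded_surrogates(data: bytes) -> bytes:
    """Drop encoded surrogates; all other bytes pass through unchanged."""
    return ENCODED_SURROGATE.sub(b"", data)


print(strip_encoded_surrogates(b"a\xed\xa0\xa0b"))
# Valid characters that also start with ED (U+D000..U+D7FF) are untouched.
print(strip_encoded_surrogates("\ud000".encode("utf-8")))
```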
> under what condition the encoding software put this string in?
If it has a bug.
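One plausible form of such a bug, offered only as a guess about what might have happened here: the software holds the text as UTF-16 and encodes each 16-bit code unit separately instead of combining pairs first. A sketch of that faulty behaviour in Python 3, with the hypothetical name buggy_encode:

```python
def buggy_encode(text: str) -> bytes:
    """Imitate an encoder that treats each UTF-16 code unit as a character."""
    units = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # "surrogatepass" lets Python emit the otherwise forbidden bytes.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# U+18000 is an arbitrary character outside the BMP, chosen for illustration.
print(buggy_encode(chr(0x18000)))           # six bytes, starts with ED A0 A0
print(chr(0x18000).encode("utf-8"))         # the correct four bytes
```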
Regards,
Martin