xsl and unicode surrogate characters

Diez B. Roggisch deets at nospam.web.de
Thu Jan 5 04:42:15 EST 2006


Sakcee wrote:

> thanks very much for the info, it really helped
> 
> we are using the text from the file to display on a webpage and we have a
> method for converting the parsed data to utf-8 and then displaying it. All
> the data looks fine after parsing except the
> surrogate pair.
> Since I cannot guess what it was supposed to be, is it ok to strip it
> using regex re.compile(' [\xed|\xa0] ')?

As Martin said: that alters the meaning of the bytes. Whether that should
bother you is yours to decide. If, for example, you stripped all vowels
from a text, it might still be comprehensible to most people, so if vowels
bother you for whatever reason, remove them.
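
To make that concrete, here is a small sketch (Python 3 syntax; the sample
string is just an illustration I picked, not data from your file, and I left
out the spaces around the brackets). Note that [\xed|\xa0] is a character
class, so it matches any single 0xED, '|' or 0xA0 byte, and those bytes also
occur inside perfectly valid UTF-8 characters:

```python
import re

# The character class from the question, without the surrounding spaces.
# It matches ONE byte that is 0xED, a literal '|', or 0xA0.
pattern = re.compile(b'[\xed|\xa0]')

# U+00E0 (a with grave accent) is encoded as C3 A0 in UTF-8,
# so this perfectly valid text contains an 0xA0 byte.
original = 'voil\u00e0 | done'.encode('utf-8')
stripped = pattern.sub(b'', original)

print(stripped)  # the 0xA0 byte and the '|' are gone

try:
    stripped.decode('utf-8')
    still_valid = True
except UnicodeDecodeError as exc:
    # The orphaned 0xC3 lead byte is no longer valid UTF-8.
    still_valid = False
    print('broken:', exc)
```

So the substitution does not only hit the surrogates, it can also damage
text that was fine before.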

Bt myb y bttr try nd fx th prblm n th frst plc.
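
One possible way to do that, as a rough sketch only: I am assuming here that
the odd bytes are the two halves of a surrogate pair that were each encoded
as if they were separate characters (which is what the \xed\xa0 bytes
suggest, but I have not seen your file). The helper name
repair_surrogate_pairs is made up for this example, and it relies on the
'surrogatepass' error handler of Python 3:

```python
def repair_surrogate_pairs(raw: bytes) -> str:
    """Decode UTF-8 that may contain individually encoded surrogates.

    Hypothetical helper: the name and the approach are assumptions for
    illustration, not something taken from the original data or thread.
    """
    # 'surrogatepass' lets the decoder accept ED A0..BF xx sequences and
    # turn them into lone surrogate code points instead of raising.
    text = raw.decode('utf-8', 'surrogatepass')
    # A round trip through UTF-16 joins each high/low surrogate pair into
    # the single character it stands for.
    return text.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')


# U+1F600 written as two separately encoded surrogates (D83D, DE00).
sample = b'ok \xed\xa0\xbd\xed\xb8\x80'
fixed = repair_surrogate_pairs(sample)

print(fixed == 'ok \U0001F600')
print(fixed.encode('utf-8'))  # now a regular 4-byte UTF-8 sequence
```

If a surrogate has no partner, the final decode raises a UnicodeDecodeError,
so for truly broken input you still have to decide what to do with it.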

Regards,

Diez


