[XML-SIG] Character encodings and expat

M.-A. Lemburg mal@lemburg.com
Mon, 30 Oct 2000 23:57:10 +0100


"Martin v. Loewis" wrote:
> 
> > But there is a private use area in the BMP as well... and if you
> > plan to write round-trip safe codecs for corporate character sets,
> > then you'll have to use these to make the transfer safe.
> 
> Well, you can't make round-trip encoding safe for them - that is the
> very nature of the private use area. If you convert set A to Unicode,
> using the private map, then convert to set B, and back from there, you
> likely lose.

True. By "round trip" I meant encoding A -> Unicode -> encoding A.
This is often needed in order to process the data, and it should be
a 1-1 mapping if possible.
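
To make the A -> Unicode -> A case concrete, here is a minimal sketch in
present-day Python. The "corporate" character set is entirely made up
for illustration: ASCII in the lower half, and 128 vendor-specific
bytes parked in the BMP private use area starting at U+E000. The names
DECODING_TABLE, decode_corp and encode_corp are hypothetical; only the
codecs.charmap_* helpers come from the standard library.

```python
import codecs

# Hypothetical single-byte corporate charset (an assumption, not a real one):
# bytes 0x00-0x7F are ASCII, bytes 0x80-0xFF have no standard Unicode
# equivalent and are mapped 1-1 into the private use area U+E000..U+E07F.
DECODING_TABLE = "".join(chr(b) for b in range(0x80)) + "".join(
    chr(0xE000 + i) for i in range(0x80)
)
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def decode_corp(data: bytes) -> str:
    """Decode bytes of the hypothetical charset to a Unicode string."""
    return codecs.charmap_decode(data, "strict", DECODING_TABLE)[0]


def encode_corp(text: str) -> bytes:
    """Encode a Unicode string back into the hypothetical charset."""
    return codecs.charmap_encode(text, "strict", ENCODING_TABLE)[0]


# Round trip over every possible byte value: A -> Unicode -> A.
original = bytes(range(256))
assert encode_corp(decode_corp(original)) == original
```

Because the table is a bijection, the round trip through the same map
is lossless. Nothing here helps once the data goes on to a different
set B, which is the case Martin describes above.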
 
> If there are "official" mappings between a corporation's character
> set and Unicode, then I'd expect all converters that support the
> corporate character set also to treat the private use area in the same
> way.
> 
> If there are no official mappings published by the corporation, then
> you are better off using the platform converters on the corporation's
> operating system. Those will definitely get the private use area
> right; the ones provided by Python in a cross-platform cross-vendor
> way might not.

Right.

Perhaps the codecs should warn about these conversions by applying
error handling to them (raise an exception, ignore, replace, etc.)?
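
A rough sketch of what that could look like, written as a standalone
filter rather than as part of any existing codec. The function name
handle_private_use and its behaviour are assumptions made up for this
example; it only borrows the familiar "strict" / "ignore" / "replace"
names and covers the BMP private use area (U+E000..U+F8FF).

```python
# BMP private use area boundaries.
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF


def handle_private_use(text: str, errors: str = "strict") -> str:
    """Apply an error policy to private use code points in `text`.

    Hypothetical helper, not an existing codec API:
      strict  -> raise ValueError at the first private use character
      ignore  -> drop private use characters
      replace -> substitute U+FFFD REPLACEMENT CHARACTER
    """
    if errors not in ("strict", "ignore", "replace"):
        raise LookupError("unknown error handling: %r" % errors)
    out = []
    for index, ch in enumerate(text):
        if PUA_FIRST <= ord(ch) <= PUA_LAST:
            if errors == "strict":
                raise ValueError(
                    "private use character U+%04X at position %d"
                    % (ord(ch), index)
                )
            if errors == "replace":
                out.append("\ufffd")
            # errors == "ignore": append nothing
            continue
        out.append(ch)
    return "".join(out)
```

For instance, handle_private_use("a\ue000b", "ignore") gives "ab", and
the same call with "replace" gives "a\ufffdb".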

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/