[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Fri Sep 14 06:52:45 CEST 2007


Greg Ewing writes:

 > Stephen J. Turnbull wrote:

 > > What should happen internally is that all undecodable characters
 > > (which PUA characters are by definition for standard codecs) are
 > > mapped to unused codepoints in the PUA, chosen by Python.
 > 
 > You mean chosen dynamically?

Yes.

 > What happens if these PUA characters get encoded some other way,

You can't win that in general, because Unicode is the only character
set that attempts to guarantee even the possibility of round-tripping.
The one case you can win is when the target uses the *same* character
set (which might be shared by multiple encodings); for that case we
record the character set and the code point.  That's the best we can
do in theory.
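
To make that concrete, here is a rough sketch of the bookkeeping,
not a proposal for the actual implementation.  The class
`PuaRegistry` and its methods are hypothetical names, and it
simplifies by treating each undecodable *byte* as the unit to map and
by ignoring genuine private-use characters already present in the
input.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP Private Use Area


class PuaRegistry:
    """Hypothetical table: PUA character <-> (charset name, original byte)."""

    def __init__(self):
        self._by_origin = {}  # (charset, byte value) -> PUA character
        self._by_char = {}    # PUA character -> (charset, byte value)
        self._next = PUA_START

    def allocate(self, charset, byte):
        """Return the PUA character for this origin, choosing one if new."""
        key = (charset, byte)
        if key not in self._by_origin:
            if self._next > PUA_END:
                raise RuntimeError("private use area exhausted")
            ch = chr(self._next)
            self._next += 1
            self._by_origin[key] = ch
            self._by_char[ch] = key
        return self._by_origin[key]

    def decode(self, data, charset):
        """Decode bytes, mapping each undecodable byte to a PUA character."""
        out = []
        pos = 0
        while pos < len(data):
            try:
                out.append(data[pos:].decode(charset))
                break
            except UnicodeDecodeError as err:
                # err.start is relative to the slice we tried to decode.
                out.append(data[pos:pos + err.start].decode(charset))
                out.append(self.allocate(charset, data[pos + err.start]))
                pos += err.start + 1
        return "".join(out)

    def encode(self, text, charset):
        """Encode text, restoring recorded bytes for the same charset only."""
        out = bytearray()
        for ch in text:
            origin = self._by_char.get(ch)
            if origin is None:
                out += ch.encode(charset)
            elif origin[0] == charset:
                out.append(origin[1])
            else:
                # Different character set: no round trip is possible.
                raise ValueError(
                    "U+%04X came from %r, cannot encode as %r"
                    % (ord(ch), origin[0], charset))
        return bytes(out)
```

With this sketch, `reg.encode(reg.decode(raw, "utf-8"), "utf-8")`
gives back `raw`, while encoding the same string to a different
character set fails instead of silently producing garbage.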

The main problem with this scheme that I know of is that if you have a
Python string that contains such a code point, you'll need to somehow
include the information about the original encoding when pickling and
the like.
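
A small sketch of what that could look like, assuming the origins
live in a plain dict from private-use character to (character set,
byte).  The helper names `dump_with_origins` and `load_with_origins`
are made up for illustration; nothing like them exists in the
standard library.

```python
import pickle


def dump_with_origins(text, origins):
    """Pickle the text together with the origins of the PUA characters it uses."""
    used = {ch: origins[ch] for ch in set(text) if ch in origins}
    return pickle.dumps((text, used))


def load_with_origins(blob):
    """Return (text, origins) as stored by dump_with_origins."""
    text, used = pickle.loads(blob)
    return text, used


# Hypothetical table built while decoding: U+E000 stood for byte 0xFF
# found in data that was supposed to be UTF-8.
origins = {"\ue000": ("utf-8", 0xFF), "\ue001": ("utf-8", 0xFE)}
text = "abc\ue000def"

# A bare pickle keeps the code point but not what it meant.
bare = pickle.loads(pickle.dumps(text))

# Carrying the relevant slice of the table along keeps the meaning.
restored_text, restored_origins = load_with_origins(
    dump_with_origins(text, origins))
```

Only the entries actually used by the string are stored, so the
dynamically chosen code points stay interpretable in another process
whose own table may have assigned them differently.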

