How do I display unicode value stored in a string variable using ord()

Paul Rubin no.email at nospam.invalid
Sat Aug 18 14:26:21 EDT 2012


Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters 
> using two code points. This is fragile and doesn't work very well, 
> because string-handling methods can break the surrogate pairs apart, 
> leaving you with an invalid Unicode string. Not good.)
...
> With PEP 393, each Python string will be stored in the most efficient 
> format possible:

Can you explain the issue of "breaking surrogate pairs apart" a little
more?  Switching between encodings based on the string contents seems
silly at first glance.  Strings are immutable, so I don't understand why
Python shouldn't just use UTF-8 or UTF-16 for everything.  UTF-8 is more
efficient for Latin-based alphabets, and UTF-16 may be more efficient
for some other languages.  I think even UCS-4 doesn't completely fix the
surrogate pair issue, if that issue is what I think it is.
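
To make the question concrete, here is a minimal sketch of what I take
"breaking a surrogate pair apart" to mean, plus the size trade-off
mentioned above.  It assumes Python 3 and simulates a UTF-16 string as a
list of 16-bit code units; the helper names utf16_units and
units_to_text are made up for illustration and are not part of any
library.

```python
# Sketch only: model a UTF-16 string as a list of 16-bit code units.
# utf16_units and units_to_text are hypothetical helper names.

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def units_to_text(units):
    """Decode a list of UTF-16 code units back into a str.

    Raises UnicodeDecodeError if the units contain a lone surrogate.
    """
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le")

# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
s = "a\U0001D11Eb"
units = utf16_units(s)   # [0x61, 0xD834, 0xDD1E, 0x62]: 3 characters, 4 units

# Slicing by code unit can cut between the two halves of the pair.
try:
    units_to_text(units[:2])      # 'a' plus a lone high surrogate
    split_ok = True
except UnicodeDecodeError:
    split_ok = False              # the slice is not valid UTF-16

# Size trade-off: bytes needed for the same text in each encoding.
latin = "hello"
cjk = "\u65e5\u672c\u8a9e"        # three BMP characters ("Japanese language")
sizes = {
    "latin": (len(latin.encode("utf-8")), len(latin.encode("utf-16-le"))),
    "cjk": (len(cjk.encode("utf-8")), len(cjk.encode("utf-16-le"))),
}
```

If that is the problem being described, it comes from indexing by code
unit rather than by character, which is why I wonder whether a fixed
choice of encoding is really the issue.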


