How do I display unicode value stored in a string variable using ord()

MRAB python at mrabarnett.plus.com
Sat Aug 18 14:59:32 EDT 2012


On 18/08/2012 19:26, Paul Rubin wrote:
> Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
>> using two code points. This is fragile and doesn't work very well,
>> because string-handling methods can break the surrogate pairs apart,
>> leaving you with an invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
>
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more?  Switching between encodings based on the string contents seems
> silly at first glance.  Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages.  I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.
>
On a narrow build, codepoints outside the BMP are stored as a surrogate
pair (2 codepoints), so one such character occupies two positions in the
string and len() counts it as 2. On a wide build, all codepoints can be
represented without the need for surrogate pairs.
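
To make the pairing concrete, here is a minimal sketch of how a codepoint
outside the BMP maps onto its two surrogates under UTF-16. The helper name
to_surrogate_pair is made up for this illustration; it is not part of the
standard library.

```python
def to_surrogate_pair(codepoint):
    """Split a non-BMP codepoint into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints outside the BMP need a surrogate pair")
    offset = codepoint - 0x10000          # 20 bits remain after removing the BMP range
    high = 0xD800 + (offset >> 10)        # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits go into the low surrogate
    return high, low


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```

The result should agree with what the built-in "utf-16-be" codec produces
for the same character.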

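The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of the surrogate pair,
leaving a lone surrogate at the end of one piece and at the start of the
other. Neither piece is then a valid unicode string.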
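
A rough sketch of the slicing problem follows. Python 3.3 and later no
longer have narrow builds, so this simulates narrow-build storage by
working on a list of UTF-16 code units. The helper names utf16_units and
decodes_cleanly are invented for this example.

```python
import struct


def utf16_units(s):
    """Return the 16-bit UTF-16 code units of s, as a narrow build would store them."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def decodes_cleanly(units):
    """Report whether a sequence of code units is well-formed UTF-16."""
    data = struct.pack("<%dH" % len(units), *units)
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True


text = "a\U0001F600b"              # 'a', one non-BMP character, 'b'
units = utf16_units(text)          # four code units, not three

print(len(text), len(units))
print(decodes_cleanly(units[:3]))  # slice ends after the whole pair
print(decodes_cleanly(units[:2]))  # slice ends between the two surrogates
print(decodes_cleanly(units[2:]))  # slice starts with the orphaned low surrogate
```

Slicing at index 2 is what an innocent text[:2] would do on a narrow
build, and both halves come out malformed.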


