How do I display unicode value stored in a string variable using ord()

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Aug 19 02:09:49 EDT 2012


This is a long post. If you don't feel like reading an essay, skip to the 
very bottom and read my last few paragraphs, starting with "To recap".


On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP
>> characters using two code points. This is fragile and doesn't work very
>> well, because string-handling methods can break the surrogate pairs
>> apart, leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
> 
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more?  Switching between encodings based on the string contents seems
> silly at first glance.  

Forget encodings! We're not talking about encodings. Encodings are used 
for converting text to bytes for transmission over the wire or storage on 
disk. PEP 393 talks about the internal representation of text within 
Python, the C-level data structure.
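To make the distinction concrete, here's a minimal sketch: encoding is 
what happens when you cross the text/bytes boundary, and it says nothing 
about how the str object is laid out in memory:

py> text = "len"              # a str: its layout is Python's own business
py> text.encode('utf-8')      # encoding: text -> bytes, for the wire or disk
b'len'
py> b'len'.decode('utf-8')    # decoding: bytes -> text, coming back in
'len'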

In 3.2, that data structure depends on a compile-time switch. In a 
"narrow build", text is stored using two bytes per character, so the 
string "len" (as in the name of the built-in function) will be stored as 

006c 0065 006e

(or possibly 6c00 6500 6e00, depending on whether your system is 
little-endian or big-endian), plus object overhead, which I shall ignore.
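You can see those byte pairs from pure Python by encoding with the 
UTF-16-LE codec, which (for BMP-only text) happens to match the narrow 
build's in-memory layout:

py> 'len'.encode('utf-16-le')   # two bytes per character, low byte first
b'l\x00e\x00n\x00'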

Since most identifiers are ASCII, that's already using twice as much 
memory as needed. This standard data structure is called UCS-2, and it 
only handles characters in the Basic Multilingual Plane, the BMP (the 
first 65536 Unicode code points). I'll come back to that.

In a "wide build", text is stored as four-bytes per character, so "len" 
is stored as either:

0000006c 00000065 0000006e
6c000000 65000000 6e000000

Now memory is cheap, but it's not *that* cheap, and no matter how much 
memory you have, you can always use more.

This system is called UCS-4, and it can handle the entire Unicode 
character set, for now and forever. (If we ever need more than four 
bytes' worth of characters, it won't be called Unicode.)
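Again, you can peek at that layout from Python by encoding, this time 
with the UTF-32-LE codec:

py> 'len'.encode('utf-32-le')   # four bytes per character, mostly zeroes
b'l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00'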

Remember I said that UCS-2 can only handle the 64K characters 
[technically: code points] in the Basic Multilingual Plane? There's an 
extension to UCS-2 called UTF-16 which extends it to the entire Unicode 
range. Yes, that's the same name as the UTF-16 encoding, because it's 
more or less the same system.

UTF-16 says "let's represent characters in the BMP by two bytes, but 
characters outside the BMP by four bytes." There's a neat trick to this: 
the BMP doesn't use the entire two-byte range, so there are some byte 
pairs which are illegal in UCS-2 -- they don't correspond to *any* 
character. UTF-16 uses those byte pairs to signal "this is half a 
character, you need to look at the next pair for the rest of the 
character".

Nifty hey? These pairs of pseudo-characters are called "surrogate pairs".
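The arithmetic behind the trick is simple enough to do by hand: subtract 
0x10000 from the code point, leaving a 20-bit number, then split it 
across the two reserved ranges (0xD800-0xDBFF for the high half, 
0xDC00-0xDFFF for the low half):

py> cp = 0x10400                     # a code point outside the BMP
py> offset = cp - 0x10000            # 20-bit offset into the supplementary planes
py> hex(0xD800 + (offset >> 10))     # high (lead) surrogate: top 10 bits
'0xd801'
py> hex(0xDC00 + (offset & 0x3FF))   # low (trail) surrogate: bottom 10 bits
'0xdd00'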

Except this comes at a big cost: you can no longer tell how long a 
string is just by counting its bytes, which is fast, because sometimes 
four bytes is two characters and sometimes it's one, and you can't tell 
which it will be until you actually inspect all four bytes.

Copying sub-strings now becomes either slow, or buggy. Say you want to 
grab the 10th character in a string. The fast way using UCS-2 is to 
simply grab bytes 18 and 19 (remember each character is a pair of bytes 
and we start counting at zero) and you're done. Fast and safe if you're 
willing to give up the non-BMP characters.
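Here's that constant-time lookup, faked with a bytes object standing in 
for the narrow build's internal buffer:

py> raw = 'len'.encode('utf-16-le')   # stand-in for a narrow-build buffer
py> i = 1                             # want the character at index 1
py> raw[2*i : 2*i + 2]                # O(1): just bytes 2i and 2i+1
b'e\x00'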

It's also fast and safe if you use UCS-4, but then everything takes 
twice as much space, so you may end up spending so much time copying 
null bytes that you're slower anyway. Especially when your OS starts 
paging memory like mad.

But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 18 
and 19 are half of a surrogate pair, and you've now split the pair and 
ended up with an invalid string. That's exactly what a narrow build of 
Python 3.2 does: it fails to handle surrogate pairs properly:

py> s = chr(0xFFFF + 1)   # one character: U+10000, just past the BMP
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'


I've just split a single valid Unicode character into two invalid 
characters. Python 3.2 will (probably) mindlessly process those two non-
characters, and the only sign that I did something wrong is that my data 
is now junk.

Since any character might be half of a surrogate pair, you have to scan 
every pair of bytes in order to index a string, or work out its length, 
or copy a substring. It's not enough to just check if the last pair is a 
surrogate.
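For instance, here's roughly what a *correct* length count has to do on 
a narrow build: walk the whole string, skipping the trailing half of 
each pair (on builds without surrogates the same answer falls out 
harmlessly):

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> sum(1 for c in s if not 0xDC00 <= ord(c) <= 0xDFFF)  # skip trail surrogates
10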

Skip that scan, and you get bugs like this from a narrow build of 
Python 3.2:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)

Which is now fixed in Python 3.3.
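Run the same session under 3.3 and you get what you'd expect:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
True
py> s[9], len(s)
('9', 10)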

So variable-width data structures like UTF-8 or UTF-16 are crap for the 
internal representation of strings -- they are either fast or correct but 
cannot be both.

But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4 
is too, because ASCII-only strings like identifiers end up being four 
times as big as they need to be. 1-byte schemes like Latin-1 are 
unspeakable because they only handle 256 characters, fewer if you don't 
count the C0 and C1 control codes.
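(For the record, U+0100 is the first code point Latin-1 can't represent:

py> '\u0100'.encode('latin-1')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in 
position 0: ordinal not in range(256)

and everything from there on up is out of reach.)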

PEP 393 to the rescue! What if you could encode pure-ASCII strings like 
"len" using one byte per character, and BMP strings using two bytes per 
character (UCS-2), and fall back to four bytes (UCS-4) only when you 
really need it?
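That's exactly what Python 3.3 does, and you can watch it make the per-
string choice with sys.getsizeof. (Exact figures vary with version and 
platform, so I'll leave the outputs as comments rather than gospel.)

py> import sys
py> sys.getsizeof('a' * 1000)           # pure ASCII: about 1 byte/char
py> sys.getsizeof('\u0101' * 1000)      # BMP, non-Latin-1: about 2 bytes/char
py> sys.getsizeof('\U00010000' * 1000)  # non-BMP: about 4 bytes/char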

The benefits are:

* Americans and English-Canadians and Australians and other barbarians of 
that ilk who only use ASCII save a heap of memory;

* people who mostly use non-BMP characters only pay the cost of four 
bytes per character for strings that actually *need* four bytes per 
character;

* people who use lots of non-BMP characters are no worse off.

The costs are:

* string routines need to be smarter -- they have to handle three 
different data structures (ASCII, UCS-2, UCS-4) instead of just one;

* there's a certain amount of overhead when creating a string -- you have 
to work out which in-memory format to use, and that's not necessarily 
trivial, but at least it's a once-off cost when you create the string;

* people who misunderstand what's going on get all upset over micro-
benchmarks.


> Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages.  I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.

To recap:

* Variable-width formats like UTF-8 and UTF-16 mean that basic string 
operations are not O(1) but O(N). That means they are slow, or buggy; 
pick one.

* Fixed width UCS-2 doesn't handle the full Unicode range, only the BMP. 
That's better than it sounds: the BMP supports most character sets, but 
not all. Still, there are people who need the supplementary planes, and 
UCS-2 lets them down.

* Fixed width UCS-4 does handle the full Unicode range, without 
surrogates, but at the cost of using 2-4 times more string memory for the 
vast majority of users.

* PEP 393 doesn't use variable-width characters, but variable-width 
strings. Instead of choosing between 1, 2 and 4 bytes per character, it 
chooses *per string*. This keeps basic string operations O(1) instead of 
O(N), saves memory where possible, while still supporting the full 
Unicode range without a compile-time option.



-- 
Steven


