How do I display unicode value stored in a string variable using ord()

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Aug 19 04:01:46 EDT 2012


On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:

> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal.  It gets more expensive if you
> want to index far more deeply into the string.  I'm asking how often
> that is done in real code.

It happens all the time.

Let's say you've got a bunch of text, and you use a regex to scan through 
it looking for a match. Let's ignore the regular expression engine, since 
it has to look at every character anyway. But you've done your search and 
found your matching text and now want everything *after* it. That's not 
exactly an unusual use-case.

import re

mo = re.search(pattern, text)
if mo:
    start, end = mo.span()
    result = text[end:]   # everything after the match


Easy-peasy, right? But behind the scenes, you have a problem: how does 
Python know where text[end:] starts? With fixed-size characters, that's 
O(1): Python just moves forward end*width bytes into the string. Nice and 
fast.
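
To make that concrete, here's a rough sketch of the arithmetic, using a 
UTF-32 byte buffer to stand in for a fixed-width internal representation 
(the helper name is mine, purely for illustration):

def slice_from_utf32(buf, end, width=4):
    # buf holds UTF-32-LE data: character `end` starts at byte end*width,
    # so there is nothing to scan -- just multiply and jump.
    return buf[end * width:].decode('utf-32-le')

text = "spam and eggs"
buf = text.encode('utf-32-le')
assert slice_from_utf32(buf, 5) == text[5:]   # "and eggs"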

With variable-sized characters, Python has to start from the beginning 
again and inspect each byte or pair of bytes. That turns the slice 
operation into O(N), and if you do it over and over -- say, peeling off 
token after token while parsing -- the total work becomes O(N**2), and 
that starts getting *horrible*.
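
By contrast, here's roughly what locating a character index costs in 
UTF-8: you have to walk the bytes from the start, skipping continuation 
bytes, so the work grows with the index (a sketch, not how any real 
implementation is coded):

def utf8_byte_offset(buf, index):
    # Scan UTF-8 bytes from the start until `index` characters have been
    # passed. Bytes of the form 10xxxxxx are continuations, not new chars.
    chars_seen = 0
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:
            if chars_seen == index:
                return pos
            chars_seen += 1
    return len(buf)

text = "œuf \U0001F40D egg"
buf = text.encode('utf-8')
assert buf[utf8_byte_offset(buf, 5):].decode('utf-8') == text[5:]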

As always, "everything is fast for small enough N", but you *really* 
don't want O(N**2) operations when dealing with large amounts of data.

Insisting that the regex functions only ever return offsets to valid 
character boundaries doesn't help you, because the string slice method 
cannot know where the indexes came from.
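
You can see the mismatch by running the same search over a string and 
over its UTF-8 bytes: the match ends at character 10 in one and byte 12 
in the other, and feeding the byte offset to the string slice silently 
gives the wrong answer (the example data is mine):

import re

text = "naïve café test"
data = text.encode('utf-8')

end_chars = re.search("café", text).end()                   # 10, a character index
end_bytes = re.search("café".encode('utf-8'), data).end()   # 12, a byte offset

print(text[end_chars:])   # " test" -- correct
print(text[end_bytes:])   # "est"   -- silently wrong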

I suppose you could have a "fast slice" and a "slow slice" method, but 
really, that sucks. Besides, all that does is pass responsibility for 
tracking character boundaries from the language to the developer, and you 
know damn well that they will get it wrong, their code will silently do 
the wrong thing, and they'll say that Python sucks and we never used to 
have this problem back in the good old days with ASCII. Boo sucks to that.

UCS-4 is an option, since that's fixed-width. But it's also bulky. For 
typical users, you end up wasting memory. That is the complaint driving 
PEP 393 -- memory is cheap, but it's not so cheap that you can afford to 
multiply your string memory by four just in case somebody someday gives 
you a character in one of the supplementary planes.
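
On a PEP 393 build (CPython 3.3 and later) you can watch the width adapt 
with sys.getsizeof: roughly 1, 2 or 4 bytes per character, depending on 
the widest character actually present (exact sizes vary a little between 
versions):

import sys

ascii_only = "x" * 1000           # 1 byte per character is enough
bmp_text   = "\u20ac" * 1000      # the euro sign needs 2 bytes each
astral     = "\U0001F600" * 1000  # an emoji forces 4 bytes each

for s in (ascii_only, bmp_text, astral):
    print(len(s), sys.getsizeof(s))   # sizes near 1K, 2K and 4K plus a header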

If you have oodles of memory and small data sets, then UCS-4 is probably 
all you'll ever need. I hear that the club for people who have all the 
memory they'll ever need is holding their annual general meeting in a 
phone-booth this year.

You could say "Screw the full Unicode standard, who needs more than 64K 
different characters anyway?" Well, apart from Asians, and historians, and 
a bunch of other people. If you can control your data and make sure no 
non-BMP characters are used, UCS-2 is fine -- except Python doesn't 
actually use that.
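
If you do want to enforce the no-supplementary-planes rule on your own 
data, at least the check is cheap to write (a trivial sketch):

def fits_in_bmp(s):
    # True if every code point is at most U+FFFF, i.e. nothing outside the BMP.
    return all(ord(c) <= 0xFFFF for c in s)

print(fits_in_bmp("ASCII and \u20ac are fine"))   # True
print(fits_in_bmp("but \U0001F600 is not"))       # False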

You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up 
to the individual programmer to track character boundaries, and we know 
how well that works. Luckily the supplementary planes are only rarely 
used, and people who need them tend to buy more memory and use wide 
builds. People who only need a few non-BMP characters in a narrow build 
generally just cross their fingers and hope for the best.
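
For anyone who hasn't been bitten yet, the narrow-build trap looks like 
this: one supplementary-plane character takes two UTF-16 code units, and 
anything that treats code units as characters sees the two halves (the 
3.2 behaviour described in the comments is historical; you can't 
reproduce it on a wide or 3.3+ build):

ch = "\U0001F600"                 # one character, outside the BMP
units = ch.encode('utf-16-be')
print(len(units) // 2)            # 2 code units: a surrogate pair

# On a 3.2 narrow build, len(ch) reported 2 and ch[0] was the lone high
# surrogate '\ud83d', so a naive slice could cut the character in half.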

You could add a whole lot more heavyweight infrastructure to strings and 
turn them into souped-up ropes-on-steroids. All those extra indexes mean 
that you don't save any memory. Because the objects are so much bigger 
and more complex, your CPU cache goes to the dogs and your code still 
runs slow.

Which leaves us right back where we started: PEP 393.


> Obviously one can concoct hypothetical examples that would suffer.

If you think "slicing at arbitrary indexes" is a hypothetical example, I 
don't know what to say.



-- 
Steven


