[I18n-sig] How does Python Unicode treat surrogates?

Tim Peters tim@digicool.com
Mon, 25 Jun 2001 17:22:31 -0400


My understanding is that UTF-16 (like UTF-8 in this respect) was
deliberately designed so that given a random pointer into the middle of a
contiguous vector of encodings, it's indeed O(1) to find the start of the
nearest *character* going either forwards or backwards.
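
As a rough illustration of that property (a sketch only, not anything from Python's implementation), here is how the resync could look over a plain list of 16-bit code units.  The helper names below are made up for the example, and it assumes well-formed UTF-16 apart from the odd lone surrogate, which is treated as a one-unit character:

```python
def is_high_surrogate(unit):
    # Lead (high) surrogates occupy 0xD800..0xDBFF.
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    # Trail (low) surrogates occupy 0xDC00..0xDFFF.
    return 0xDC00 <= unit <= 0xDFFF


def in_middle_of_pair(units, i):
    # True only when units[i] is the second half of a surrogate pair.
    return (i > 0
            and is_low_surrogate(units[i])
            and is_high_surrogate(units[i - 1]))


def char_start_backward(units, i):
    """Offset of the first code unit of the character covering units[i]."""
    return i - 1 if in_middle_of_pair(units, i) else i


def char_start_forward(units, i):
    """Offset of the first character start at or after offset i."""
    return i + 1 if in_middle_of_pair(units, i) else i
```

Either direction looks at no more than two code units, which is where the O(1) comes from.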

"The right way" to solve the character (not binary blob) indexing problem is
to add a search finger to the string, a pair mapping "the last" character
index asked for to the address of the start of its encoding.  Since string
traversal generally moves ahead (or back) just one character at a time,
the O(1) property from the first paragraph ensures that traversing a string
of N characters, in whole, takes O(N) time overall.  It's not as simple as base +
offset, but requires no more than a few range compares (plus updating the
finger) per indexing operation.
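
A minimal sketch of such a search finger, again over a list of 16-bit code units.  The class name, attribute names, and the choice to always walk from the finger (rather than from whichever end is nearer) are assumptions made for the example, not a description of any real string implementation:

```python
class FingeredUTF16:
    """Character indexing over UTF-16 code units with a search finger."""

    def __init__(self, units):
        self.units = units
        self.finger_char = 0  # character index last asked for
        self.finger_unit = 0  # code-unit offset where that character starts

    def _width_at(self, pos):
        # Width (1 or 2 code units) of the character starting at pos.
        u = self.units
        if (0xD800 <= u[pos] <= 0xDBFF and pos + 1 < len(u)
                and 0xDC00 <= u[pos + 1] <= 0xDFFF):
            return 2
        return 1

    def _width_before(self, pos):
        # Width of the character that ends just before pos.
        u = self.units
        if (pos >= 2 and 0xDC00 <= u[pos - 1] <= 0xDFFF
                and 0xD800 <= u[pos - 2] <= 0xDBFF):
            return 2
        return 1

    def char_at(self, index):
        if index < 0:
            raise IndexError(index)
        ci, pos = self.finger_char, self.finger_unit
        while ci < index:  # walk forward from the finger
            if pos >= len(self.units):
                raise IndexError(index)
            pos += self._width_at(pos)
            ci += 1
        while ci > index:  # walk backward from the finger
            pos -= self._width_before(pos)
            ci -= 1
        if pos >= len(self.units):
            raise IndexError(index)
        self.finger_char, self.finger_unit = ci, pos  # update the finger
        if self._width_at(pos) == 2:
            hi, lo = self.units[pos], self.units[pos + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(self.units[pos])
```

Asking for index i+1 or i-1 right after index i costs one step, so a loop over all N characters stays O(N); a jump to a far-away index costs time proportional to the distance from the finger.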