[I18n-sig] How does Python Unicode treat surrogates?
Tom Emerson
tree@basistech.com
Mon, 25 Jun 2001 15:33:35 -0400
Guido van Rossum writes:
> And that's where your proposal simply doesn't work. If the storage
> units are all 16 bits, and you want the index to count in characters,
> you can't know where in a megabyte-long string to start looking for
> character 1,000,000: you have to iterate over the storage units from
> the beginning until you have counted 1,000,000 characters. If there
> were no surrogates, that's 1,000,000 storage units from the beginning;
> if all characters happened to be surrogates, it would be 2,000,000
> storage units. If there are m surrogates between character 0 and
> character n, character n starts at storage unit offset n+m; the only
> way to determine m is a brute-force O(n) search.
Bing, the light goes on. Of course. "Never mind." :-)
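
For the archive, here is a minimal sketch of the scan Guido describes. It
assumes the string is held as a plain list of 16-bit storage units; the
helper names utf16_units and unit_offset are made up for illustration and
are not part of Python or of any proposal in this thread.

```python
def utf16_units(text):
    """Encode a str as a list of 16-bit storage units (UTF-16, no BOM)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def unit_offset(units, n):
    """Return the storage-unit offset at which character n starts.

    There is no shortcut: every character before n has to be visited to
    learn whether it occupies one storage unit or two, so this is O(n).
    """
    offset = 0
    for _ in range(n):
        if offset >= len(units):
            raise IndexError("character index out of range")
        is_high = 0xD800 <= units[offset] <= 0xDBFF
        has_low = (offset + 1 < len(units)
                   and 0xDC00 <= units[offset + 1] <= 0xDFFF)
        # A high surrogate followed by a low surrogate is one character
        # stored in two units; anything else is one unit.
        offset += 2 if (is_high and has_low) else 1
    if offset >= len(units):
        raise IndexError("character index out of range")
    return offset


# Two characters outside the BMP precede "c", so n = 4 and m = 2.
units = utf16_units("a\U00010000b\U0001F600c")
print(unit_offset(units, 4))
```

With no surrogates in the string the answer is always n, but the loop still
has to run to find that out, which is the whole problem.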
--
Tom Emerson Basis Technology Corp.
Sr. Sinostringologist http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"