tuples, index method, Python's design

Sun Apr 15 10:56:58 EDT 2007

Paul Rubin wrote:
> "Rhamphoryncus" <rhamph at gmail.com> writes:
>> Indexing cost, memory efficiency, and canonical representation: pick
>> two.  You can't use a canonical representation (scalar values) without
>> either some sort of costly search when indexing (O(log n) probably) or
>> expanding to the worst-case size (UTF-32).  Python has taken the
>> approach of always providing efficient indexing (O(1)), but you can
>> compile it with either UTF-16 (better memory efficiency) or UTF-32
>> (canonical representation).
> 
> I still don't get it.  UTF-16 is just a data compression scheme, right?
> I mean, s[17] isn't the 17th character of the (unicode) string regardless
> of which memory byte it happens to live at?  It could be that accessing
> it takes more than constant time, but that's hidden by the implementation.
> 
> So where does the invariant c==s[s.index(c)] fail, assuming s contains c?

I didn't get it either, but now I understand. Like you, I thought Python 
Unicode strings present a canonical representation (in interface, not 
necessarily in implementation), but apparently that is not true: the 
items of a Unicode object are code units, not scalar values; see 
Neil's post and the reference manual 
(http://docs.python.org/ref/types.html#l2h-22).

A simple example on my Python installation, apparently compiled to use 
UTF-16 (sys.maxunicode == 65535):

 >>> s = u'\U0001d400'
 >>> s.index(s)
0
 >>> s[0]
u'\ud835'
 >>> s == s[0]
False

In this case s[0] is not the full Unicode scalar, but instead just the 
first half of the surrogate pair consisting of 0xD835 (in s[0]) and 
0xDC00 (in s[1]). (The eight-digit \U escape is needed for a code point 
above 0xFFFF; \u takes exactly four hex digits.)
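
For what it's worth, the split into code units can be worked out by hand 
for U+1D400 (written u'\U0001d400'). Below is a minimal sketch, assuming 
the standard UTF-16 surrogate arithmetic; to_utf16_units is a name I made 
up for illustration, not anything that exists in Python itself:

```python
def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # Fits in a single 16-bit code unit (this also covers a
            # surrogate that is already stored as a separate item).
            units.append(cp)
        else:
            # Above the BMP: split the 20-bit offset into a pair.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # high surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low surrogate
    return units


units = to_utf16_units(u'\U0001d400')
print([hex(u) for u in units])  # ['0xd835', '0xdc00']
print(len(units))               # 2
```

So a string holding one scalar value takes two items when it is indexed by 
UTF-16 code unit, and no single item can compare equal to the whole 
string.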

-- 
If I have been able to see further, it was only because I stood
on the shoulders of giants.  -- Isaac Newton

Roel Schroeven