tuples, index method, Python's design
Roel Schroeven
rschroev_nospam_ml at fastmail.fm
Sun Apr 15 10:56:58 EDT 2007
Paul Rubin schreef:
> "Rhamphoryncus" <rhamph at gmail.com> writes:
>> Indexing cost, memory efficiency, and canonical representation: pick
>> two. You can't use a canonical representation (scalar values) without
>> some sort of costly search when indexing (O(log n) probably) or by
>> expanding to the worst-case size (UTF-32). Python has taken the
>> approach of always providing efficient indexing (O(1)), but you can
>> compile it with either UTF-16 (better memory efficiency) or UTF-32
>> (canonical representation).
>
> I still don't get it. UTF-16 is just a data compression scheme, right?
> I mean, s[17] isn't the 17th character of the (unicode) string regardless
> of which memory byte it happens to live at? It could be that that accessing
> it takes more than constant time, but that's hidden by the implementation.
>
> So where does the invariant c==s[s.index(c)] fail, assuming s contains c?
I didn't get it either, but now I understand. Like you, I thought Python
Unicode strings contain a canonical representation (in interface, not
necessarily in implementation) but apparently that is not true; see
Neil's post and the reference manual
(http://docs.python.org/ref/types.html#l2h-22).
A simple example on my Python installation, apparently compiled to use
UTF-16 (sys.maxunicode == 65535):
>>> s = u'\u1d400'
>>> s.index(s)
0
>>> s[0]
u'\u1d40'
>>> s == s[0]
False
In this case s[0] is not the full Unicode scalar, but instead just the
first part of the surrogate pair consisting of 0x1D40 (in s[0]) and
0x0000 (in s[1]).
--
If I have been able to see further, it was only because I stood
on the shoulders of giants. -- Isaac Newton
Roel Schroeven
More information about the Python-list
mailing list