tuples, index method, Python's design

Rhamphoryncus rhamph at gmail.com
Sun Apr 15 14:35:20 EDT 2007


On Apr 15, 8:56 am, Roel Schroeven <rschroev_nospam... at fastmail.fm>
wrote:
> Paul Rubin schreef:
>
> > "Rhamphoryncus" <rha... at gmail.com> writes:
> >> Indexing cost, memory efficiency, and canonical representation: pick
> >> two.  You can't use a canonical representation (scalar values) without
> >> some sort of costly search when indexing (O(log n) probably) or by
> >> expanding to the worst-case size (UTF-32).  Python has taken the
> >> approach of always providing efficient indexing (O(1)), but you can
> >> compile it with either UTF-16 (better memory efficiency) or UTF-32
> >> (canonical representation).
>
> > I still don't get it.  UTF-16 is just a data compression scheme, right?
> > I mean, s[17] isn't the 17th character of the (unicode) string regardless
> > of which memory byte it happens to live at?  It could be that accessing
> > it takes more than constant time, but that's hidden by the implementation.
>
> > So where does the invariant c==s[s.index(c)] fail, assuming s contains c?
>
> I didn't get it either, but now I understand. Like you, I thought Python
> Unicode strings contain a canonical representation (in interface, not
> necessarily in implementation) but apparently that is not true; see
> Neil's post and the reference manual
> (http://docs.python.org/ref/types.html#l2h-22).
>
> A simple example on my Python installation, apparently compiled to use
> UTF-16 (sys.maxunicode == 65535):
>
>  >>> s = u'\u1d400'

You're confusing \u, which takes exactly four hex digits, with \U, which
takes exactly eight:
>>> list(u'\u1d400')
[u'\u1d40', u'0']
>>> list(u'\U0001d400')
[u'\U0001d400']  # UTF-32 output, sys.maxunicode == 1114111
[u'\ud835', u'\udc00']  # UTF-16 output, sys.maxunicode == 65535
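
The same distinction can be checked without a narrow build. What follows is a minimal sketch, assuming Python 3 (where str always indexes by code point); the helper name utf16_units is made up for illustration and is not a library function. It encodes the string to UTF-16 by hand to show the surrogate pair a narrow build would expose as two separate items:

```python
import struct

# \u consumes exactly four hex digits, so the trailing '0' is a separate character.
s_short = '\u1d400'
# \U consumes exactly eight hex digits, giving the single code point U+1D400.
s_long = '\U0001d400'


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    raw = text.encode('utf-16-be')  # big-endian, no byte-order mark
    return list(struct.unpack('>%dH' % (len(raw) // 2), raw))


print([hex(ord(c)) for c in s_short])         # two items: U+1D40 and '0'
print(len(s_long))                            # one code point
print([hex(u) for u in utf16_units(s_long)])  # the surrogate pair
```

The two code units returned for s_long are 0xd835 and 0xdc00, the same pair shown in the UTF-16 output above.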

--
Adam Olsen, aka Rhamphoryncus



