tuples, index method, Python's design

Sun Apr 15 06:43:31 EDT 2007

Paul Rubin:

> I still don't get it.  UTF-16 is just a data compression scheme, right?
> I mean, s[17] isn't the 17th character of the (unicode) string regardless
> of which memory byte it happens to live at?  It could be that accessing
> it takes more than constant time, but that's hidden by the implementation.

    Python Unicode strings are arrays of code units, each either 16 or 
32 bits wide; the width is fixed when Python is compiled. s[17] is the 
18th code unit of the string and is found by plain indexing, with no 
ancillary data structure or processing to interpret the string as a 
sequence of code points. On a 16-bit build a character outside the 
Basic Multilingual Plane takes two code units (a surrogate pair), so 
s[17] need not be the 18th character.
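
    To illustrate the code-unit view, here is a small sketch that spells 
out the 16-bit code units of a string by encoding it as UTF-16, so it 
gives the same answer whichever width the running Python was compiled 
with. The helper name utf16_code_units and the sample string are made 
up for the example; this is not how the interpreter stores strings 
internally.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text as a list of ints.

    Hypothetical helper for illustration only: it goes through the
    UTF-16 codec rather than looking at the interpreter's own storage.
    """
    data = text.encode("utf-16-le")          # 2 bytes per code unit, no BOM
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# 'a', U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP), 'b'
s = u"a\U0001D11Eb"
units = utf16_code_units(s)

# Three characters, but four code units: the clef is a surrogate pair.
print([hex(u) for u in units])

# Indexing by code unit lands inside the pair, not on a whole character.
print(hex(units[1]), hex(units[2]))
```

    Indexing units[2] here yields the low surrogate 0xDD1E, which is 
not a character on its own; that is what indexing by code unit rather 
than by code point looks like with 16-bit units.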

    This is the same technique used by other languages such as Java. 
Implementing the Python string type with a data structure that can 
switch between UTF-8, UTF-16 and UTF-32 while preserving the appearance 
of a UTF-32 sequence has been proposed but has not gained traction due 
to issues of complexity and cost.

    Neil