tuples, index method, Python's design

Rhamphoryncus rhamph at gmail.com
Sun Apr 15 02:29:56 EDT 2007


On Apr 14, 11:59 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> "Rhamphoryncus" <rha... at gmail.com> writes:
> > Nope, it's pretty fundamental to working with text, unicode only being
> > an extreme example: there's a wide number of ways to break down a
> > chunk of text, making the odds of "e" being any particular one fairly
> > low.  Python's unicode type only makes this slightly worse, not
> > promising any particular one is available.
>
> I don't understand this.  I thought that unicode was a character
> coding system like ascii, except with an enormous character set
> combined with a bunch of different algorithms for encoding unicode
> strings as byte sequences.  But I've thought of those algorithms
> (UTF-8 and so forth) as basically being kludgy data compression
> schemes, and unicode strings are still just sequences of code points.

Indexing cost, memory efficiency, and canonical representation: pick
two.  You can't have a canonical representation (scalar values) without
either some sort of costly search when indexing (O(log n) probably) or
expanding to the worst-case size (UTF-32).  Python has taken the
approach of always providing efficient indexing (O(1)), but you can
compile it with either UTF-16 (better memory efficiency) or UTF-32
(canonical representation).
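
To make the tradeoff concrete, here is a rough sketch.  It assumes a
current Python 3 interpreter, and the sample string and variable names
are made up for illustration.  It encodes the same three characters
three ways and shows that in UTF-16 the unit at a given index is not
necessarily a whole character.

```python
import struct

# Illustrative sample: 'e', 'e with acute', and U+1D11E (outside the BMP).
text = "e\u00e9\U0001D11E"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # -le variant so no BOM is prepended
utf32 = text.encode("utf-32-le")

# Bytes needed by each encoding for the same three scalar values.
sizes = {"utf-8": len(utf8), "utf-16": len(utf16), "utf-32": len(utf32)}

# Fixed-width units you would index over in each representation.
utf16_units = struct.unpack("<%dH" % (len(utf16) // 2), utf16)
utf32_units = struct.unpack("<%dI" % (len(utf32) // 4), utf32)

# UTF-32 indexing is O(1) and every unit is a scalar value.
# UTF-16 indexing is O(1) too, but unit 2 is only half of U+1D11E.
third_utf32 = utf32_units[2]
third_utf16 = utf16_units[2]
```

Here UTF-32 takes 12 bytes for 3 units, while UTF-16 takes 8 bytes but
has 4 units, the last two being a surrogate pair.  UTF-8 is smallest at
7 bytes, but finding the third character means scanning from the start
or keeping an index on the side.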

As an aside, I feel the need to clarify the terms "code points" and
"scalar values".  The only difference is that "code points" includes
the surrogates, whereas "scalar values" does not.  As the surrogates
are just an encoding detail of UTF-16, I feel this makes "scalar
values" the more canonical term.  It's all quite confusing though x_x.
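
A minimal sketch of that distinction, in case it helps.  The function
names are my own invention, not anything from Python or the Unicode
standard's text; the ranges are the standard ones (code points run from
0 to 0x10FFFF, surrogates from 0xD800 to 0xDFFF).

```python
def is_code_point(n):
    """True for any integer in the Unicode codespace."""
    return 0 <= n <= 0x10FFFF

def is_surrogate(n):
    """True for the code points reserved for UTF-16 surrogate pairs."""
    return 0xD800 <= n <= 0xDFFF

def is_scalar_value(n):
    """A scalar value is any code point that is not a surrogate."""
    return is_code_point(n) and not is_surrogate(n)

# Count both sets to show how small the difference is.
code_point_count = sum(1 for n in range(0x110000) if is_code_point(n))
scalar_value_count = sum(1 for n in range(0x110000) if is_scalar_value(n))
```

The two counts differ by exactly the 2048 surrogates.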

--
Adam Olsen, aka Rhamphoryncus



