How is unicode implemented behind the scenes?

Sat Mar 8 22:19:00 EST 2014

On Sun, Mar 9, 2014 at 2:01 PM, Roy Smith <roy at panix.com> wrote:
> In article <531bd709$0$29985$c3e8da3$5496439d at news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
>
>> There are various common ways to store Unicode strings in RAM.
>>
>> The first, UTF-16.
>> [...]
>> Another option is UTF-32.
>> [...]
>> Another option is to use UTF-8 internally.
>> [...]
>> In Python 3.3, CPython introduced an internal scheme that gives the best
>> of all worlds. When a string is created, Python uses a different
>> implementation depending on the characters in the string:
>
> This was an excellent post, but I would take exception to the "best of
> all worlds" statement.  I would put it a little less absolutely and say
> something like, "a good compromise for many common use cases".  I would
> even go with, "... for most common use cases".  But, there are
> situations where it loses.

It's universally good for string indexing/slicing on binary CPUs
(there's no point using a 24-bit or 21-bit representation on an
Intel-compatible CPU, even though they'd be just as good as UTF-32).
It's not a compromise, so much as a recognition that Python offers
convenient operators for indexing and slicing. If, on the other hand,
Python fundamentally worked with U+0020 separated words (REXX has a
whole set of word-based functions), then it might be better to
represent strings as lists of words internally. Or if the string
operations are primarily based on the transitions between Unicode
types of "space" and "non-space", which would be more likely these
days, then something of that sort would still work. Anyway, it's based
on the operations the language makes convenient, and which will
therefore be common and expected to be fast: those are the operations
to optimize for.
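
To make the "different implementation depending on the characters"
point concrete, here is a rough sketch that estimates how many bytes
CPython spends per character. The helper name bytes_per_char is my
own, not anything from the stdlib, and the numbers are a CPython 3.3+
implementation detail rather than a language guarantee. It compares
two lengths of the same string so the fixed object header cancels
out. On CPython 3.3 or later I'd expect 1, 1, 2 and 4 respectively.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string lengths.

    Hypothetical helper for illustration. Subtracting the two sizes
    cancels the fixed per-object overhead, leaving n * width bytes.
    """
    short = ch * n
    longer = ch * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n


samples = [
    ("ASCII 'a'", "a"),
    ("Latin-1 U+00E9", "\u00e9"),
    ("BMP U+20AC", "\u20ac"),
    ("astral U+1F600", "\U0001F600"),
]

for label, ch in samples:
    # The width is chosen from the widest character in the string.
    print(label, bytes_per_char(ch))
```

Whichever width gets picked, every character in a given string takes
the same number of bytes, which is what keeps indexing and slicing
cheap.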

If the only thing you ever do with a string is iterate sequentially
over its characters, UTF-8 would be the perfect representation. It's
compact, you can concatenate strings without re-encoding, and it
iterates forwards easily. But it sucks for "give me character #142857
from this string", so it's a bad choice for Python.
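
A minimal sketch of why indexing hurts with UTF-8, assuming the bytes
are already valid UTF-8 and that no side index is kept. The function
name utf8_char_at is made up for this example. To find character N it
has to walk every lead byte before it, so the cost grows with N
instead of being constant.

```python
def utf8_char_at(data, index):
    """Return the character at position `index` in UTF-8 encoded bytes.

    Hypothetical helper for illustration. It scans from the start,
    using each lead byte to learn how long that character is.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    pos = 0    # byte offset into data
    seen = 0   # characters passed so far
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:
            size = 1            # 0xxxxxxx: one byte
        elif lead >> 5 == 0b110:
            size = 2            # 110xxxxx: two bytes
        elif lead >> 4 == 0b1110:
            size = 3            # 1110xxxx: three bytes
        elif lead >> 3 == 0b11110:
            size = 4            # 11110xxx: four bytes
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if seen == index:
            return data[pos:pos + size].decode("utf-8")
        pos += size
        seen += 1
    raise IndexError("character index out of range")


text = "a\u00e9\u20ac\U0001F600z"
encoded = text.encode("utf-8")
# Each lookup restarts the scan from byte 0.
print([utf8_char_at(encoded, i) for i in range(len(text))] == list(text))
```

Sequential iteration over the same bytes needs only one pass, which is
the case where UTF-8 does well.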

ChrisA