UTF-8 question from Dive into Python 3

Wed Jan 19 11:03:11 EST 2011

On 2011-01-19, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Wed, 19 Jan 2011 14:00:13 +0000 (UTC)
> Tim Harig <usernet at ilthio.net> wrote:
>> UTF-8 has no apparent endianness if you only store it as a byte stream.
>> It does however have a byte order.  If you store it using multibytes
>> (six bytes for all UTF-8 possibilities), which is useful if you want
>> to have one storage container for each letter as opposed to one for
>> each byte(1)
>
> That's a ridiculous proposition. Why would you waste so much space?

Space is only one tradeoff.  There are many others to consider.  I have
created data structures with much higher overhead than that because
they made the problem easier to solve and made the operations that I
was performing on the data significantly faster.

For many operations, it is much faster and simpler to use a single
character-based container, as opposed to processing an entire byte
stream to determine individual letters from the bytes, or using
adaptively sized containers to store the data.
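
To make that tradeoff concrete, here is a minimal Python 3 sketch.  The
function names (seq_len, char_from_stream, char_from_cells) are
hypothetical and only for illustration, and it assumes well-formed UTF-8
limited to the modern four-byte maximum.  Finding the Nth letter in the
byte stream means walking every sequence before it, while one container
per letter allows direct indexing.

```python
def seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence starting with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#x}")


def char_from_stream(data: bytes, index: int) -> str:
    """Return the index-th letter by scanning the byte stream from the start."""
    pos = 0
    for _ in range(index):
        pos += seq_len(data[pos])  # skip one whole letter at a time
    return data[pos:pos + seq_len(data[pos])].decode("utf-8")


def char_from_cells(cells: list, index: int) -> str:
    """Return the index-th letter from a one-container-per-letter list."""
    return chr(cells[index])  # direct lookup, no scanning


text = "naïve ☃ 𝄞"
data = text.encode("utf-8")        # variable-length byte stream
cells = [ord(c) for c in text]     # one code point per container
```

The stream is smaller (len(data) is larger than len(cells) only in bytes
counted, not in containers), but every lookup in it costs a scan.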

> UTF-8 exists *precisely* so that you can save space with most scripts.

UTF-8 has many reasons for existing.  One of the biggest is that it
is compatible with tools that were designed to process ASCII and other
8-bit encodings.
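
A small Python 3 sketch of that compatibility, using made-up sample
data: ASCII text encodes to the same bytes in UTF-8, and every byte of a
multibyte sequence is 0x80 or above, so a tool that only knows to split
on an ASCII delimiter never cuts a letter in half.

```python
# Hypothetical record; the delimiter is plain ASCII.
record = "name=Zoë;city=São Paulo".encode("utf-8")

# Byte-level split, as an ASCII-only tool would do it.
fields = record.split(b";")

# Each field is still valid UTF-8 on its own.
decoded = [field.decode("utf-8") for field in fields]

# Pure ASCII text is byte-for-byte identical in both encodings.
same_bytes = "plain text".encode("ascii") == "plain text".encode("utf-8")

# No byte of a multibyte sequence can be mistaken for an ASCII byte.
high_only = all(b >= 0x80 for b in "ë".encode("utf-8"))
```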

> If you are ready to use 4+ bytes per character, just use UTF-32 which
> has much nicer properties.

I already mentioned UTF-32/UCS-4 as a probable alternative, but I might
not want to have to worry about converting the encodings back and forth
before and after processing them.  More importantly, many
variable-length byte streams may not have alternate representations as
Unicode does.
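
For reference, the back-and-forth conversion in question looks roughly
like this in Python 3.  This is only a sketch with a made-up sample
string; it uses the "utf-32-le" codec so that no byte order mark is
added and every letter occupies exactly four bytes.

```python
utf8_bytes = "héllo".encode("utf-8")

# Convert before processing: UTF-8 -> UTF-32 (little-endian, no BOM).
utf32_bytes = utf8_bytes.decode("utf-8").encode("utf-32-le")

# Fixed width: letter N lives at bytes [4*N, 4*N + 4).
n = 2
third_letter = utf32_bytes[4 * n:4 * n + 4].decode("utf-32-le")

# Convert back after processing: UTF-32 -> UTF-8.
round_trip = utf32_bytes.decode("utf-32-le").encode("utf-8")
```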