UTF-8 question from Dive into Python 3

Wed Jan 19 13:02:22 EST 2011

On 2011-01-19, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Wed, 19 Jan 2011 16:03:11 +0000 (UTC)
> Tim Harig <usernet at ilthio.net> wrote:
>> 
>> For many operations, it is just much faster and simpler to use a single
>> character based container opposed to having to process an entire byte
>> stream to determine individual letters from the bytes or to having
>> adaptive size containers to store the data.
>
> You *have* to "process the entire byte stream" in order to determine
> boundaries of individual letters from the bytes if you want to use a
> "character based container", regardless of the exact representation.

Right, but I only have to do that once.  After that, I can directly address
any piece of the stream that I choose.  If I leave the information as a
simple UTF-8 stream, I would have to walk the stream again each time:
checking the first byte of every character from the beginning, so that
each multibyte character is counted only once, until I found the
character that I actually wanted.  Converting to a fixed-width
representation (UTF-32/UCS-4) or separating the bytes of each UTF-8
character into 6-byte containers both make it possible to simply index
the letters by a constant size.  You will note that Python does the former.
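
To make the difference concrete, here is a rough sketch of the two access
patterns.  The function names utf8_char_at and utf32_char_at are made up for
illustration, and this is not how Python stores strings internally; it only
contrasts walking a UTF-8 byte string against slicing fixed 4-byte units.

```python
def utf8_char_at(data, index):
    """Return character number `index` by walking UTF-8 bytes from the start."""
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1      # number of the character whose lead byte was seen last
    start = None    # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts
        # a new character, so only those bytes are counted.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    return data[start:].decode("utf-8")   # wanted character was the last one


def utf32_char_at(data, index):
    """Return character number `index` from UTF-32-LE bytes (no BOM)."""
    if index < 0 or 4 * index + 4 > len(data):
        raise IndexError("character index out of range")
    # Every character occupies exactly 4 bytes, so the offset is arithmetic.
    return data[4 * index:4 * index + 4].decode("utf-32-le")


text = "na\u00efve \u20acuro \U0001d11e"
utf8_data = text.encode("utf-8")
utf32_data = text.encode("utf-32-le")
```

The first function does work proportional to the index on every lookup; the
second does the same amount of work for any index, at the cost of 4 bytes per
character.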

UTF-32/UCS-4 conversion is definitely superior if you are actually
doing any major processing, but it adds the complexity and overhead of
requiring the bit twiddling to make the conversions (once in, once again out).
Some programs don't really care enough about what the data actually
contains to make it worthwhile.  They just want to be able to use the
characters as black boxes.

> Once you do that it shouldn't be very costly to compute the actual code
> points. So, "much faster" sounds a bit dubious to me; especially if you

You could, I suppose, keep a separate list of pointers to each letter and
use that list for indexing, or keep a list of the character sizes so that
you can sum them to calculate the variable-width index; but that adds
overhead as well.
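
A minimal sketch of the pointer-list idea, assuming the data is valid UTF-8
held in a bytes object.  The names build_offsets and char_at are hypothetical;
the point is only that the stream is walked once and the table then costs one
extra integer per character.

```python
def build_offsets(data):
    """Walk the UTF-8 bytes once and record where each character starts."""
    # Any byte that is not a continuation byte (0b10xxxxxx) begins a character.
    offsets = [pos for pos, byte in enumerate(data) if byte & 0xC0 != 0x80]
    offsets.append(len(data))   # sentinel marking the end of the last character
    return offsets


def char_at(data, offsets, index):
    """Look up character number `index` using the precomputed offset table."""
    if not 0 <= index < len(offsets) - 1:
        raise IndexError("character index out of range")
    return data[offsets[index]:offsets[index + 1]].decode("utf-8")


sample = "caf\u00e9 \u20ac5".encode("utf-8")
table = build_offsets(sample)
```

Indexing is constant time after the first pass, but the table has to be
rebuilt or patched whenever the underlying bytes change, which is the
overhead mentioned above.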