UTF-8 question from Dive into Python 3

Wed Jan 19 13:45:35 EST 2011

On Wed, 19 Jan 2011 18:02:22 +0000 (UTC)
Tim Harig <usernet at ilthio.net> wrote:
> On 2011-01-19, Antoine Pitrou <solipsis at pitrou.net> wrote:
> > On Wed, 19 Jan 2011 16:03:11 +0000 (UTC)
> > Tim Harig <usernet at ilthio.net> wrote:
> >> 
> >> For many operations, it is just much faster and simpler to use a single
> >> character-based container, as opposed to having to process an entire byte
> >> stream to determine individual letters from the bytes or to having
> >> adaptive size containers to store the data.
> >
> > You *have* to "process the entire byte stream" in order to determine
> > boundaries of individual letters from the bytes if you want to use a
> > "character based container", regardless of the exact representation.
> 
> Right, but I only have to do that once.

You only have to decode once as well.

> If I leave the information as a
> simple UTF-8 stream,

That's not what we are talking about. We are talking about the supposed
benefits of your 6-byte representation scheme versus proper decoding
into fixed width code points.
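
To illustrate "decode once": a minimal Python 3 sketch, in which the sample
string and variable names are made up for illustration. The UTF-8 bytes are
decoded a single time, after which indexing works on whole code points
rather than on bytes.

```python
# Illustrative sample only: UTF-8 bytes as they might arrive from a file or socket.
data = "na\u00efve caf\u00e9 \u20ac5".encode("utf-8")

# The single decoding pass; character boundaries are discovered here, once.
text = data.decode("utf-8")

# Byte count and code point count differ as soon as the text is not pure ASCII.
byte_count = len(data)
char_count = len(text)

# Indexing the decoded string yields a whole character ...
third_char = text[2]
# ... while indexing the raw bytes yields only the lead byte of a 2-byte sequence.
third_byte = data[2]
```

After the decode, `text[i]` needs no further boundary scanning, which is the
property both sides of this thread want from a "character based container".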

> UTF-32/UCS-4 conversion is definitely superior if you are actually
> doing any major but it adds the complexity and overhead of requiring
> the bit twiddling to make the conversions (once in, once again out).

"Bit twiddling" is not something processors are particularly bad at.
Actually, modern processors are much better at arithmetic and logic
than at recovering from mispredicted branches, which seems to suggest
that discovering boundaries probably eats most of the CPU cycles.
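
For concreteness, here is a rough sketch of the "bit twiddling" in question.
It is a hypothetical helper (`decode_utf8` is not a library function), it
assumes well-formed input and performs no validation, and it is not how
CPython's decoder is written. The shifts and masks are a few cheap
operations per byte; the branches on the lead byte are the boundary
discovery.

```python
def decode_utf8(data):
    """Decode well-formed UTF-8 bytes into a list of code points.

    Illustrative only: no validation of continuation bytes, overlong
    forms, or surrogates is attempted.
    """
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Boundary discovery: the lead byte says how many bytes follow.
        if lead < 0x80:        # 0xxxxxxx: single byte (ASCII)
            cp, extra = lead, 0
        elif lead < 0xE0:      # 110xxxxx: one continuation byte
            cp, extra = lead & 0x1F, 1
        elif lead < 0xF0:      # 1110xxxx: two continuation bytes
            cp, extra = lead & 0x0F, 2
        else:                  # 11110xxx: three continuation bytes
            cp, extra = lead & 0x07, 3
        # Bit twiddling: each continuation byte contributes its low 6 bits.
        for b in data[i + 1:i + 1 + extra]:
            cp = (cp << 6) | (b & 0x3F)
        code_points.append(cp)
        i += 1 + extra
    return code_points


sample = "a\u00e9\u20ac\U0001F600"
decoded = decode_utf8(sample.encode("utf-8"))
```

For well-formed input the result should match `[ord(c) for c in sample]`.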

> Converting to a fixed byte
> representation (UTF-32/UCS-4) or separating all of the bytes for each
> UTF-8 into 6 byte containers both make it possible to simply index the
> letters by a constant size.  You will note that Python does the
> former.

Indeed, Python chose the wise option. Actually, I'd be curious about any
real-world software that successfully chose your proposed approach.
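
The two fixed-width layouts discussed in this thread can be compared with a
small sketch. Everything here is an assumption made for illustration: the
helper names, the zero-byte padding, and the reading of "6 byte containers"
as one padded slot per character. Both layouts allow indexing by a constant
stride, but the padded slots use more space and still hold undecoded UTF-8.

```python
SLOT = 6  # slot width taken from the thread; the padding byte is an assumption


def to_padded_slots(text):
    """Store each character's UTF-8 bytes in its own zero-padded 6-byte slot."""
    return b"".join(ch.encode("utf-8").ljust(SLOT, b"\x00") for ch in text)


def slot_char(buf, index):
    """Fetch character `index` from the padded layout.

    Assumes the text contains no U+0000, since zero bytes mark padding here.
    """
    raw = buf[index * SLOT:(index + 1) * SLOT].rstrip(b"\x00")
    return raw.decode("utf-8")  # the slot still has to be decoded on access


def utf32_char(buf, index):
    """Fetch character `index` from a UTF-32 (little-endian) buffer."""
    return buf[index * 4:(index + 1) * 4].decode("utf-32-le")


text = "a\u00e9\u20ac"
padded = to_padded_slots(text)
utf32 = text.encode("utf-32-le")

padded_size = len(padded)  # 6 bytes per character
utf32_size = len(utf32)    # 4 bytes per character
```

Both `slot_char(padded, 2)` and `utf32_char(utf32, 2)` reach the third
character with a constant-size offset; only the UTF-32 buffer holds the code
point itself at that offset.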