UTF-8 question from Dive into Python 3

Antoine Pitrou solipsis at pitrou.net
Wed Jan 19 11:27:25 EST 2011


On Wed, 19 Jan 2011 16:03:11 +0000 (UTC)
Tim Harig <usernet at ilthio.net> wrote:
> 
> For many operations, it is just much faster and simpler to use a single
> character based container, as opposed to having to process an entire byte
> stream to determine individual letters from the bytes or having
> adaptive size containers to store the data.

You *have* to "process the entire byte stream" in order to determine
boundaries of individual letters from the bytes if you want to use a
"character based container", regardless of the exact representation.
Once you do that, it shouldn't be very costly to compute the actual code
points. So "much faster" sounds a bit dubious to me, especially if you
factor in the cost of memory allocation and the fact that a larger
container will fit less easily in a data cache.
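
To illustrate, here is a minimal sketch of a hand-written UTF-8 scan. It
is not how CPython's decoder is implemented; the function name
utf8_code_points is made up for this example, and it deliberately skips
the checks for overlong encodings and surrogates that a real decoder
must do. The lead byte alone gives the length of each sequence (that is
the "boundary" step), and the code point then falls out of a few shifts
and masks over the same bytes.

```python
def utf8_code_points(data):
    """Return the code points in a UTF-8 byte string.

    Illustrative only: overlong forms and surrogates are not rejected.
    """
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Step 1: the lead byte tells us where this character ends.
        if lead < 0x80:               # 0xxxxxxx: 1 byte (ASCII)
            length, cp = 1, lead
        elif lead >> 5 == 0b110:      # 110xxxxx: 2 bytes
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:     # 1110xxxx: 3 bytes
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:    # 11110xxx: 4 bytes
            length, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        # Step 2: fold the 6 payload bits of each continuation byte
        # (10xxxxxx) into the code point.
        for b in data[i + 1:i + length]:
            if b >> 6 != 0b10:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (b & 0x3F)
        code_points.append(cp)
        i += length
    return code_points
```

For valid input, utf8_code_points(s.encode("utf-8")) should give the
same list as [ord(c) for c in s]. Comparing len(s.encode("utf-8")) with
len(s.encode("utf-32-le")) for the same string shows how much larger
the fixed-width container is.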

> That said, and more importantly, many
> variable-length byte streams may not have alternate representations as
> Unicode does.

This whole thread is about UTF-8 (see title), so I'm not sure what
relevance this is supposed to have.




