UTF-8 question from Dive into Python 3

Terry Reedy tjreedy at udel.edu
Wed Jan 19 17:33:43 EST 2011


On 1/19/2011 1:02 PM, Tim Harig wrote:

> Right, but I only have to do that once.  After that, I can directly address
> any piece of the stream that I choose.  If I leave the information as a
> simple UTF-8 stream, I would have to walk the stream again, I would have to
> walk through the first byte of all the characters from the beginning to
> make sure that I was only counting multibyte characters once until I found
> the character that I actually wanted.  Converting to a fixed byte
> representation (UTF-32/UCS-4) or separating all of the bytes for each
> UTF-8 into 6 byte containers both make it possible to simply index the
> letters by a constant size.  You will note that Python does the former.

The idea of using a custom fixed-width padded version of a UTF-8 stream
was initially shocking to me, but I can imagine that there are
specialized applications, which slice and dice uninterpreted segments,
for which that is appropriate. However, it is not germane to the folly
of prefixing standard UTF-8 streams with a 3-byte magic number,
mislabelled a 'byte-order mark', thus making them non-standard.
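
For anyone who wants to see both points concretely, here is a rough
Python 3 sketch. It is only an illustration: the sample string and the
helper name char_at are my own inventions, and I use UTF-32-LE to stand
in for "a fixed byte representation" rather than the 6-byte padded
containers mentioned above.

```python
import codecs

# Sample text (an assumption for illustration): 10 code points,
# two of which need 2 bytes each in UTF-8.
text = "na\u00efve caf\u00e9"

plain = text.encode("utf-8")       # standard UTF-8, no prefix
signed = text.encode("utf-8-sig")  # same bytes behind the 3-byte prefix

# The "magic number" is just U+FEFF encoded as UTF-8.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert signed == codecs.BOM_UTF8 + plain

# A plain UTF-8 decoder keeps the prefix as a leading U+FEFF character;
# only the 'utf-8-sig' codec strips it.
leaked = signed.decode("utf-8")
assert leaked[0] == "\ufeff"
assert signed.decode("utf-8-sig") == text

# Fixed-width indexing: in UTF-32 every code point occupies 4 bytes,
# so character i lives at byte offset 4*i.
fixed = text.encode("utf-32-le")   # the -le variant writes no BOM

def char_at(buf, i):
    """Hypothetical helper: return code point i of a UTF-32-LE buffer."""
    return buf[4 * i:4 * i + 4].decode("utf-32-le")

assert char_at(fixed, 2) == "\u00ef"

# The same index into the UTF-8 bytes lands on a lead byte only.
assert plain[2:3] == b"\xc3"
```

The last two assertions are the whole indexing argument in miniature:
offset arithmetic works on the fixed-width form, while the raw UTF-8
bytes have to be scanned from the start to find character boundaries.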

-- 
Terry Jan Reedy



