UTF-8 question from Dive into Python 3

Thu Jan 20 16:15:21 EST 2011

On Jan 19, 11:33 pm, Terry Reedy <tjre... at udel.edu> wrote:
> On 1/19/2011 1:02 PM, Tim Harig wrote:
>
> > Right, but I only have to do that once.  After that, I can directly address
> > any piece of the stream that I choose.  If I leave the information as a
> > simple UTF-8 stream, I would have to walk the stream again: I would have to
> > walk through the first byte of all the characters from the beginning to
> > make sure that I was only counting multibyte characters once until I found
> > the character that I actually wanted.  Converting to a fixed-width
> > representation (UTF-32/UCS-4) or separating all of the bytes for each
> > UTF-8 character into 6-byte containers both make it possible to simply
> > index the letters by a constant size.  You will note that Python does the
> > former.
>
> The idea of using a custom fixed-width padded version of a UTF-8 stream
> was initially shocking to me, but I can imagine that there are
> specialized applications, which slice-and-dice uninterpreted segments,
> for which that is appropriate. However, it is not germane to the folly
> of prefixing standard UTF-8 streams with a 3-byte magic number,
> mislabelled a 'byte-order mark', thus making them non-standard.
>
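
To make the indexing argument quoted above concrete, here is a minimal Python 3 sketch. It is only an illustration, not how CPython stores strings internally. The helper names utf8_index and utf32_index are made up for this example. The first walks the lead bytes of a UTF-8 stream from the start; the second jumps straight to a fixed 4-byte slot in BOM-less UTF-32-LE data.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of a UTF-8 byte string by walking it."""
    if n < 0:
        raise IndexError(n)
    count = -1
    start = None
    for i, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts
        # a new character, so only those bytes are counted.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    # The requested character is the last one in the stream.
    return data[start:].decode("utf-8")


def utf32_index(data: bytes, n: int) -> str:
    """Return the n-th code point of a BOM-less UTF-32-LE byte string."""
    if not 0 <= n < len(data) // 4:
        raise IndexError(n)
    # Every code point occupies exactly 4 bytes, so no walk is needed.
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "naïve café €5"
utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-le")
```

Both helpers return the same character for a given index; the difference is that utf8_index has to scan every byte before the target, while utf32_index computes the offset directly.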


The Unicode Standard, Version 5.2.0, Chapter 2, Section 14, page 51,
paragraph *Unicode Signature*.
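
For anyone who wants to see the 3-byte signature from Python 3, here is a small sketch using only the standard library (codecs.BOM_UTF8 and the "utf-8-sig" codec). The sample string is arbitrary and chosen just for illustration.

```python
import codecs

payload = "déjà vu"

plain = payload.encode("utf-8")        # no signature
signed = payload.encode("utf-8-sig")   # signature prepended

# The signature is U+FEFF encoded as UTF-8: the bytes EF BB BF.
signature_ok = signed == codecs.BOM_UTF8 + plain

# The plain "utf-8" codec keeps the signature as a leading U+FEFF character.
decoded_plain = signed.decode("utf-8")

# "utf-8-sig" strips the signature if present and accepts its absence.
decoded_sig = signed.decode("utf-8-sig")
decoded_unsigned = plain.decode("utf-8-sig")
```

So a reader that decodes with plain "utf-8" has to deal with the extra U+FEFF itself, while "utf-8-sig" handles both signed and unsigned input.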