UTF-8 question from Dive into Python 3

carlo sysengp2p at gmail.com
Mon Jan 17 17:51:21 EST 2011


On 17 Jan, 23:34, Antoine Pitrou <solip... at pitrou.net> wrote:
> On Mon, 17 Jan 2011 14:19:13 -0800 (PST)
>
> carlo <syseng... at gmail.com> wrote:
> > Is it true UTF-8 does not have any "big-endian/little-endian" issue
> > because of its encoding method?
>
> Yes.
>
> > And if it is true, why does Mark (like
> > everyone else) write about UTF-8 with and without a BOM some chapters
> > later? What would the BOM's purpose be then?
>
> "BOM" in this case is a misnomer. For UTF-8, it is only used as a
> marker (a magic number, if you like) to signal that a given text file
> is UTF-8. The UTF-8 "BOM" does not say anything about byte order; and,
> actually, it does not change with endianness.
>
> (note that it is not required to put a UTF-8 "BOM" at the beginning of
> text files; it is just a hint that some tools use when
> generating/reading UTF-8)
>
> > 2- If that were true, can you point me to some documentation about the
> > math that, as Mark says, demonstrates this?
>
> Math? UTF-8 is simply a byte-oriented (rather than word-oriented)
> encoding. There is no math involved; it just works by construction.
>
> Regards
>
> Antoine.

Thank you all. I eventually found http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf#G7404
which clears it up.
No math in fact, as Tim and Antoine pointed out.
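
For anyone who wants to check this at the interpreter, here is a minimal
Python 3 sketch of the two points above. The sample character and the
variable names are only illustrative. It shows that UTF-8 produces a single
byte sequence while UTF-16 has two byte-order variants, and that the UTF-8
"BOM" is just the fixed three-byte signature EF BB BF, which Python's
"utf-8-sig" codec writes and strips.

```python
import codecs

text = "\u00e9"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# UTF-8 is defined as a sequence of bytes, so there is only one serialization.
utf8 = text.encode("utf-8")

# UTF-16 works in 16-bit units, so the same code point has two byte orders.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# The "utf-8-sig" codec prepends the UTF-8 signature (the so-called BOM).
utf8_sig = text.encode("utf-8-sig")

print(utf8)             # b'\xc3\xa9'
print(utf16_le)         # b'\xe9\x00'
print(utf16_be)         # b'\x00\xe9'
print(utf8_sig)         # b'\xef\xbb\xbf\xc3\xa9'
print(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'

# Decoding with "utf-8-sig" drops the signature if present and accepts
# input without it; plain "utf-8" keeps it as a leading U+FEFF character.
decoded_sig = utf8_sig.decode("utf-8-sig")
decoded_plain = utf8_sig.decode("utf-8")
decoded_no_bom = utf8.decode("utf-8-sig")
```

Note that the signature bytes are the same on every machine, which is why
the UTF-8 "BOM" carries no byte-order information.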