How is unicode implemented behind the scenes?

Dan Sommers dan at tombstonezero.net
Sun Mar 9 00:46:06 EST 2014


On Sun, 09 Mar 2014 03:50:49 +0000, Steven D'Aprano wrote:

> ... UTF-16 ... the letter "A" is stored as two bytes 0x0041 (or 0x4100
> depending on your platform's byte order) ...

At the risk of being pedantic, the two bytes are 0x00 and 0x41, and the
order in which they appear in memory depends on your platform and even
your particular view of that platform (do stacks grow up or down?  are
addresses of higher memory larger or smaller?).
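
For anyone who wants to see the two bytes for themselves, here is a
minimal sketch using Python 3's built-in codecs. The variable names are
only for illustration, and the last line of output depends on the
machine running it.

```python
import sys

# "A" is U+0041. UTF-16 stores it as one 16-bit code unit: bytes 0x00 and 0x41.
big = "A".encode("utf-16-be")     # most significant byte first
little = "A".encode("utf-16-le")  # least significant byte first

print(big.hex())     # 0041
print(little.hex())  # 4100

# The plain "utf-16" codec writes a byte order mark first, followed by
# the code unit in the native byte order of the machine running this.
native = "A".encode("utf-16")
print(sys.byteorder, native.hex())
```

The same two byte values show up either way; only their order changes.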

> ... UTF-32 ... "A" would be stored as 0x00000041 or 0x41000000 ...

Or even some other sequence if you're on a PDP-11.

See <http://www.catb.org/jargon/html/M/middle-endian.html>.
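
A rough sketch of the three layouts for a 32-bit "A", in Python 3.
pdp_endian is a hypothetical helper written for this example. It assumes
the commonly described PDP-11 convention for 32-bit values: two 16-bit
words with the most significant word first, each word stored
little-endian.

```python
import struct

code_point = ord("A")  # 0x41

big = struct.pack(">I", code_point)     # big-endian 32-bit value
little = struct.pack("<I", code_point)  # little-endian 32-bit value

def pdp_endian(value):
    """Hypothetical helper: pack a 32-bit value in the assumed PDP-11 order."""
    high, low = value >> 16, value & 0xFFFF
    # Most significant word first, but each word is little-endian.
    return struct.pack("<HH", high, low)

middle = pdp_endian(code_point)

print(big.hex())     # 00000041
print(little.hex())  # 41000000
print(middle.hex())  # 00004100
```

Under that assumption, 0x0A0B0C0D would come out as the bytes 0B 0A 0D
0C, which is neither of the two usual orders.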

But you knew that.  ;-)

Pedantic'ly yours,
Dan


