Beazley 4E P.E.R, Page29: Unicode

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Jul 14 04:18:15 EDT 2013


On Sat, 13 Jul 2013 20:09:31 -0700, vek.m1234 wrote:

> http://stackoverflow.com/questions/17632246/beazley-4e-p-e-r-page29-unicode
> 
> "directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o'
> simply produces a nine-character string U+004A, U+0061, U+006C, U+0061,
> U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you
> intended. This is because in UTF-8, the multi-byte sequence \xc3\xb1 is
> supposed to represent the single character U+00F1, not the two
> characters U+00C3 and U+00B1."

This demonstrates confusion about the fundamental concepts, while still 
accidentally getting the basic facts right. No wonder it is confusing 
you; it confuses me too! :-)

Encoding does not generate a character string, it generates bytes. So 
the person you are quoting causes confusion when he talks about an 
"encoded string": he should either make it clear that he means a string 
of bytes, or not use the word string at all. Either of these would work:

... a UTF-8 encoded byte-string b'Jalape\xc3\xb1o'

... UTF-8 encoded bytes b'Jalape\xc3\xb1o'


For older versions of Python (2.5 or older), unfortunately the b'' 
notation does not work, and you have to leave out the b.
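
To see the difference for yourself, here is a quick sketch in Python 3: 
encoding a text string gives you bytes, not another text string:

py> text = 'Jalapeño'
py> text.encode('utf-8')
b'Jalape\xc3\xb1o'
py> type(text.encode('utf-8'))
<class 'bytes'>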

Even better would be if Python did not conflate ASCII characters with 
bytes, and forced you to write byte strings like this:

... a UTF-8 encoded byte-string b'\x4a\x61\x6c\x61\x70\x65\xc3\xb1\x6f'

thus keeping the distinction between ASCII characters and bytes clear. 
But that would break backwards compatibility *way* too much, and so 
Python continues to conflate ASCII characters with bytes, even in Python 
3. But I digress.

The important thing here is that b'Jalape\xc3\xb1o' consists of nine 
bytes, as shown in hex above. Seven of them represent the ASCII 
characters Jalape and o, and the other two are not ASCII. Their meaning 
depends on what encoding you are using.

(To be precise, even the meaning of the other seven bytes depends on the 
encoding. Fortunately, or unfortunately as the case may be, *most* but 
not all encodings use the same hex values for ASCII characters as ASCII 
itself does, so I will stop mentioning this and just pretend that 
character J always equals hex byte 4A. But now you know the truth.)

Since we're using the UTF-8 encoding, the two bytes \xc3\xb1 represent 
the character ñ, also known as LATIN SMALL LETTER N WITH TILDE. In other 
encodings, those two bytes will represent something different.
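
You can watch those two bytes change meaning just by changing the codec. 
A quick sketch in Python 3:

py> b'\xc3\xb1'.decode('utf-8')    # one character
'ñ'
py> b'\xc3\xb1'.decode('latin-1')  # two characters
'Ã±'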

So, I presume that the original person's *intention* was to get a Unicode 
text string 'Jalapeño'. If they were wise in the ways of Unicode, they 
would write one of these:

'Jalape\N{LATIN SMALL LETTER N WITH TILDE}o'
'Jalape\u00F1o'
'Jalape\U000000F1o'
'Jalape\xF1o'  # hex
'Jalape\361o'  # octal

and be happy. (In Python 2, they would need to prefix all of these with 
u, to use Unicode strings instead of byte strings.)
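
If you doubt that all five spellings produce the same string, Python 3 
will confirm it:

py> ('Jalape\N{LATIN SMALL LETTER N WITH TILDE}o' == 'Jalape\u00F1o'
...  == 'Jalape\U000000F1o' == 'Jalape\xF1o' == 'Jalape\361o')
True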

But alas they have been misled by those who propagate myths, 
misunderstandings and misapprehensions about Unicode all over the 
Internet, and so they looked up ñ somewhere, discovered that it has the 
double-byte hex value c3b1 in UTF-8, and thought they could write this:

'Jalape\xc3\xb1o'

This does not do what they think it does. It creates a *text string*, a 
Unicode string, with NINE characters:

J a l a p e Ã ± o

Why? Because character Ã has ordinal value 195, which is c3 in hex, hence 
\xc3 is the character Ã; likewise \xb1 is the character ± which has 
ordinal value 177 (b1 in hex). And so they have discovered the wickedness 
that is mojibake.

http://en.wikipedia.org/wiki/Mojibake 
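
This particular accident is even reversible: re-encode the nine 
characters with Latin-1, which maps each of them back to a single byte, 
then decode the resulting bytes as UTF-8. A sketch, not something to 
rely on:

py> 'Jalape\xc3\xb1o'.encode('latin-1').decode('utf-8')
'Jalapeño'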


Instead, if they had started with a *byte-string*, and explicitly decoded 
it as UTF-8, they would have been fine:

# I manually encoded 'Jalapeño' to get the bytes below:
data = b'Jalape\xc3\xb1o'  # don't call it "bytes", that shadows the built-in type
print(data.decode('utf-8'))  # prints Jalapeño


> My original question was: Shouldn't this be 8 characters - not 9? He
> says: \xc3\xb1 is supposed to represent the single character. However
> after some interaction with fellow Pythonistas i'm even more confused.

Depends on the context. \xc3\xb1 could mean the Unicode string 
'\xc3\xb1' (in Python 2, written u'\xc3\xb1') or it could mean the byte-
string b'\xc3\xb1' (in Python 2.5 or older, written without the b).

As a string, \xc3\xb1 means two characters, with ordinal values 0xC3 (or 
decimal 195) and 0xB1 (or decimal 177), namely 'Ã' and '±'.

As bytes, \xc3\xb1 represent two bytes (well, duh), which could mean 
nearly anything:

- the 16-bit Big Endian integer 50097

- the 16-bit Little Endian integer 45507

- a 4x4 black and white bitmap

- the character '簽' (CJK UNIFIED IDEOGRAPH-7C3D) in Big5 encoded bytes

- '뇃' (HANGUL SYLLABLE NWAES) in UTF-16 (Big Endian) encoded bytes

- 'ñ' in UTF-8 encoded bytes

- the two characters 'Ã±' in Latin-1 encoded bytes

- '√±' in MacRoman encoded bytes

- 'Γ±' in ISO-8859-7 encoded bytes

and so forth. Without knowing the context, there is no way of telling 
what those two bytes represent, or whether they need to be taken together 
as a pair, or as two distinct things.
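
You can try several of those readings at the interactive prompt, using 
only the standard library. A sketch (Python 3):

py> import struct
py> struct.unpack('>H', b'\xc3\xb1')[0]   # 16-bit Big Endian integer
50097
py> struct.unpack('<H', b'\xc3\xb1')[0]   # 16-bit Little Endian integer
45507
py> b'\xc3\xb1'.decode('utf-8')           # one character
'ñ'
py> b'\xc3\xb1'.decode('latin-1')         # two characters
'Ã±'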


> With reference to the above para:
> 1. What does he mean by "writing a raw UTF-8 encoded string"?? 

He means he is confused. You don't get a text string by encoding, you 
get bytes (I will accept "byte-string"). The adjective "raw" doesn't 
really mean anything in this context: you have bytes that were encoded, 
or you have a string containing characters. "Raw" adds nothing except 
"hey, pay attention, this is low-level stuff" (for some definition of 
"low level").


> In Python2, once can do 'Jalape funny-n o'. 

Nothing funny about it to Spanish speakers.

Personally, I have always considered "o" to be pretty funny. Say "woman" 
and "women" aloud -- in the first one, it sounds like "w-oo-man", in the 
second it sounds like "w-i-men". Now that's funny. But I digress.

If you type 'Jalapeño' in Python 2 (with or without the b prefix), the 
result you get will depend on your terminal settings, but the chances are 
high that the terminal will internally represent the string as UTF-8, 
which gives you bytes

b'Jalape\xc3\xb1o'

which is *nine* bytes. When printed, your terminal will try to print each 
byte separately, giving:

byte \x4a prints as J
byte \x61 prints as a
byte \x6c prints as l
...

and so forth. If you are *unlucky* your terminal may even be smart enough 
to print the two bytes \xc3\xb1 as one character, giving you the ñ you 
were hoping for. Why unlucky? Because you got the right result by 
accident. Next time you do the same thing, on a different terminal, or 
the same terminal set to a different encoding, you will get a completely 
different result, and think that Unicode is too messed up to use.

Using Python 2.5, here I print the same string three times in a row, 
changing the terminal's encoding each time:

py> print 'Jalape\xc3\xb1o'  # terminal set to UTF-8 
Jalapeño
py> print 'Jalape\xc3\xb1o'  # and ISO-8859-6 (Arabic)
Jalapeأ�o
py> print 'Jalape\xc3\xb1o'  # and ISO-8859-5 (Cyrillic)
JalapeУБo

Which one is "right"? Answer: none of them. Not even the first, which by 
accident just happened to be what we were hoping for.

Really, don't feel bad that you are confused. Between Python 2, and the 
terminal trying *really hard* to do the right thing, it is easy to get 
confused, because sometimes the right thing happens and sometimes it 
doesn't.


> This is a 'bytes' string where each glyph is 1 byte long 

Nope. It's a string of characters. Glyphs don't come into it. Glyphs are 
the little pictures of letters that you see on the screen, or printed on 
paper. They could be bitmaps, or fancy vector graphics. They are unlikely 
to be one byte each -- more likely 200 bytes per glyph, based on a very 
rough calculation[1], but depending on whether it is a bitmap, a 
Postscript font, an OpenType font, or something else.


> when stored internally so each glyph is
> associated with an integer as per charset ASCII or Latin-1. If these
> charsets have a funny-n glyph then yay! else nay! There is no UTF-8
> here!! or UTF-16!! These are plain bytes (8 bits).

You're getting closer. But you are right: Python 2 "strings" are byte-
strings, which means UTF-8 doesn't come into it. But your terminal might 
treat those bytes as UTF-8, and so accidentally do the "right" (wrong) 
thing.


> Unicode is a really big mapping table between glyphs and integers and

Not glyphs. Between abstract "characters" and integers, called Code 
Points. Unicode contains:

- distinct letters, digits, characters
- accented letters
- accents on their own
- symbols, emoticons
- ligatures and variant forms of characters
- chars required only for backwards-compatibility with older encodings
- whitespace
- control characters
- code points reserved for private use, which can mean anything you like
- code points reserved as "will never be used"
- code points explicitly labelled "not a character"

and possibly others I have forgotten.


> are denoted as Uxxxx or Uxxxx-xxxx.

The official Unicode notation is:

U+xxxx
U+xxxxx
U+xxxxxx

that is U+ followed by exactly four, five or six hex digits. The U is 
always uppercase. Unfortunately Python doesn't support that notation, and 
you have to use either four or eight hex digits, e.g.:

\uFFFF
\U0010FFFF

For code points (ordinals) up to 255, you can also use hex or octal 
escapes, e.g. \xFF or \377

> UTF-8 UTF-16 are encodings to store
> those big integers in an efficient manner. 

Almost correct. They're not necessarily efficient.

Unicode code points are just abstract numbers that we give some meaning 
to. Code point 65 (U+0041, because hex 41 == decimal 65) means letter A, 
and so forth. Imagine these abstract code points floating in your head. 
How do you get the abstract concept of a code point into concrete form on 
a computer? The same way *everything* is put in a computer: as bytes, so 
we have to turn each abstract code point (a number) into a series of 
bytes.
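
In Python, ord and chr convert between one-character strings and code 
points, and encode turns the abstract code point into concrete bytes:

py> ord('A')
65
py> chr(65)
'A'
py> 'A'.encode('utf-8')
b'A'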

Unicode code points range from U+0000 to U+10FFFF, which means we could 
just use exactly three bytes, which take values from 000000 to 10FFFF in 
hexadecimal. Values outside of this range, say 110000, would be an error. 
For reasons of efficiency, it's faster and better to use *four* bytes, 
even though one of the four will always have the value zero.

In a nutshell, that's the UTF-32 encoding: every character uses exactly 
four bytes. E.g. code point U+0041 (character A) is hex bytes 00000041, 
or possibly 41000000, depending on whether your computer is Big Endian or 
Little Endian.
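
For example (the -be and -le codec names pin down the byte order, so no 
byte-order mark gets added):

py> 'A'.encode('utf-32-be')
b'\x00\x00\x00A'
py> 'A'.encode('utf-32-le')
b'A\x00\x00\x00'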

Since *most* text uses quite low ordinal values, that's awfully wasteful 
of memory. So UTF-16 uses just two bytes per character, and a weird 
scheme using so-called "surrogate pairs" for everything that won't fit 
into two bytes. It works, for some definition of "works", but is 
complicated, and you really want to avoid UTF-16 if you need code points 
above U+FFFF.
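
You can see the surrogate machinery at work by encoding a code point 
above U+FFFF, such as U+1F600 (GRINNING FACE):

py> 'A'.encode('utf-16-be')           # two bytes
b'\x00A'
py> '\U0001F600'.encode('utf-16-be')  # a four-byte surrogate pair
b'\xd8=\xde\x00'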

UTF-8 uses a neat variable encoding where characters with low ordinal 
values get encoded as a single byte (better still: it is the same byte as 
ASCII uses, which means old software that assumes everything in the world 
is ASCII will keep working, well mostly working). Higher ordinals get 
encoded as two, three or four bytes[2]. Best of all, unlike most 
historical variable-width encodings, UTF-8 is self-synchronising. In 
legacy encodings, if a single byte gets corrupted, it can mangle 
*everything* from that point on. With UTF-8, a single corrupted byte will 
mangle only the single code-point containing it, everything following 
will be okay.
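
A quick sketch of the variable widths, counting the encoded bytes for 
one character from each range:

py> [len(c.encode('utf-8')) for c in 'Añ€\U0001F600']
[1, 2, 3, 4]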

> So when DB says "writing a
> raw UTF-8 encoded string" - well the only way to do this is to use
> Python3 where the default string literals are stored in Unicode which
> then will use a UTF-8 UTF-16 internally to store the bytes in their
> respective structures; or, one could use u'Jalape' which is unicode in
> both languages (note the leading 'u').

Python never uses UTF-8 internally for storing strings in memory. Because 
it is a variable width encoding, you cannot index strings efficiently if 
they use UTF-8 for storage.

Instead, Python uses one of three different systems:

- Up to Python 3.3, you have a choice. When you compile the Python 
interpreter, you can choose whether it should use UTF-16 or UTF-32 for in-
memory storage. This choice is called "narrow" or "wide" build. A narrow 
build uses less memory, but cannot handle code points above U+FFFF very 
well. A wide build uses more memory, but handles the complete range of 
code points perfectly.

- Starting in Python 3.3, the choice of how to store the string in memory 
is no longer decided up front when you build the Python interpreter. 
Instead, Python automatically chooses the most efficient internal 
representation for each individual string. Strings which only use ASCII 
or Latin-1 characters use one byte per character; strings which use code 
points up to U+FFFF use two bytes per character; and only strings which 
use code points above that use four bytes per character.
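
On Python 3.3 or later you can see this with sys.getsizeof. The exact 
numbers vary with version and platform, so treat this as a sketch:

import sys
sys.getsizeof('a' * 1000)           # about 1000 bytes plus fixed overhead
sys.getsizeof('\u0394' * 1000)      # about 2000 bytes plus overhead
sys.getsizeof('\U0001F600' * 1000)  # about 4000 bytes plus overhead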


> 2. So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for
> readability) what DB is saying is that, the stupid-user would expect
> Jalapeno with a squiggly-n but instead he gets is: Jalape funny1 funny2
> o (spaces for readability) -9 glyphs or 9 Unicode-points or 9-UTF8
> characters. Correct?

Kind of. See above.


> 3. Which leaves me wondering what he means by: "This is because in
> UTF-8, the multi- byte sequence \xc3\xb1 is supposed to represent the
> single character U+00F1, not the two characters U+00C3 and U+00B1"

He means that the single code point U+00F1 (character ñ, n with a tilde) 
is stored as the two bytes c3b1 (in hexadecimal) if you encode it using 
UTF-8. But if you stuff characters \xc3 \xb1 into a Unicode string 
(instead of bytes), then you get two Unicode characters U+00C3 and U+00B1.

To put it another way, inside strings, Python treats the hex escape \xC3 
as just a different way of writing the Unicode code point \u00C3 or 
\U000000C3.
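
It is easy to verify that the three escapes name the same code point:

py> '\xc3' == '\u00C3' == '\U000000C3'
True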

However, if you create a byte-string:

b'Jalape\xc3\xb1o'

by looking up a table of UTF-8 encodings, as presumably the original 
poster did, and then decode those bytes to a string, you will get what 
you expect. Using Python 2.5, where the b prefix is not needed:

py> tasty = 'Jalape\xc3\xb1o'  # actually bytes
py> tasty.decode('utf-8')
u'Jalape\xf1o'
py> print tasty.decode('utf-8')  # oops I forgot to reset my terminal
JalapeУБo
py> print tasty.decode('utf-8')  # terminal now set to UTF-8
Jalapeño
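
For completeness, the same round trip in Python 3, where the b prefix 
is required:

py> tasty = b'Jalape\xc3\xb1o'
py> tasty.decode('utf-8')
'Jalapeño'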

 
> Could someone take the time to read carefully and clarify what DB is
> saying??

Hope this helps.




[1] Assume the font file is 100K in size, and it has glyphs for 512 
characters. That works out to roughly 195 bytes per glyph.

[2] Technically, the UTF-8 scheme can handle 31-bit code points, up to 
the (hypothetical) code point U+7FFFFFFF, using up to six bytes per code 
point. But Unicode officially will never go past U+10FFFF, and so UTF-8 
also will never go past four bytes per code point.



-- 
Steven


