python 3.3 repr

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Nov 15 12:10:45 EST 2013


On Fri, 15 Nov 2013 14:43:17 +0000, Robin Becker wrote:

> Things went wrong when utf8 was not adopted as the standard encoding,
> thus requiring two string types; it would have been easier to have a len
> function to count bytes as before and a glyphlen to count glyphs. Now, as
> I understand it, we have a complicated mess under the hood for unicode
> objects, so they have a variable representation to approximate an 8 bit
> representation when suitable etc etc etc.

No no no! Glyphs are *pictures*: you know, the little blocks of pixels 
that you see on your monitor or printed on a page. Before you can count 
glyphs in a string, you need to know which typeface ("font") is being 
used, since fonts generally lack glyphs for some code points.

[Aside: there's another complication. Some fonts define alternate glyphs 
for the same code point, so that the design of (say) the letter "a" may 
vary within the one string according to whatever typographical rules the 
font supports and the application calls for. So the question is, when you 
"count glyphs", should you count "a" and "alternate a" as a single glyph 
or as two?]

You don't actually mean counting glyphs; you mean counting code points 
(think characters, only with some complications that aren't important for 
the purposes of this discussion).
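
To give one quick illustration of those complications: len() counts code 
points, and two strings that render identically on screen can contain 
different numbers of them once combining characters get involved:

py> import unicodedata
py> composed = "he\u00e5vy"                              # precomposed å
py> decomposed = unicodedata.normalize('NFD', composed)  # 'a' + combining ring
py> len(composed)
5
py> len(decomposed)
6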

UTF-8 is utterly unsuited for in-memory storage of text strings; I don't 
care how many languages (Go, Haskell?) make that mistake. When you're 
dealing with text strings, the fundamental unit is the character, not the 
byte. Why do you care how many bytes a text string has? If you really 
need to know how much memory an object is using, that's when you use 
sys.getsizeof(), not len().

We don't say len({42: None}) to discover that the dict requires 136 
bytes, so why would you use len("heåvy") to learn that it uses 23 bytes?
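
That's the difference in a nutshell. (The byte figures below are whatever 
my particular build reports; yours will almost certainly differ.)

py> import sys
py> len({42: None})            # how many items
1
py> sys.getsizeof({42: None})  # how much memory
136
py> len("heåvy")               # how many characters
5
py> sys.getsizeof("heåvy")     # how much memory
23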

UTF-8 is a variable-width encoding, which means it's *rubbish* for the 
in-memory representation of strings. Counting characters is slow. Slicing 
is slow. If you have mutable strings, deleting or inserting characters is 
slow. Every operation has to effectively start at the beginning of the 
string and count forward, lest it split the string in the middle of a 
multi-byte character. Or worse, the language doesn't give you any 
protection from this at all, so rather than slow string routines you have 
unsafe string routines, and it's your responsibility to detect the 
character boundaries yourself. 
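
To see why, here's a rough sketch of what a character count over raw 
UTF-8 has to do. It's not necessarily what any real implementation does, 
but the shape is unavoidable: you can't jump straight to the nth 
character, you have to walk the bytes and skip the continuation bytes.

py> def utf8_len(data):
...     # Continuation bytes look like 0b10xxxxxx; every other byte starts
...     # a new character, so counting means touching every byte.
...     return sum(1 for byte in data if byte & 0xC0 != 0x80)
...
py> utf8_len("heåvy".encode('utf-8'))
5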

In case you aren't familiar with what I'm talking about, here's an 
example using Python 3.2, starting with a Unicode string and treating it 
as UTF-8 bytes:

py> u = "heåvy"
py> s = u.encode('utf-8')
py> for c in s:
...     print(chr(c))
...
h
e
Ã
¥
v
y


"Ã¥"? It didn't take long to get moji-bake in our output, and all I did 
was print the (byte) string one "character" at a time. It gets worse: we 
can easily end up with invalid UTF-8:

py> a, b = s[:len(s)//2], s[len(s)//2:]  # split the string in half
py> a.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2: 
unexpected end of data
py> b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: 
invalid start byte
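
Compare that with working on the text string itself, where iteration and 
slicing operate on characters and can't leave you holding half a 
character:

py> u = "heåvy"
py> for c in u:
...     print(c)
...
h
e
å
v
y
py> u[:len(u)//2], u[len(u)//2:]
('he', 'åvy')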


No, UTF-8 is okay for writing text to files, but it's not suitable for 
text strings in memory. The in-memory representation of text strings 
should be constant width, based on characters not bytes, and should 
prevent the caller from accidentally ending up with mojibake or invalid 
strings.
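
For what it's worth, that is more or less what the flexible 
representation in 3.3 gives you: each individual string is stored at a 
fixed 1, 2 or 4 bytes per code point, whichever is wide enough for its 
widest character, so indexing stays O(1) while ASCII-only strings stay 
small. You can see the effect with sys.getsizeof (the exact figures 
depend on the build):

py> import sys
py> a = "a" * 1000           # ASCII: 1 byte per character
py> b = "\u20ac" * 1000      # €: needs 2 bytes per character
py> c = "\U0001F40D" * 1000  # outside the BMP: needs 4 bytes per character
py> sys.getsizeof(a) < sys.getsizeof(b) < sys.getsizeof(c)
True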


-- 
Steven


