python 2.7 and unicode (one more time)

Thu Nov 20 19:32:12 EST 2014

Marko Rauhamaa wrote:

> Michael Torrie <torriem at gmail.com>:
> 
>> Unicode can only be encoded to bytes.
>> Bytes can only be decoded to unicode.
> 
> I don't really like it how Unicode is equated with text, or even
> character strings.

That surely depends on the context. To be technically correct, Unicode is a
character set together with a set of rules for dealing with its characters (e.g.
rules for uppercasing characters, sorting rules, etc.). When referring to
the standard, "Unicode" is a noun; when referring to text, it is actually
an adjective being used as a noun. That is, "Unicode text" has become
abbreviated as just "Unicode" in much the same way as "human beings" has
become abbreviated as just "humans".

In that sense, "text is Unicode" just means "in the context in which we are
talking, when I say 'text' I mean 'Unicode text' as opposed to (for
example) 'ASCII text' or 'KOI-8 text'." It certainly doesn't mean that
*all* text in other contexts is Unicode, since that is obviously untrue.

(E.g. there are millions of existing files across the world containing text
which use legacy encodings that are not compatible with Unicode.)
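
A small sketch of what that means in practice, tying back to Michael's
point about encoding and decoding. The sample word and the choice of
KOI8-R are mine, picked purely for illustration; the codec names are the
ones that ship with Python. The u'' literals keep it runnable on 2.7 as
well as 3.3+.

```python
# Text (a Unicode string) is encoded to bytes with a named codec.
text = u'\u041f\u0440\u0438\u0432\u0435\u0442'  # Russian "Privet"
raw = text.encode('koi8_r')   # six bytes, one per Cyrillic letter

# Bytes are decoded back to text, but only the right codec recovers it.
decoded = raw.decode('koi8_r')
misread = raw.decode('latin-1')   # succeeds, but yields mojibake

print(decoded == text)
print(misread == text)
```

The bytes on disk carry no label saying which codec produced them, so the
reader has to know or guess the encoding before the file becomes text.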

> There's barely any difference between the truth value of these
> statements:
> 
>    Python strings are ASCII.
> 
>    Python strings are Latin-1.
> 
>    Python strings are Unicode.
> 
> Each of those statements is true as long as you stay within the
> respective character sets, and cease to be true when your text contains
> characters outside the character sets.

When we say "Python strings are FOO", we are making a statement about
arbitrary Python strings, not a particular set of concrete examples of
strings. If Python strings are FOO, that means that for all possible Python
strings s, "s is FOO" is a true statement.

We cannot say that Python strings are uppercase, because we can easily find
counter-examples such as 'xyz'. Likewise we cannot say Python strings are
ASCII, or Latin-1, because we can easily find counter-examples such as 'Ř'.

On the other hand, Python strings *are* Unicode, because by design Python
strings are limited to Unicode. Every Python string is a Unicode string.
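
A quick sketch of that counter-example check. The helper name fits_in is
made up for illustration; it simply asks whether every character of a
string can be encoded in a given character set. I am assuming the input
is a Unicode string (unicode in 2.7, str in 3.x), not a byte string.

```python
def fits_in(text, encoding):
    """Return True if every character in text is encodable with encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

s = u'\u0158'  # LATIN CAPITAL LETTER R WITH CARON
results = {enc: fits_in(s, enc) for enc in ('ascii', 'latin-1', 'utf-8')}
print(results)
```

The string fails for ASCII and Latin-1 but passes for UTF-8, which is an
encoding of Unicode rather than a smaller character set.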

> Now, it is true that Python currently limits itself to the 1,114,112
> Unicode code points. And it likely won't adopt more characters unless
> Unicode does it first. However, text is something more lofty and
> abstract than a sequence of Unicode code points.

You are certainly correct that in its full generality, "text" is much more
than just a string of code points. Unicode strings are a primitive data
type. A powerful and sophisticated text processing application may even
find Python strings too primitive, possibly needing something like ropes of
graphemes rather than strings of code points.

We Western and Northern European speakers -- and I don't know whether Finns
are counted as Northern Europeans or Eastern Europeans -- are lucky in that
our natural languages are well-covered by Unicode. All our graphemes are
also code points, even the "funny ones with accents". As an English
speaker, I have to remind myself that not every grapheme is a single code
point, but Devanagari or Navajo writers will never make that mistake.
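
To make the grapheme versus code point distinction concrete, here is a
small sketch using only the standard unicodedata module. The sample
characters are my own choice: an accented Latin letter that has a
precomposed form, and a Devanagari syllable that, as far as I know, does
not.

```python
import unicodedata

# "e with acute" can be one code point or two.
composed = u'\u00e9'      # LATIN SMALL LETTER E WITH ACUTE
decomposed = u'e\u0301'   # "e" followed by COMBINING ACUTE ACCENT
recomposed = unicodedata.normalize('NFC', decomposed)

print(len(composed), len(decomposed))   # 1 and 2 code points
print(recomposed == composed)           # NFC folds the pair into one

# Devanagari "ki": KA followed by VOWEL SIGN I. Readers see one written
# unit, but normalization leaves it as two code points.
ki = u'\u0915\u093f'
ki_nfc = unicodedata.normalize('NFC', ki)
print(len(ki_nfc))
```

len() counts code points, so anything that needs to count or slice by
what a reader perceives as a character needs more than a plain string.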

> We shouldn't call strings Unicode any more than we call numbers IEEE or
> times ISO.

We certainly shouldn't call numbers IEEE, but we might very well call them
IEEE-754. Actually, since IEEE-754 covers multiple formats, we have to be
more specific:

Python floats are IEEE-754 double-precision binary floats.
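
One way to check that claim on a given machine, as a sketch rather than
a proof: sys.float_info reports the parameters of the underlying C
double, and the struct module's standard 'd' format packs a float as an
8-byte IEEE-754 binary64 value. This assumes a mainstream platform where
the C double is IEEE-754, which is the usual case for CPython.

```python
import struct
import sys

# An IEEE-754 double has a 53-bit significand and a max exponent of 1024.
mant_dig = sys.float_info.mant_dig
max_exp = sys.float_info.max_exp

# Big-endian binary64 layout of 1.0: sign 0, biased exponent 0x3ff,
# fraction bits all zero.
packed = struct.pack('>d', 1.0)

print(mant_dig, max_exp)
print(len(packed))
print(packed == b'\x3f\xf0\x00\x00\x00\x00\x00\x00')
```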

-- 
Steven