python 2.7 and unicode (one more time)

Thu Nov 20 18:49:28 EST 2014

On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Michael Torrie <torriem at gmail.com>:
>
>> Unicode can only be encoded to bytes.
>> Bytes can only be decoded to unicode.
>
> I don't really like how Unicode is equated with text, or even with
> character strings.
>
> There's barely any difference between the truth value of these
> statements:
>
>    Python strings are ASCII.
>
>    Python strings are Latin-1.
>
>    Python strings are Unicode.
>
> Each of those statements is true as long as you stay within the
> respective character sets, and cease to be true when your text contains
> characters outside the character sets.

The difference is that ASCII and Latin-1 cut out a large number of
active world languages, UCS-2 (the intermediate option you didn't
mention) cuts out a small proportion (by usage) of significant
characters, and Unicode cuts out only those characters which fall
under issues like Han unification. (Plus any that haven't yet been
allocated. But since Python doesn't actually validate code points to
ensure that they've been given meanings, you can use today's Python to
work with tomorrow's Unicode.)
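
To make that concrete, here is a rough sketch. It assumes Python 3, where
str is a sequence of code points; the helper can_encode and the sample
string are made up for the example, and U+50000 is just a code point that
is unassigned at the time of writing.

```python
import unicodedata

# Sample text: a Latin-1 character, two CJK characters, and one
# character outside the Basic Multilingual Plane (U+1F600).
text = "na\u00efve \u4e2d\u6587 \U0001F600"


def can_encode(s, encoding):
    """Return True if every character of s is representable in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


ascii_ok = can_encode(text, "ascii")     # False: fails on the first non-ASCII character
latin1_ok = can_encode(text, "latin-1")  # False: CJK is outside Latin-1
utf8_ok = can_encode(text, "utf-8")      # True: UTF-8 covers every code point used here

# UCS-2 can only hold code points up to U+FFFF, so U+1F600 does not fit.
fits_ucs2 = all(ord(ch) <= 0xFFFF for ch in text)

# Unicode encodes to bytes, and bytes decode back to Unicode.
round_trip = text.encode("utf-8").decode("utf-8")

# A code point with no assigned meaning is still a valid string element.
future_char = chr(0x50000)
future_category = unicodedata.category(future_char)  # "Cn" while it stays unassigned
future_round_trip = future_char.encode("utf-8").decode("utf-8")
```

The last three lines are the "tomorrow's Unicode" case: the string holds
the code point and round-trips it without caring whether anyone has
given it a meaning yet.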

Do you have actual text that you're unable to represent in Unicode? If
so, you are going to have major problems using it with *any* computer
system. There are Japanese encodings that can represent additional
characters, but they also *cannot* represent a lot of the other
characters we use, so there'll be fundamental incompatibilities.

> Now, it is true that Python currently limits itself to the 1,114,112
> Unicode code points. And it likely won't adopt more characters unless
> Unicode does it first. However, text is something more lofty and
> abstract than a sequence of Unicode code points.
>
> We shouldn't call strings Unicode any more than we call numbers IEEE or
> times ISO.

We don't call numbers IEEE, but if we're working with Python floats,
we *do* require all numbers to be representable as IEEE
floating-point. Don't like that? Pick decimal.Decimal instead, or
fractions.Fraction, and pick a different set of limitations... but
ultimately, you *will* have restrictions - and much tighter
restrictions than Unicode places on text.
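
A quick sketch of what "a different set of limitations" looks like in
practice, assuming Python 3 and the default decimal context (28
significant digits); the variable names are only for illustration.

```python
from decimal import Decimal
from fractions import Fraction

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum is not exactly 0.3.
float_sum = 0.1 + 0.2
float_exact = (float_sum == 0.3)

# Decimal handles decimal fractions exactly...
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_exact = (decimal_sum == Decimal("0.3"))

# ...but one third gets rounded to the context precision.
decimal_third = Decimal(1) / Decimal(3)
decimal_third_exact = (decimal_third * 3 == 1)

# Fraction represents any rational number exactly...
fraction_third = Fraction(1, 3)
fraction_third_exact = (fraction_third * 3 == 1)

# ...but an irrational result falls back to an ordinary float.
root_two = Fraction(2) ** Fraction(1, 2)
root_two_is_float = isinstance(root_two, float)
```

Each type is exact for some numbers and lossy for others; you only get
to choose where the boundary sits.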

Do you genuinely have text that you can't represent in Unicode, or are
you just arguing against Unicode to try to justify "Python strings are
<something else>" as a basis for your code?

ChrisA