python 2.7 and unicode (one more time)

Thu Nov 20 20:10:12 EST 2014

On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> (E.g. there are millions of existing files across the world containing text
> which use legacy encodings that are not compatible with Unicode.)

Not compatible with Unicode? There aren't many character sets out
there that include characters not in Unicode - that was the whole
point. Of course, there are plenty of files in unspecified eight-bit
encodings, so you may have a problem with reliable decoding - but if
you know what the encoding is, you ought to be able to represent each
character in Unicode.

Files that aren't valid in any of the UTFs are a different matter, and
there are plenty of those in the world.
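
A small sketch of the distinction, assuming Python 3 (where str is
Unicode; in 2.7 the same calls work on byte strings and give you
unicode objects). The sample bytes and the choice of Latin-1 are my own
illustration, not anything from a real file:

```python
# Bytes as they might sit in a legacy file: "naive cafe" with accents,
# stored in Latin-1 (ISO 8859-1). The encoding is an assumed example.
legacy_bytes = b"na\xefve caf\xe9"

# Knowing the encoding, every byte maps to a Unicode character.
text = legacy_bytes.decode("latin-1")

# The same bytes are not valid UTF-8, so decoding with the wrong codec fails.
try:
    legacy_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Re-encoding the text as UTF-8 gives different bytes for the same characters.
utf8_bytes = text.encode("utf-8")
```

So the text itself is perfectly representable in Unicode; it's only the
bytes on disk that aren't in a UTF.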

> You are certainly correct that in its full generality, "text" is much more
> than just a string of code points. Unicode strings are a primitive data
> type. A powerful and sophisticated text processing application may even
> find Python strings too primitive, possibly needing something like ropes of
> graphemes rather than strings of code points.

That's probably more an efficiency point, though. It should be
possible to do a perfect two-way translation between your grapheme
rope and a Python string; otherwise, you'll have great difficulty
saving your file to the disk (which will normally involve representing
the text in Unicode, then encoding that to bytes).
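
To make the two-way translation concrete, here's a rough sketch
(assuming Python 3). The rough_clusters helper is hypothetical and only
glues combining marks onto the preceding character; real grapheme
segmentation follows the rules in UAX #29 and handles a lot more than
this:

```python
import unicodedata

def rough_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "e" + COMBINING ACUTE ACCENT, then "x": three code points, two clusters.
text = "e\u0301x"
clusters = rough_clusters(text)

# Going back to a plain string loses nothing.
round_tripped = "".join(clusters)

# Saving is still "string -> bytes" via an encoding.
on_disk = round_tripped.encode("utf-8")
```

Whatever structure sits on top, joining the pieces gets you the
original string back, and that string is what gets encoded to disk.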

To be sure, a Python string is a poor representational form for a text
editor. But that's largely because it's immutable, so every little
edit would involve massive copying. Depending on what you're doing, it
might be worth using a chunked UTF-8 byte stream (allowing for
insertion at any chunk boundary), or an array of lines, or something
grapheme-based... but all of those questions are performance, not
correctness, issues.
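
As an illustration of the "array of lines" option, a minimal sketch.
LineBuffer is a made-up class, and it assumes the inserted text
contains no newlines:

```python
class LineBuffer:
    """Hypothetical editor buffer: a list of lines instead of one big string."""

    def __init__(self, text):
        self.lines = text.split("\n")

    def insert(self, row, col, new_text):
        # Only the edited line is rebuilt; every other line is left alone.
        # Assumes new_text contains no newline characters.
        line = self.lines[row]
        self.lines[row] = line[:col] + new_text + line[col:]

    def to_string(self):
        # The buffer still converts losslessly to an ordinary string.
        return "\n".join(self.lines)

buf = LineBuffer("Let it go\nlet it go!")
buf.insert(0, 9, ",")
result = buf.to_string()
```

Each edit copies one line rather than the whole document, which is the
performance win; the result is the same string either way.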

> We Western and Northern European speakers -- and I don't know whether Finns
> are counted as Northern Europeans or Eastern Europeans -- are lucky in that
> our natural languages are well-covered by Unicode. All our graphemes are
> also code points, even the "funny ones with accents". As an English
> speaker, I have to remind myself that not every grapheme is a single code
> point, but Devanagari or Navajo writers will never make that mistake.

I've been working with different languages a bit, lately. Broadly
speaking, you have:

1) Languages which use the Roman alphabet, plus a handful of other
characters (e.g. Finnish, German). These can be represented largely in
ASCII, and used to be handled fairly easily with a single codepage -
an eight-bit ASCII-compatible encoding.

2) Languages which use a different alphabet (e.g. Cyrillic - Russian,
Bulgarian). These also fit into an eight-bit encoding without tipping
ASCII out (KOI8-R and Windows-1251 both do it for Cyrillic). In
Unicode, these languages are all easily supported by the BMP, as they
don't use a huge number of characters each.

3) Languages whose writing system needs thousands of symbols (e.g.
Chinese, or Korean written as precomposed Hangul syllables). The
characters in everyday use are covered by the BMP, but there's no way
you can fit them into eight-bit encodings - one single language will
use far more than 256 symbols.

4) Ancient, esoteric, or symbolic writing systems. Not fundamentally
different from the above categories except that they're less used, and
the BMP has finite space, so many of these are allocated in the SMP.
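
A quick way to poke at these categories, assuming Python 3 (a narrow
2.7 build would split the astral character into a surrogate pair). The
plane helper and the sample characters are my own picks for
illustration:

```python
def plane(ch):
    """Return the Unicode plane of one character (0 = BMP, 1 = SMP)."""
    return ord(ch) >> 16

samples = {
    "German": "\u00df",      # LATIN SMALL LETTER SHARP S
    "Russian": "\u0416",     # CYRILLIC CAPITAL LETTER ZHE
    "Korean": "\ud55c",      # HANGUL SYLLABLE HAN
    "Gothic": "\U00010330",  # GOTHIC LETTER AHSA
}
planes = {name: plane(ch) for name, ch in samples.items()}

# Cyrillic plus ASCII take one byte per character in KOI8-R.
mixed = "\u0416 and Z"
koi8 = mixed.encode("koi8-r")

# A Hangul syllable has no slot in an eight-bit code page such as Latin-1.
try:
    samples["Korean"].encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
```

Only the Gothic letter lands outside plane 0 here; the other three are
all BMP characters, whatever eight-bit encodings can or can't do with
them.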

But all of them are covered by Unicode. (Sadly, they are NOT all
covered by all fonts, so I've been finding that certain pieces of text
come out as strings of little boxes. But I can at least manipulate the
text, even if I can't read it back.) I can, for example, zip lines of
text like this:

English:
Let it go, let it go!
I am one with the wind and sky
Let it go, let it go!
You'll never see me cry!

Icelandic:
Þetta er nóg, þetta er nóg
Uppi í himni eins og vindablær
Þetta er nóg, komið nóg
Og tár mín enginn sér fær

Russian:
Отпусти и забудь,
Этот мир из твоих грёз.
Отпусти и забудь,
И не будет больше слёз.

Output:
Let it go, let it go!
Þetta er nóg, þetta er nóg
Отпусти и забудь,

I am one with the wind and sky
Uppi í himni eins og vindablær
Этот мир из твоих грёз.

Let it go, let it go!
Þetta er nóg, komið nóg
Отпусти и забудь,

You'll never see me cry!
Og tár mín enginn sér fær
И не будет больше слёз.

In fact, it's trivially easy to write something like this, because all
this text is Unicode. ALL of these languages (and plenty more) are
"well-covered by Unicode". There's still the ongoing debate of Han
unification, plus the progressive work of adding characters for
ancient scripts and such, but AFAIK, all writing systems currently in
use are covered.
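
For the record, the zipping amounts to something like this (a sketch
assuming Python 3 and that each language's lines are already decoded
into a list of strings; the variable names are just for illustration):

```python
english = [
    "Let it go, let it go!",
    "I am one with the wind and sky",
    "Let it go, let it go!",
    "You'll never see me cry!",
]
icelandic = [
    "Þetta er nóg, þetta er nóg",
    "Uppi í himni eins og vindablær",
    "Þetta er nóg, komið nóg",
    "Og tár mín enginn sér fær",
]
russian = [
    "Отпусти и забудь,",
    "Этот мир из твоих грёз.",
    "Отпусти и забудь,",
    "И не будет больше слёз.",
]

# zip() pairs up the Nth line of each language; nothing here cares
# which script a line is written in.
stanzas = ["\n".join(group) for group in zip(english, icelandic, russian)]
output = "\n\n".join(stanzas)
print(output)
```

In 2.7 the same thing works as long as the lines are unicode objects
rather than byte strings.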

ChrisA