python 2.7 and unicode (one more time)

Fri Nov 21 10:23:06 EST 2014

Chris Angelico wrote:

> On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> (E.g. there are millions of existing files across the world containing
>> text which use legacy encodings that are not compatible with Unicode.)
> 
> Not compatible with Unicode? There aren't many character sets out
> there that include characters not in Unicode - that was the whole
> point. Of course, there are plenty of files in unspecified eight-bit
> encodings, so you may have a problem with reliable decoding - but if
> you know what the encoding is, you ought to be able to represent each
> character in Unicode.

What I meant was that in some encodings -- namely ASCII and Latin-1 -- the
ordinals are exactly the same as the Unicode code points, that is:

# Python 3
for i in range(128):
    assert chr(i).encode('ASCII') == bytes([i])

for i in range(256):
    assert chr(i).encode('Latin-1') == bytes([i])

That's not quite as significant as I thought, though. What is significant is
that a pure ASCII file on disk can be read by a program assuming UTF-8:

for i in range(128):
    assert chr(i).encode('UTF-8') == bytes([i])

although the same is not the case for Latin-1 encoded files that contain any
bytes in the range 128 to 255.
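
To make the Latin-1 case concrete, here is a rough check. The sample word
is only my own illustration, not taken from any real file. A byte in the
upper half of the range is a complete character in Latin-1, but on its own
it is never a valid UTF-8 sequence:

```python
# Python 3
# A single byte in the range 0x80-0xFF is a complete Latin-1 character,
# but it is never a valid UTF-8 sequence by itself.
for i in range(128, 256):
    latin1_byte = chr(i).encode('Latin-1')
    assert latin1_byte == bytes([i])
    # UTF-8 encodes these code points differently (as multi-byte sequences).
    assert chr(i).encode('UTF-8') != latin1_byte
    try:
        latin1_byte.decode('UTF-8')
    except UnicodeDecodeError:
        pass
    else:
        raise AssertionError('byte %#x unexpectedly decoded as UTF-8' % i)

# Illustrative sample text (an assumption for the example only).
data = 'caf\xe9'.encode('Latin-1')
try:
    data.decode('UTF-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert not decoded_ok
```

So a program assuming UTF-8 will choke on such a file, or silently mangle
it if it is told to replace or ignore errors.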

> Not compatible with any of the UTFs, that's different. Plenty of that
> in the world.
> 
>> You are certainly correct that in its full generality, "text" is much
>> more than just a string of code points. A Unicode string is a primitive
>> data type. A powerful and sophisticated text processing application may
>> even find Python strings too primitive, possibly needing something like
>> ropes of graphemes rather than strings of code points.
> 
> That's probably more an efficiency point, though. It should be
> possible to do a perfect two-way translation between your grapheme
> rope and a Python string; otherwise, you'll have great difficulty
> saving your file to the disk (which will normally involve representing
> the text in Unicode, then encoding that to bytes).

Well, yes. My point, agreeing with Marko, is that any time you want to do
something even vaguely related to human-readable text, "code points" are
not enough. For example, if I give a string containing the following two
code points in this order:

LATIN SMALL LETTER E
COMBINING CIRCUMFLEX ACCENT

then my application should treat that as a single "character" and display it
as:

LATIN SMALL LETTER E WITH CIRCUMFLEX

which looks like this: ê

rather than two distinct "characters" eˆ

Now, that specific example is a no-brainer, because the Unicode
normalization routines will handle the conversion. But not every
combination of base character and combining accents has a single
precomposed code point. What about something like this?

'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
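
A quick sketch of the difference, using the standard unicodedata module.
I only assert that more than one code point is left over for the w example;
exactly how many remain depends on which marks NFC can fold into the base
letter, and I haven't pinned that down here:

```python
# Python 3
import unicodedata

# e + COMBINING CIRCUMFLEX ACCENT has a precomposed form, so NFC
# collapses the pair into a single code point.
e_circumflex = unicodedata.normalize('NFC', 'e\N{COMBINING CIRCUMFLEX ACCENT}')
assert e_circumflex == '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}'
assert len(e_circumflex) == 1

# The w example: there is no single code point for the whole cluster,
# so the normalized string still contains more than one code point.
w_cluster = ('w\N{COMBINING CIRCUMFLEX ACCENT}'
             '\N{COMBINING OGONEK}\N{COMBINING CARON}')
w_nfc = unicodedata.normalize('NFC', w_cluster)
assert len(w_nfc) > 1

# Normalization never changes the meaning of the text: both versions
# decompose to the same sequence of code points.
assert (unicodedata.normalize('NFD', w_nfc)
        == unicodedata.normalize('NFD', w_cluster))
```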

If I insert a character into my string, I want to be able to insert before
the w or after the caron, but not in the middle of those three code points.
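
Something along these lines is what I have in mind. This is only a rough
sketch: cluster_starts and safe_insert are names I made up, and treating
"non-zero combining class" as "belongs to the previous character" is a
simplifying assumption. Real grapheme segmentation (the Unicode text
segmentation rules) covers a lot more cases than this.

```python
# Python 3
import unicodedata

def cluster_starts(text):
    """Return the indexes at which a new base character starts.

    Rough approximation: any code point with a non-zero canonical
    combining class is treated as part of the preceding character.
    """
    return [i for i, ch in enumerate(text)
            if i == 0 or unicodedata.combining(ch) == 0]

def safe_insert(text, index, new):
    """Insert `new` at `index`, or at the next cluster boundary after it
    if `index` falls in the middle of a base-plus-accents group."""
    index = max(0, min(index, len(text)))
    boundaries = cluster_starts(text) + [len(text)]
    position = min(b for b in boundaries if b >= index)
    return text[:position] + new + text[position:]

s = ('w\N{COMBINING CIRCUMFLEX ACCENT}'
     '\N{COMBINING OGONEK}\N{COMBINING CARON}x')
assert cluster_starts(s) == [0, 4]
# Asking to insert between the accents lands after the caron instead.
assert safe_insert(s, 2, '-') == s[:4] + '-' + s[4:]
# Inserting before the w is fine as it is.
assert safe_insert(s, 0, '-') == '-' + s
```

Whether an insert in the middle of a cluster should move forward, move
backward or raise an error is a design choice; moving forward is just the
simplest thing to show.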

-- 
Steven