python 2.7 and unicode (one more time)
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Nov 21 10:23:06 EST 2014
Chris Angelico wrote:
> On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> (E.g. there are millions of existing files across the world containing
>> text which use legacy encodings that are not compatible with Unicode.)
>
> Not compatible with Unicode? There aren't many character sets out
> there that include characters not in Unicode - that was the whole
> point. Of course, there are plenty of files in unspecified eight-bit
> encodings, so you may have a problem with reliable decoding - but if
> you know what the encoding is, you ought to be able to represent each
> character in Unicode.
What I meant was that for some encodings -- namely ASCII and Latin-1 -- the
byte values are exactly equal to the Unicode ordinals, that is:
# Python 3
for i in range(128):
    assert chr(i).encode('ASCII') == bytes([i])
for i in range(256):
    assert chr(i).encode('Latin-1') == bytes([i])
That's not quite as significant as I thought, though. What is significant is
that a pure ASCII file on disk can be read by a program assuming UTF-8:
for i in range(128):
    assert chr(i).encode('UTF-8') == bytes([i])
although the same is not the case for Latin-1 encoded files.
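To make the Latin-1 case concrete, here is a minimal sketch (the sample
strings are just illustrations I picked): pure ASCII bytes decode cleanly as
UTF-8, while Latin-1 bytes above 127 either fail to decode or would decode to
something else, because UTF-8 uses two bytes for those code points.

```python
# Python 3
# Pure ASCII bytes are also valid UTF-8 and decode to the same text.
ascii_bytes = 'hello'.encode('ASCII')
assert ascii_bytes.decode('UTF-8') == 'hello'

# Latin-1 encodes U+00E9 as the single byte 0xE9, which on its own
# is not a complete UTF-8 sequence.
latin1_bytes = 'caf\N{LATIN SMALL LETTER E WITH ACUTE}'.encode('Latin-1')
try:
    latin1_bytes.decode('UTF-8')
except UnicodeDecodeError:
    decoded_ok = False
else:
    decoded_ok = True
assert not decoded_ok

# For every code point from 128 to 255 the UTF-8 form differs from
# the single Latin-1 byte.
for i in range(128, 256):
    assert chr(i).encode('UTF-8') != bytes([i])
```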
> Not compatible with any of the UTFs, that's different. Plenty of that
> in the world.
>
>> You are certainly correct that in its full generality, "text" is much
>> more than just a string of code points. Unicode strings are a primitive
>> data type. A powerful and sophisticated text processing application may
>> even find Python strings too primitive, possibly needing something like
>> ropes of graphemes rather than strings of code points.
>
> That's probably more an efficiency point, though. It should be
> possible to do a perfect two-way translation between your grapheme
> rope and a Python string; otherwise, you'll have great difficulty
> saving your file to the disk (which will normally involve representing
> the text in Unicode, then encoding that to bytes).
Well, yes. My point, agreeing with Marko, is that any time you want to do
something even vaguely related to human-readable text, "code points" are
not enough. For example, if I give a string containing the following two
code points in this order:
LATIN SMALL LETTER E
COMBINING CIRCUMFLEX ACCENT
then my application should treat that as a single "character" and display it
as:
LATIN SMALL LETTER E WITH CIRCUMFLEX
which looks like this: ê
rather than two distinct "characters" eˆ
Now, that specific example is a no-brainer, because the Unicode
normalization routines will handle the conversion. But not every
combination of accented characters has a canonical combined form. What
about something like this?
'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
If I insert a character into my string, I want to be able to insert before
the w or after the caron, but not in the middle of those three code points.
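Here is a rough sketch of what I mean. The clusters() helper is hypothetical,
something I made up for illustration, and it uses a deliberate simplification:
a code point is attached to the previous one whenever
unicodedata.combining() reports a non-zero combining class. The real grapheme
cluster rules (Unicode Standard Annex #29) cover more cases than that, so
treat this as an approximation only.

```python
# Python 3
import unicodedata

def clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified: only a non-zero canonical combining class counts as
    "combining". This is not the full UAX #29 algorithm.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch      # attach the mark to the current cluster
        else:
            groups.append(ch)     # start a new cluster
    return groups

# The easy case: normalization composes e + circumflex into one code point.
e_hat = 'e\N{COMBINING CIRCUMFLEX ACCENT}'
assert unicodedata.normalize('NFC', e_hat) == '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}'

# The hard case: no single code point exists for this combination, so
# it is still more than one code point after normalization.
w = 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
composed = unicodedata.normalize('NFC', w)
assert len(composed) > 1

# Insert only at cluster boundaries, never between the w and its marks.
parts = clusters('a' + w + 'b')
assert parts == ['a', w, 'b']
parts.insert(1, 'X')
result = ''.join(parts)
assert result == 'aX' + w + 'b'
```

An editor built this way would index and insert by cluster rather than by
code point, which is the extra layer on top of plain strings that I am
arguing for.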
--
Steven