python 2.7 and unicode (one more time)

Fri Nov 21 10:40:43 EST 2014

On Sat, Nov 22, 2014 at 2:23 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> Chris Angelico wrote:
>
>> On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
>> <steve+comp.lang.python at pearwood.info> wrote:
>>> (E.g. there are millions of existing files across the world containing
>>> text which use legacy encodings that are not compatible with Unicode.)
>>
>> Not compatible with Unicode? There aren't many character sets out
>> there that include characters not in Unicode - that was the whole
>> point. Of course, there are plenty of files in unspecified eight-bit
>> encodings, so you may have a problem with reliable decoding - but if
>> you know what the encoding is, you ought to be able to represent each
>> character in Unicode.
>
> What I meant was that for some encodings -- namely ASCII and Latin-1 --
> the ordinals are exactly equivalent to Unicode code points, that is:
>
> That's not quite as significant as I thought, though. What is significant is
> that a pure ASCII file on disk can be read by a program assuming UTF-8:
>
> although the same is not the case for Latin-1 encoded files.

Yep. Thing is, Unicode can't magically convert all files on all
disks... but with a good codec library, you can at least convert
things as you find them. (I was reading MacRoman files earlier this
year. THAT is an encoding I didn't expect I'd find in 2014.)
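
For anyone following along, here is a rough sketch of the points above. It
is not the code from the earlier messages; the sample byte strings and
variable names are made up for illustration. It checks that Latin-1 byte
values equal Unicode ordinals, that pure ASCII decodes as UTF-8, that
Latin-1 data with a non-ASCII byte does not, and that a file in a known
legacy encoding (MacRoman here) can be converted with the stock codecs.

```python
# Latin-1 bytes 0x00-0xFF map one-to-one onto code points U+0000-U+00FF
# (ASCII is the 0x00-0x7F subset of that range).
all_bytes = bytes(bytearray(range(256)))
latin1_text = all_bytes.decode('latin-1')
ordinals_match = [ord(c) for c in latin1_text] == list(range(256))

# Pure ASCII data is also valid UTF-8, so a UTF-8 reader handles it.
ascii_data = b'plain old ASCII'          # made-up sample data
ascii_as_utf8 = ascii_data.decode('utf-8')

# Latin-1 data with a non-ASCII byte is not valid UTF-8 in this case.
latin1_data = u'caf\xe9'.encode('latin-1')   # b'caf\xe9'
try:
    latin1_data.decode('utf-8')
    latin1_is_utf8 = True
except UnicodeDecodeError:
    latin1_is_utf8 = False

# If the legacy encoding is known, decode with its codec and re-encode.
macroman_data = b'caf\x8e'               # 0x8E is e-acute in MacRoman
text = macroman_data.decode('mac_roman')
utf8_data = text.encode('utf-8')
```

The hard part in practice is knowing which encoding a file uses; the
conversion itself is a decode followed by an encode.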

> Well, yes. My point, agreeing with Marko, is that any time you want to do
> something even vaguely related to human-readable text, "code points" are
> not enough. ... What about something like this?
>
> 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
>
> If I insert a character into my string, I want to be able to insert before
> the w or after the caron, but not in the middle of those three code points.

Yes, which is a concern. Also a concern is the ability to detect other
boundaries, like words. None of these can be easily solved; all of
them can be dealt with by using the Unicode character data, which is
better than you get for most legacy encodings. In terms of Python
strings, it still makes sense to insert characters between those
combining characters; so what you're saying is that a text editor
widget needs to be aware of more than just code points. Which is
trivially obvious in the presence of RTL text, too; cursor positions
through differing-direction text will be an issue.
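
As a minimal sketch of what "using the Unicode character data" could look
like for the combining-character case: the helper names below (clusters,
safe_insert_positions) are hypothetical, and the rule used (attach every
character with a nonzero combining class to the preceding base) is a
simplification. Real grapheme cluster segmentation is defined in UAX #29
and covers more cases than this.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified rule (an assumption, not full UAX #29): any character with
    a nonzero canonical combining class joins the previous cluster.
    """
    result = []
    for ch in text:
        if result and unicodedata.combining(ch):
            result[-1] += ch
        else:
            result.append(ch)
    return result

def safe_insert_positions(text):
    """Return code point offsets that do not split a cluster."""
    positions = [0]
    for cluster in clusters(text):
        positions.append(positions[-1] + len(cluster))
    return positions

sample = (u'a'
          u'w\N{COMBINING CIRCUMFLEX ACCENT}'
          u'\N{COMBINING OGONEK}\N{COMBINING CARON}'
          u'b')
parts = clusters(sample)
offsets = safe_insert_positions(sample)
```

With that sample, the w and its three marks end up in one cluster, so an
editor widget built on this would offer a cursor position before the w and
after the caron, but none in between.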

The problems you're citing aren't Unicode problems. They stem from the
complexities of human languages. Unicode just makes them a bit more
visible to English-only-speaking programmers.

ChrisA