Newbie question about text encoding

Sun Mar 8 14:30:45 EDT 2015

Rustom Mody wrote:

> On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote:
>> Rustom Mody wrote:
>> > This includes not just bug-prone-system code such as Java and Windows
>> > but seemingly working code such as python 3.
>> 
>> What Unicode bugs do you think Python 3.3 and above have?
> 
> Literal/Legalistic answer:
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-2135

Nice one :-) but not exactly in the spirit of what we're discussing (as you
acknowledge below), so I won't discuss that.

> [And already quoted at
> http://blog.languager.org/2015/03/whimsical-unicode.html
> ]
> 
> An answer more in the spirit of what I am trying to say:
> Idle3, Roy's example and in general all systems that are
> python-centric but use components outside of python that are
> unicode-broken
> 
> IOW I would expect people (at least people with good faith) reading my
> 
>> bug-prone-system code...seemingly working code such as python 3...
> 
> to interpret that NOT as
> 
> "python 3 is seemingly working but actually broken"

Why not? That is the natural interpretation of the sentence, particularly in
the context of your previous sentence:

    [quote]
    Or you can skip the blame-game and simply note the fact that 
    large segments of extant code-bases are currently in bug-prone
    or plain buggy state.

    This includes not just bug-prone-system code such as Java and
    Windows but seemingly working code such as python 3.
    [end quote]

The natural interpretation of this is that Python 3 is only *seemingly*
working, but is also an example of a code base in "bug-prone or plain buggy
state".

If that's not your intended meaning, then rather than casting aspersions on
my honesty ("good faith" indeed) you might accept that perhaps you didn't
quite manage to get your message across.

> But as
> 
> "Apps made with working system code (eg python3) can end up being broken
> because of other non-working system code - eg mysql, java, javascript,
> windows-shell, and ultimately windows, linux"

Don't forget viruses or other malware, cosmic rays, processor bugs, dry
solder joints on the motherboard, faulty memory, and user-error.

I'm not sure what point you think you are making. If you want to discuss the
fact that complex systems have more interactions than simple systems, and
therefore more ways for things to go wrong, I will agree. I'll agree that
this is an issue with Python code that interacts with other systems which
may or may not implement Unicode correctly. There are a few ways to
interpret this:

(1) You're making a general point about the complexity of modern computing.

(2) You're making the point that dealing with text encodings in general, and
Unicode in particular, is hard because of the interaction of programming
language, database, file system, locale, etc.

(3) You're implying that Python ought to fix this problem somehow.

(4) You're implying that *Unicode* specifically is uniquely problematic in
this way. Or at least *unusually* problematic in this way.

I will agree with 1 and 2; I'll say that 3 would be nice but in the absence
of concrete proposals for how to fix it, it's just meaningless chatter. And
I'll disagree strongly with 4.

Unicode came into existence because legacy encodings suffer from similar
problems, only worse. (One major advantage of Unicode over previous
multi-byte encodings is that the UTF encodings are self-healing. A single
corrupted byte will, *at worst*, cause a single corrupted code point.)
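
As a rough illustration of that self-healing property, here is a small
sketch of what Python 3 does when one byte of a UTF-8 string gets
overwritten. The helper name corrupt_and_decode and the sample string are
made up for this example, and it only covers a byte being *overwritten*,
not inserted or dropped:

```python
def corrupt_and_decode(text, byte_index, bad_byte=0xFF):
    """Encode text as UTF-8, overwrite one byte, and decode leniently.

    Hypothetical helper for illustration only. 0xFF can never appear
    in valid UTF-8, so the decoder is guaranteed to notice the damage.
    """
    data = bytearray(text.encode("utf-8"))
    data[byte_index] = bad_byte
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded
    return data.decode("utf-8", errors="replace")


original = "abc\u00e9def"   # the e-acute encodes as two bytes, at index 3 and 4
damaged = corrupt_and_decode(original, 3)

# The characters on either side of the damaged one are untouched.
print(ascii(damaged))
print(damaged.startswith("abc"), damaged.endswith("def"))
```

The decoder may emit more than one U+FFFD for the damaged character, but
only that one character from the original text is lost: the decoder
resynchronises at the next valid lead byte.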

In one sense, Unicode has solved these legacy encoding problems: if you
always use a correct implementation of Unicode, then you won't *ever*
suffer from problems like mojibake, broken strings and so forth.
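
To make that concrete, here is a minimal sketch (the sample string is
arbitrary, and Latin-1 simply stands in for "some legacy encoding") of the
difference between using one encoding consistently and mixing two:

```python
text = "caf\u00e9"                      # 'café'
data = text.encode("utf-8")             # b'caf\xc3\xa9'

# Same encoding on the way out and on the way in: the text round-trips.
consistent = data.decode("utf-8")

# Bytes written as UTF-8 but read back as Latin-1: mojibake.
mismatched = data.decode("latin-1")

print(consistent)    # café
print(mismatched)    # cafÃ©

# In this particular case the damage can be undone by reversing the
# wrong step, because Latin-1 maps every byte to exactly one code point.
repaired = mismatched.encode("latin-1").decode("utf-8")
print(repaired == text)    # True
```

The mojibake here comes from the mismatch between writer and reader, not
from anything in UTF-8 itself.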

In another sense, Unicode hasn't solved these legacy problems because we
still have to deal with files using legacy encodings, as well as standards
organisations, operating systems, developers, applications and users who
continue to produce new content using legacy encodings. We also have to
deal with buggy or incorrect implementations of the standard, not to
mention viruses, cosmic rays, dry solder joints and user-error. How are
these things Unicode's fault or responsibility?

-- 
Steven