Newbie question about text encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Mar 7 06:09:35 EST 2015


Rustom Mody wrote:

> On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
[...]
>> Chris is suggesting that going from BMP to all of Unicode is not the hard
>> part. Going from ASCII to the BMP part of Unicode is the hard part. If
>> you can do that, you can go the rest of the way easily.
> 
> Depends where the going is starting from.
> I specifically named Java, Javascript, Windows... among others.
> Here are some quotes from the supplementary chars doc of Java:
>
> http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html
> 
> | Supplementary characters are characters in the Unicode standard whose
> | code points are above U+FFFF, and which therefore cannot be described as
> | single 16-bit entities such as the char data type in the Java
> | programming language. Such characters are generally rare, but some are
> | used, for example, as part of Chinese and Japanese personal names, and
> | so support for them is commonly required for government applications in
> | East Asian countries...
> 
> | The introduction of supplementary characters unfortunately makes the
> | character model quite a bit more complicated.
> 
> | Unicode was originally designed as a fixed-width 16-bit character
> | encoding. The primitive data type char in the Java programming language
> | was intended to take advantage of this design by providing a simple data
> | type that could hold
> | any character....  Version 5.0 of the J2SE is required to support
> | version 4.0 of the Unicode standard, so it has to support supplementary
> | characters.
> 
> My conclusion: Early adopters of unicode -- Windows and Java -- were
> punished for their early adoption. You can blame the unicode consortium,
> you can blame the babel of human languages, particularly that some use
> characters and some only (the equivalent of) what we call words.

I see you are blaming everyone except the people actually to blame.

It is 2015. Unicode 2.0 introduced the supplementary planes in 1996, almost
twenty years ago, the same year as the 1.0 release of Java. Java has had
eight major releases since then. Oracle, and Sun before them, are serious,
tier-1, world-class IT companies. Why haven't they introduced proper
support for Unicode in Java? It's not hard -- if Python can do it using
nothing but volunteers, Oracle can do it. They could even do it in a
backwards-compatible way, by leaving the existing APIs in place and adding
new ones.

As for Microsoft, as a member of the Unicode Consortium they have no excuse.
But I think you exaggerate the lack of support for SMPs in Windows. Some
parts of Windows have no SMP support, but they tend to be the oldest and
less important (to Microsoft) parts, like the command prompt.

Does anyone have PowerShell and care to check how well it supports SMP?

This Stackoverflow question suggests that post-Windows 2000, the Windows
file system has proper support for code points in the supplementary planes:

http://stackoverflow.com/questions/7870014/how-does-windows-wchar-t-handle-unicode-characters-outside-the-basic-multilingua

Or maybe not.


> Or you can skip the blame-game and simply note the fact that large
> segments of extant code-bases are currently in bug-prone or plain buggy
> state.
> 
> This includes not just bug-prone-system code such as Java and Windows but
> seemingly working code such as python 3.

What Unicode bugs do you think Python 3.3 and above have?
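For what it's worth, since the PEP 393 flexible string representation
landed in Python 3.3, strings are sequences of code points regardless of
plane, so an SMP character behaves exactly like an ASCII one:

```python
s = "a\U0001F600b"  # U+1F600 GRINNING FACE is an SMP code point

assert len(s) == 3                  # one code point, not a surrogate pair
assert s[1] == "\U0001F600"         # indexing works per code point
assert s.encode("utf-8") == b"a\xf0\x9f\x98\x80b"
assert s.encode("utf-8").decode("utf-8") == s   # round-trips cleanly
```

No special cases, no narrow/wide build distinction, no surrogates leaking
into string indexing.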


>> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
>> UTF-8 and UTF-32, since that goes against the grain of the system. You
>> would have to program in artificial restrictions that otherwise don't
>> exist.
> 
> Yes  UTF-8 and UTF-32 make most of the objections to unicode 7.0
> irrelevant.

Glad you agree about that much at least.


[...]
>> Conclusion: faulty implementations of UTF-16 which incorrectly handle
>> surrogate pairs should be replaced by non-faulty implementations, or
>> changed to UTF-8 or UTF-32; incomplete Unicode implementations which
>> assume that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should
>> be upgraded.
> 
> Imagine for a moment a thought experiment -- we are not on a python but a
> java forum and please rewrite the above para.

There is no need to re-write it. If Java's only implementation of Unicode
assumes that code points are 16 bits only, then Java needs a new Unicode
implementation. (I assume that the existing one cannot be changed for
backwards-compatibility reasons.)
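For reference, the arithmetic a correct UTF-16 decoder must perform when it
meets a surrogate pair is hardly rocket science. Here it is hand-decoded in
Python (the pair below is just sample input for illustration):

```python
# Decode one UTF-16 surrogate pair into a code point.
hi, lo = 0xD83D, 0xDE00            # surrogate pair for U+1F600
assert 0xD800 <= hi <= 0xDBFF      # high (lead) surrogate range
assert 0xDC00 <= lo <= 0xDFFF      # low (trail) surrogate range

code_point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
assert code_point == 0x1F600
assert chr(code_point) == "\U0001F600"
```

Any implementation that instead treats `hi` and `lo` as two separate
"characters" is simply wrong, and that wrongness is a bug in the
implementation, not in the standard.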


> Are you addressing the vanilla java programmer? Language implementer?
> Designer? The Java-funders -- earlier Sun, now Oracle?

The last three should be considered the same people.

The vanilla Java programmer is not responsible for the shortcomings of
Java's implementation.


[...]
>> > In practice, standards change.
>> > However if a standard changes so frequently that users have to
>> > play catch-up and keep asking: "Which version?" they are justified
>> > in asking "Are the standard-makers doing due diligence?"
>> 
>> Since Unicode has stability guarantees, and the encodings have not
>> changed in twenty years and will not change in the future, this argument
>> is bogus. Updating to a new version of the standard means, to a first
>> approximation, merely allocating some new code points which had
>> previously been undefined but are now defined.
>> 
>> (Code points can be flagged deprecated, but they will never be removed.)
> 
> It's not about new code points; it's about "Fits in 2 bytes" to "Does not
> fit in 2 bytes"

I quote you again:

"if a standard changes so frequently..."

The move to more than 16 bits happened once. It happened almost 20 years
ago. In what way does this count as frequent changes?


> If you call that argument bogus I call you a non computer scientist.

I am not a computer scientist, and the argument remains bogus. Unicode does
not change "frequently", and changes are backward-compatible.


> [Essentially this is my issue with the consortium: it seems to be working
> like a bunch of linguists, not computer scientists]

That's rather like complaining that some computer game looks like it was
designed by games players instead of theoreticians. "Why, people have FUN
playing this, almost like it was designed by professionals who think about
gaming!!!"

Unicode is a standard intended for the handling of human languages. It is
intended as a real-life working standard, not some theoretical toy for
academics to experiment with. It is designed to be used, not to have papers
written about it. The character set part of it has effectively been
designed by linguists, and that is a good thing. But the encoding side of
things has been designed by practising computer programmers such as Rob
Pike and Ken Thompson. You might have heard of them.


> Here is Roy's Smith post that first started me thinking that something may
> be wrong with SMP
> https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ

There are plenty of things wrong with some implementations of Unicode, those
that assume all code points are two bytes.

There may be a few things wrong with the current Unicode standard, such as
missing characters, characters given the wrong name, and so forth.

But there's nothing wrong with the design of the SMP. It allows the great
majority of text, probably 99% or more, to use two bytes (UTF-16) or no
more than three bytes (UTF-8), while only relatively specialised uses need
four bytes for some code points.
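Those byte counts are easy to check from Python; the characters below are
just arbitrary representatives of each range:

```python
# Bytes needed per character, by range, in UTF-8 and UTF-16.
samples = {
    "A": "ASCII",                   # U+0041
    "\u00e9": "Latin-1 range",      # U+00E9, e-acute
    "\u20ac": "rest of the BMP",    # U+20AC, euro sign
    "\U0001F600": "SMP",            # U+1F600
}
for ch, label in samples.items():
    utf8 = len(ch.encode("utf-8"))
    utf16 = len(ch.encode("utf-16-le"))
    print(f"U+{ord(ch):04X} ({label}): UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

Only the last sample, the SMP character, needs four bytes in either
encoding; everything in the BMP fits in two or three.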


> Some parts are here some earlier and from my memory.
> If details wrong please correct:
> - 200 million records
> - Containing 4 strings with SMP characters
> - System made with python and mysql. SMP works with python, breaks mysql.
>   So whole system broke due to those 4 in 200,000,000 records

No, they broke because MySQL has buggy Unicode handling: its legacy "utf8"
character set allows at most three bytes per character, so it cannot store
SMP characters at all. Storing them requires the "utf8mb4" character set.

Bugs are not unusual. I used to have a version of Apple's Hypercard which
would lock up the whole operating system if you tried to display the
string "0^0" in a message dialog. Given that classic Mac OS was not a
proper multi-tasking OS like Unix or OS X or even Windows, this was a real
pain. My conclusion from that is that that version of Hypercard was buggy.
What is your conclusion?


> I know enough (or not enough) of unicode to be chary of statistical
> conclusions from the above.
> My conclusion is essentially an 'existence-proof':
> 
> SMP-chars can break systems.

Oh come on. How about this instead?

X can break systems, for every conceivable value of X.


> The breakage is costly-fied by the combination
> - layman statistical assumptions
> - BMP → SMP exercises different code-paths
> 
> It is necessary but not sufficient to test print "hello world" in ASCII,
> BMP, SMP. You also have to write the hello world in the database -- mysql
> Read it from the webform -- javascript
> etc etc

Yes. This is called "integration testing". That's what professionals do.


> You could also choose do with "astral crap" (Roy's words) what we all do
> with crap -- throw it out as early as possible.

And when Roy's customers demand that his product support emoji, or complain
that they cannot spell their own name because of his parochial and ignorant
idea of "crap", perhaps he will consider doing what he should have done
from the beginning:

Stop using MySQL, which is a joke of a database[1], and use Postgres which
does not have this problem.




[1] So I have been told.


-- 
Steven



