Newbie question about text encoding

Rustom Mody rustompmody at gmail.com
Thu Mar 5 23:53:08 EST 2015


On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
> 
> > On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
> >> On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody  wrote:
> >> >
> >> > It lists some examples of software that somehow break/goof going from
> >> > BMP-only unicode to 7.0 unicode.
> >> >
> > > IOW the suggestion is that the two-way classification
> >> > - ASCII
> >> > - Unicode
> >> >
> >> > is less useful and accurate than the 3-way
> >> >
> >> > - ASCII
> >> > - BMP
> >> > - Unicode
> >> 
> >> How is that more useful? Aside from storage optimizations (in which
> >> the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
> >> not significantly different from the rest of Unicode.
> > 
> > Sorry... Don't understand.
> 
> Chris is suggesting that going from BMP to all of Unicode is not the hard
> part. Going from ASCII to the BMP part of Unicode is the hard part. If you
> can do that, you can go the rest of the way easily.

That depends on where the going starts from.
I specifically named Java, Javascript, Windows... among others.
Here are some quotes from the supplementary-characters doc for Java:
http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html

| Supplementary characters are characters in the Unicode standard whose code
| points are above U+FFFF, and which therefore cannot be described as single 
| 16-bit entities such as the char data type in the Java programming language. 
| Such characters are generally rare, but some are used, for example, as part 
| of Chinese and Japanese personal names, and so support for them is commonly 
| required for government applications in East Asian countries...

| The introduction of supplementary characters unfortunately makes the 
| character model quite a bit more complicated. 

| Unicode was originally designed as a fixed-width 16-bit character encoding. 
| The primitive data type char in the Java programming language was intended to 
| take advantage of this design by providing a simple data type that could hold 
| any character....  Version 5.0 of the J2SE is required to support version 4.0 
| of the Unicode standard, so it has to support supplementary characters. 
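
To see concretely what that doc is describing -- in Python, since this is
python-list -- a supplementary character simply does not fit in one 16-bit
unit; it becomes a surrogate pair (nothing assumed below beyond the stdlib):

    import struct

    g_clef = '\U0001D11E'  # MUSICAL SYMBOL G CLEF, code point above U+FFFF

    # Encode to UTF-16 and look at the 16-bit units Java's char would see:
    units = struct.unpack('<2H', g_clef.encode('utf-16-le'))
    print([hex(u) for u in units])  # ['0xd834', '0xdd1e'] -- a surrogate pair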

My conclusion: Early adopters of Unicode -- Windows and Java -- were punished
for their early adoption.  You can blame the Unicode consortium, or you can
blame the babel of human languages, particularly the fact that some use
characters and some use only (the equivalent of) what we call words.

Or you can skip the blame-game and simply note the fact that large segments of
extant code-bases are currently in a bug-prone or plainly buggy state.

This includes not just bug-prone system code such as Java and Windows but
seemingly working code such as Python 3.
> 
> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8
> and UTF-32, since that goes against the grain of the system. You would have
> to program in artificial restrictions that otherwise don't exist.

Yes, UTF-8 and UTF-32 make most of the objections to Unicode 7.0 irrelevant.
But large segments of the world's deployed code -- Java, Javascript,
Windows -- are built on UTF-16.
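
A quick sketch of why I concede that point (plain Python 3, nothing assumed
beyond the stdlib codecs):

    # Code-unit counts for a BMP char vs an SMP char in the three UTFs.
    bmp = '\u4e2d'       # CJK ideograph, inside the BMP
    smp = '\U0001F600'   # emoji, outside the BMP

    for ch in (bmp, smp):
        print('U+%05X' % ord(ch),
              'utf-8: %d bytes' % len(ch.encode('utf-8')),
              'utf-16: %d unit(s)' % (len(ch.encode('utf-16-le')) // 2),
              'utf-32: %d unit(s)' % (len(ch.encode('utf-32-le')) // 4))
    # U+04E2D utf-8: 3 bytes utf-16: 1 unit(s) utf-32: 1 unit(s)
    # U+1F600 utf-8: 4 bytes utf-16: 2 unit(s) utf-32: 1 unit(s)

Only UTF-16 changes shape at the BMP/SMP boundary.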
> 
> UTF-16 is different, and that's probably why you think supporting all of
> Unicode is hard. With UTF-16, there really is an obvious distinction
> between the BMP and the SMP: that's where you jump from a single 2-byte
> unit to a pair of 2-byte units. But that distinction doesn't exist in UTF-8
> or UTF-32: 
> 
> - In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you
>   support the SMP or not doesn't change the fact that you have to deal
>   with multi-byte characters.
> 
> - In UTF-32, everything is fixed-width whether it is in the BMP or not.
> 
> In both cases, supporting the SMPs is no harder than supporting the BMP.
> It's only UTF-16 that makes the SMP seem hard.
> 
> Conclusion: faulty implementations of UTF-16 which incorrectly handle
> surrogate pairs should be replaced by non-faulty implementations, or
> changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume
> that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be
> upgraded.

Imagine for a moment a thought experiment: we are not on a Python forum but a
Java one, and please rewrite the above para accordingly.
Are you addressing the vanilla Java programmer? The language implementer? The
designer? The Java-funders -- earlier Sun, now Oracle?
> 
> Wrong conclusion: SMPs are unnecessary and unneeded, and we need a new
> standard that is just like obsolete Unicode version 1.
> 
> Unicode version 1 is obsolete for a reason. 16 bits is not enough for even
> existing languages, let alone all the code points and characters that are
> used in human communication.
> 
> 
> >> Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
> >> do you keep talking about 7.0 as if it's a recent change?
> > 
> > It is 2015 as of now. 7.0 is the current standard.
> > 
> > The need for the adjective 'current' should be pondered.
> 
> What's your point?
> 
> The UTF encodings have not changed since they were first introduced. They
> have been stable for at least twenty years: UTF-8 has existed since 1993,
> and UTF-16 since 1996.
> 
> Since version 2.0 of Unicode in 1996, the standard has made "stability
> guarantees" that no code points will be renamed or removed. Consequently,
> there has only been one version which removed characters, version 1.1.
> Since then, new versions of the standard have only added characters, never
> moved, renamed or deleted them.
> 
> http://unicode.org/policies/stability_policy.html
> 
> Some highlights in Unicode history:
> 
> Unicode 1.0 (1991): initial version, defined 7161 code points.
> 
> In January 1993, Rob Pike and Ken Thompson announced the design and working
> implementation of the UTF-8 encoding.
> 
> 1.1 (1993): defined 34233 characters, finalised Han Unification. Removed
> some characters from the 1.0 set. This is the first and only time any code
> points have been removed.
> 
> 2.0 (1996): First version to include code points in the Supplementary
> Multilingual Planes. Defined 38950 code points. Introduced the UTF-16
> encoding.
> 
> 3.1 (2001): Defined 94205 code points, including 42711 additional Han
> ideographs, bringing the total number of CJK code points alone to 71793,
> too many to fit in 16 bits.
> 
> 2006: The People's Republic Of China mandates support for the GB-18030
> character set for all software products sold in the PRC. GB-18030 supports
> the entire Unicode range, include the SMPs. Since this date, all software
> sold in China must support the SMPs.
> 
> 6.0 (2010): The first emoji or emoticons were added to Unicode.
> 
> 7.0 (2014): 113021 code points defined in total.
> 
> 
> > In practice, standards change.
> > However if a standard changes so frequently that users have to play
> > catch-up and keep asking "Which version?", they are justified in
> > asking "Are the standard-makers doing due diligence?"
> 
> Since Unicode has stability guarantees, and the encodings have not changed
> in twenty years and will not change in the future, this argument is bogus.
> Updating to a new version of the standard means, to a first approximation,
> merely allocating some new code points which had previously been undefined
> but are now defined.
> 
> (Code points can be flagged deprecated, but they will never be removed.)

It's not about new code points; it's about going from "fits in 2 bytes" to
"does not fit in 2 bytes".

If you call that argument bogus, I call you a non-computer-scientist.
[Essentially this is my issue with the consortium: it seems to be working like
a bunch of linguists, not computer scientists.]
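
To put the "fits in 2 bytes" point in code: under the original 16-bit model,
number-of-characters equals number-of-16-bit-units, and the SMP is precisely
where that equation breaks (a minimal Python 3 sketch):

    s_bmp = 'abc\u00e9'        # BMP-only string
    s_smp = 'abc\U0001F600'    # same length, but with one SMP char

    for s in (s_bmp, s_smp):
        units16 = len(s.encode('utf-16-le')) // 2
        print(len(s), units16, len(s) == units16)
    # 4 4 True
    # 4 5 False   (every 16-bit-unit assumption in the code path is now wrong)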

Here is Roy Smith's post that first started me thinking that something may
be wrong with the SMP:
https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ

Some parts are from posts here, some from earlier ones, and some from my
memory; if the details are wrong, please correct them:
- 200 million records
- 4 of them containing strings with SMP characters
- A system made with Python and MySQL: SMP worked with Python but broke MySQL,
  so the whole system broke because of those 4 records in 200,000,000
  (a sketch of the likely mechanism follows below).
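
For the record, the usual mechanism behind exactly this kind of failure -- an
assumption on my part, since Roy did not post his schema: MySQL's legacy
'utf8' charset stores at most 3 bytes per character, i.e. BMP only, while
real UTF-8 needs 4 bytes for SMP characters; the actual fix, 'utf8mb4', only
arrived in MySQL 5.5.

    # Hypothetical reproduction; table and column names are illustrative.
    #
    #   CREATE TABLE t (s VARCHAR(100)) CHARACTER SET utf8;   -- BMP only!
    #   INSERT INTO t VALUES ('...some SMP char...');         -- error/mangling
    #   ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4;       -- the fix
    #
    # The root cause is visible from Python alone:
    smp = '\U0001F600'
    print(len(smp.encode('utf-8')))  # 4 -- one byte more than 'utf8' allows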

I know enough (or not enough) of Unicode to be chary of drawing statistical
conclusions from the above.
My conclusion is essentially an existence proof:

SMP chars can break systems.
The breakage is made costly by the combination of
- layman statistical assumptions ("such characters are rare, so why worry?")
- the fact that BMP → SMP exercises different code-paths

It is necessary but not sufficient to test printing "hello world" in ASCII,
BMP, and SMP. You also have to write that hello world into the database
(MySQL), read it from the web form (Javascript), etc. etc. -- see the probe
sketch below.
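
A sketch of the minimal probe set such end-to-end tests need; round_trip is a
placeholder for whatever path you are exercising (DB write/read, form submit,
and so on), not a real API:

    # Three probes, one per tier; a layer that special-cases the BMP
    # will pass the first two and fail the third.
    PROBES = [
        'hello world',              # ASCII
        'h\u00e9llo w\u00f6rld',    # BMP, multi-byte in UTF-8
        'hello \U0001F30D',         # SMP
    ]

    def check(round_trip):
        for s in PROBES:
            out = round_trip(s)
            assert out == s, 'mangled %r -> %r' % (s, out)

    # Trivial self-check of the harness itself:
    check(lambda s: s.encode('utf-8').decode('utf-8'))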

You could also choose to do with "astral crap" (Roy's words) what we all do
with crap -- throw it out as early as possible.


