Newbie question about text encoding

Fri Mar 6 04:02:35 EST 2015

On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
> On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody wrote:
> > My conclusion: Early adopters of unicode -- Windows and Java -- were punished
> > for their early adoption.  You can blame the unicode consortium, you can
> > blame the babel of human languages, particularly that some use characters
> > and some only (the equivalent of) what we call words.
> >
> > Or you can skip the blame-game and simply note the fact that large segments of
> > extant code-bases are currently in bug-prone or plain buggy state.
> 
> For most of the 1990s, I was writing code in REXX, on OS/2. An even
> earlier adopter, REXX didn't have Unicode support _at all_, but
> instead had facilities for working with DBCS strings. You can't get
> everything right AND be the first to produce anything. Python didn't
> make Unicode strings the default until 3.0, but that's not Unicode's
> fault.
> 
> > This includes not just bug-prone-system code such as Java and Windows but
> > seemingly working code such as python 3.
> >
> > Here is Roy Smith's post that first started me thinking that something may
> > be wrong with SMP
> > https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ
> >
> > Some of the details are in that post, some in earlier ones, and some are
> > from my memory. If any details are wrong, please correct me:
> > - 200 million records
> > - Containing 4 strings with SMP characters
> > - System made with Python and MySQL. SMP works with Python, breaks MySQL.
> >   So the whole system broke due to those 4 in 200,000,000 records
> >
> > I know enough (or not enough) of unicode to be chary of statistical conclusions
> > from the above.
> > My conclusion is essentially an 'existence-proof':
> 
> Hang on hang on. Why are you blaming Python or SMP characters for
> this? The problem here is MySQL, which doesn't adequately cope with
> the full Unicode range. (Or, didn't then, or doesn't with its default
> settings. I believe you can configure current versions of MySQL to
> work correctly, though I haven't actually checked. PostgreSQL gets it
> right, that's good enough for me.)
> 
> > SMP-chars can break systems.
> > The breakage is made costly by the combination of:
> > - layman statistical assumptions
> > - BMP → SMP exercises different code-paths
> 
> Broken systems can be shown up by anything. Suppose you have a program
> that breaks when it gets a NUL character (not unknown in C code); is
> the fault with the Unicode consortium for allocating something at
> codepoint 0, or the code that can't cope with a perfectly normal
> character?

Strawman.

Let's please stick to UTF-16, shall we?

Now tell me:
- Is it broken or not?
- Is it widely used or not?
- Should programmers be careful of it or not?
- Should programmers be warned about it or not?
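
To make the "different code paths" point concrete, here is a minimal Python sketch, assuming Python 3.3 or later (where len() counts code points). The helper names utf16_code_units and has_non_bmp are made up for illustration. It shows that a BMP character takes one 16-bit code unit in UTF-16 while an SMP character takes two (a surrogate pair), and that the same SMP character needs 4 bytes in UTF-8. The 4-byte case would be consistent with the MySQL breakage described above if the table used a storage charset limited to 3 bytes per character, but that is a guess about that system, not something established in this thread.

```python
def utf16_code_units(s):
    """Number of 16-bit code units needed to encode s as UTF-16."""
    # utf-16-le avoids the 2-byte BOM that plain "utf-16" prepends.
    return len(s.encode("utf-16-le")) // 2


def has_non_bmp(s):
    """True if s contains any code point above U+FFFF (outside the BMP)."""
    return any(ord(c) > 0xFFFF for c in s)


bmp_char = "\u20ac"      # U+20AC EURO SIGN, inside the BMP
smp_char = "\U0001F600"  # U+1F600 GRINNING FACE, in the SMP

for label, ch in (("BMP", bmp_char), ("SMP", smp_char)):
    print(
        label,
        "code points:", len(ch),
        "UTF-16 units:", utf16_code_units(ch),
        "UTF-8 bytes:", len(ch.encode("utf-8")),
        "non-BMP:", has_non_bmp(ch),
    )
```

A check like has_non_bmp() is the sort of thing one could run over incoming data to find the 4-in-200,000,000 records before they reach a layer that mishandles them.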