Newbie question about text encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Mar 5 09:06:01 EST 2015


Rustom Mody wrote:

> On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
>> On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody  wrote:
>> >
>> > It lists some examples of software that somehow break/goof going from
>> > BMP-only unicode to 7.0 unicode.
>> >
>> > IOW the suggestion is that the two-way classification
>> > - ASCII
>> > - Unicode
>> >
>> > is less useful and accurate than the 3-way
>> >
>> > - ASCII
>> > - BMP
>> > - Unicode
>> 
>> How is that more useful? Aside from storage optimizations (in which
>> the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
>> not significantly different from the rest of Unicode.
> 
> Sorry... Don't understand.

Chris is suggesting that going from BMP to all of Unicode is not the hard
part. Going from ASCII to the BMP part of Unicode is the hard part. If you
can do that, you can go the rest of the way easily.

I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8
and UTF-32, since that goes against the grain of the system. You would have
to program in artificial restrictions that otherwise don't exist.

UTF-16 is different, and that's probably why you think supporting all of
Unicode is hard. With UTF-16, there really is an obvious distinction
between the BMP and the SMP: that's where you jump from a single 2-byte
unit to a pair of 2-byte units. But that distinction doesn't exist in UTF-8
or UTF-32: 

- In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you
  support the SMP or not doesn't change the fact that you have to deal
  with multi-byte characters.

- In UTF-32, everything is fixed-width whether it is in the BMP or not.

In both cases, supporting the SMPs is no harder than supporting the BMP.
It's only UTF-16 that makes the SMP seem hard.
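
A quick way to check this for yourself is to encode one character from each
region and count the bytes. This is only a sketch: the three sample
characters and the helper name encoded_sizes are my own choices for
illustration, and the percentage counts every one of the 65536 BMP code
points, surrogates included.

```python
# One sample character per region (chosen arbitrarily for illustration).
SAMPLES = {
    "ASCII": "A",                      # U+0041
    "BMP, non-ASCII": "\u20ac",        # U+20AC EURO SIGN
    "outside the BMP": "\U0001F600",   # U+1F600
}

def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    # The -le variants are used so that no byte order mark is counted.
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for label, ch in SAMPLES.items():
    print("%-16s U+%04X %r" % (label, ord(ch), encoded_sizes(ch)))

# Only U+0000..U+007F fit in a single UTF-8 byte, so the share of the
# BMP that needs two or three bytes is:
multi_byte_share = (0x10000 - 0x80) / 0x10000
print("%.1f%% of the BMP is multi-byte in UTF-8" % (100 * multi_byte_share))
```

Only the UTF-16 column changes width at the BMP boundary (2 bytes, 2 bytes,
then 4). UTF-8 is already variable-width inside the BMP, and UTF-32 is 4
bytes throughout.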

Conclusion: faulty implementations of UTF-16 which incorrectly handle
surrogate pairs should be replaced by non-faulty implementations, or
changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume
that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be
upgraded.
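
For what it's worth, handling surrogate pairs correctly is not much code.
The following is a minimal sketch of the standard UTF-16 arithmetic, not
anybody's production implementation; the function names are made up for
this example.

```python
import struct

def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    offset = cp - 0x10000              # 20 bits remain
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def utf16_units(s):
    """Return the 16-bit code units of s as a tuple of ints."""
    data = s.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)

def count_code_points(units):
    """Count code points, treating each valid surrogate pair as one."""
    count = i = 0
    while i < len(units):
        is_pair = (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        i += 2 if is_pair else 1
        count += 1
    return count

units = utf16_units("a\U0001F600")
print(len(units), count_code_points(units))
```

A faulty implementation reports the number of code units (3 here) as the
length of the string; a correct one reports 2 code points.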

Wrong conclusion: SMPs are unnecessary and unneeded, and we need a new
standard that is just like obsolete Unicode version 1.

Unicode version 1 is obsolete for a reason. 16 bits is not enough for even
existing languages, let alone all the code points and characters that are
used in human communication.


>> Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
>> do you keep talking about 7.0 as if it's a recent change?
> 
> It is 2015 as of now. 7.0 is the current standard.
> 
> The need for the adjective 'current' should be pondered upon.

What's your point?

The UTF encodings have not changed since they were first introduced. They
have been stable for at least twenty years: UTF-8 has existed since 1993,
and UTF-16 since 1996.

Since version 2.0 of Unicode in 1996, the standard has made "stability
guarantees" that no code points will be renamed or removed. Consequently,
there has only been one version which removed characters, version 1.1.
Since then, new versions of the standard have only added characters, never
moved, renamed or deleted them.

http://unicode.org/policies/stability_policy.html

Some highlights in Unicode history:

Unicode 1.0 (1991): initial version, defined 7161 code points.

In January 1993, Rob Pike and Ken Thompson announced the design and working
implementation of the UTF-8 encoding.

1.1 (1993): defined 34233 characters, finalised Han Unification. Removed
some characters from the 1.0 set. This is the first and only time any code
points have been removed.

2.0 (1996): First version to extend the code space beyond the BMP into the
supplementary planes, although no characters were assigned there until 3.1.
Defined 38950 code points. Introduced the UTF-16 encoding.

3.1 (2001): Defined 94205 code points, including 42711 additional Han
ideographs, bringing the total number of CJK code points alone to 71793,
too many to fit in 16 bits.

2006: The People's Republic of China mandates support for the GB-18030
character set for all software products sold in the PRC. GB-18030 supports
the entire Unicode range, including the SMPs. Since this date, all software
sold in China must support the SMPs.

6.0 (2010): The first emoji or emoticons were added to Unicode.

7.0 (2014): 113021 code points defined in total.


> In practice, standards change.
> However if a standard changes so frequently that users have to play
> catching cook and keep asking: "Which version?" they are justified in
> asking "Are the standard-makers doing due diligence?"

Since Unicode has stability guarantees, and the encodings have not changed
in twenty years and will not change in the future, this argument is bogus.
Updating to a new version of the standard means, to a first approximation,
merely allocating some new code points which had previously been undefined
but are now defined.

(Code points can be flagged deprecated, but they will never be removed.)
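
You can see this from Python. In the rough sketch below, the encoded bytes
depend only on the code point's number, not on whether the Unicode database
that ships with your interpreter has a name for it yet. The helper name
encode_all is made up for this example, and U+0378 is merely a code point
that I am assuming is still unassigned; a future version may allocate it,
at which point it will gain a name but keep the same encoded bytes.

```python
import unicodedata

ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")

def encode_all(cp):
    """Encode code point cp in each UTF; no character database is consulted."""
    ch = chr(cp)
    return {enc: ch.encode(enc) for enc in ENCODINGS}

print("Unicode database version:", unicodedata.unidata_version)

# U+0041 is assigned, U+0378 is assumed unassigned, and U+10FFFD is a
# private-use code point that never gets a name.
for cp in (0x41, 0x0378, 0x10FFFD):
    name = unicodedata.name(chr(cp), "<no name in this database version>")
    print("U+%04X %s %r" % (cp, name, encode_all(cp)))
```

The version number printed on the first line varies with the Python release,
but the encoded bytes do not.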



-- 
Steven



