Unicode 7

Thu May 1 18:38:35 EDT 2014

On 5/1/2014 2:04 PM, Rustom Mody wrote:

>>> Since its Unicode-troll time, here's my contribution
>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

I will not comment on the Unix-assumption part, but I think you go wrong 
with this:  "Unicode is a Headache". The major headache is that unicode 
and its very few encodings are not universally used. The headache is all 
the non-unicode legacy encodings still being used. So you would do 
better to title this section 'Non-Unicode is a Headache'.

The first sentence is this misleading tautology: "With ASCII, data is 
ASCII whether its file, core, terminal, or network; ie "ABC" is 
65,66,67." Let me translate: "If all text is ASCII encoded, then text 
data is ASCII, whether ..." But it was never the case that all text was 
ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe 
still uses the latter. Other mainframe makers used other encodings of 
A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never 
universal. You could have just as well said "With EBCDIC, data is 
EBCDIC, whether ..."

https://en.wikipedia.org/wiki/Ascii
https://en.wikipedia.org/wiki/EBCDIC
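
To make the point concrete, here is a minimal sketch in Python 3. It 
assumes the 'cp037' codec (EBCDIC US/Canada) as a stand-in for "EBCDIC"; 
there are many other EBCDIC code pages, and the choice is mine, not the 
blog's.

```python
# The same three letters under two encodings. 'cp037' is Python's name
# for one EBCDIC code page (US/Canada); other EBCDIC variants exist.
text = "ABC"

ascii_bytes = list(text.encode("ascii"))
ebcdic_bytes = list(text.encode("cp037"))

print(ascii_bytes)   # [65, 66, 67]
print(ebcdic_bytes)  # [193, 194, 195]
```

So "ABC" is 65,66,67 only if one has already agreed on ASCII.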

A crucial step in the spread of ASCII was its use for microcomputers, 
including the IBM PC. The latter was considered a toy by the mainframe 
guys. If they had known that PCs would partly take over the computing 
world, they might have suggested or insisted that it use EBCDIC.

"With unicode there are:
     encodings"
where 'encodings' is linked to
https://en.wikipedia.org/wiki/Character_encodings_in_HTML

If html 'always' used utf-8 (like xml), as has become common but not 
universal, all of the problems with *non-unicode* character sets and 
encodings would disappear. The pre-unicode declarations could then 
disappear. More truthful: "without unicode there are 100s of encodings 
and with unicode only 3 that we should worry about."
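
A small sketch of the legacy-encoding headache, in Python 3. The three 
code pages are just examples I picked; any set of 8-bit encodings that 
disagree above 127 would do.

```python
# One byte, three legacy code pages, three different characters.
raw = b"\xe9"

decoded = {name: raw.decode(name) for name in ("latin-1", "cp1251", "cp437")}
for name, ch in decoded.items():
    print(name, repr(ch), hex(ord(ch)))

# With UTF-8 agreed on, U+00E9 has exactly one byte sequence.
utf8_bytes = "\u00e9".encode("utf-8")
print(utf8_bytes)  # b'\xc3\xa9'
```

Without a declaration, the receiver of that single byte cannot know 
which of the three characters was meant.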

"in-memory formats"

These are not the concern of the using programmer as long as they do not 
introduce bugs or limitations (as do all the languages stuck on UCS-2 
and many using UTF-16, including old Python narrow builds). Using what 
should generally be the universal transmission format, UTF-8, as the 
internal format means either losing indexing and slicing, having those 
operations slow from O(1) to O(len(string)), or adding an index table 
that is not part of the unicode standard. Using UTF-32 avoids the above 
but usually wastes space -- up to 75%.
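
The space and code-unit arithmetic can be checked with a short sketch 
(Python 3; the sample strings are arbitrary choices of mine):

```python
ascii_text = "hello world"   # 11 characters, all ASCII
astral = "\U0001F600"        # one character outside the BMP

utf8_size = len(ascii_text.encode("utf-8"))       # 1 byte per character here
utf32_size = len(ascii_text.encode("utf-32-le"))  # always 4 bytes per character
wasted = 1 - utf8_size / utf32_size
print(utf8_size, utf32_size, wasted)  # 11 44 0.75

# UTF-16 needs two 16-bit code units (a surrogate pair) for this one
# character, which is where indexing by code unit goes wrong.
utf16_units = len(astral.encode("utf-16-le")) // 2
print(utf16_units)  # 2
```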

"strange beasties like python's FSR"

Have you really let yourself be poisoned by JMF's bizarre rants? The FSR 
is an *internal optimization* that benefits most unicode operations that 
people actually perform. It uses UTF-32 by default but adapts to the 
strings users create by compressing the internal format. The compression 
is trivial -- simply dropping leading null bytes common to all 
characters -- so each character is still readable as is. The string 
header records how many bytes per character are left.  Is the idea of 
algorithms that 
adapt to inputs really strange to you?
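
One can watch the adaptation from Python itself. This is only a sketch: 
it assumes CPython 3.3 or later, and sys.getsizeof reports an 
implementation detail, not anything the language guarantees. The helper 
name bytes_per_char is mine.

```python
import sys

def bytes_per_char(ch, n=500):
    """Estimate storage per character by differencing two string sizes,
    so that the fixed header overhead cancels out."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

samples = {
    "ASCII 'a'": "a",
    "Latin-1 U+00E9": "\u00e9",
    "BMP U+20AC": "\u20ac",
    "astral U+1F600": "\U0001F600",
}
widths = {label: bytes_per_char(ch) for label, ch in samples.items()}
for label, width in widths.items():
    print(label, width)
```

On CPython with the FSR I would expect 1, 1, 2 and 4 bytes per 
character: the width is chosen by the largest code point in the string.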

Like good adaptive algorithms, the FSR is invisible to the user except 
for reducing space or time or maybe both. Unicode operations are 
otherwise the same as with previous wide builds. People who used to use 
narrow-builds also benefit from bug elimination. The only 'headaches' 
involved might have been those of the developers who optimized previous 
wide builds.
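
The bug elimination is easy to illustrate. A sketch, assuming Python 3.3 
or later; on an old narrow build the astral character below was stored 
as two surrogates, so len() gave 4 and indexing split the pair.

```python
s = "a\U0001F600b"   # 'a', one astral character, 'b'

print(len(s))                       # 3
print(s[1] == "\U0001F600")         # True
print(s[::-1] == "b\U0001F600a")    # True
```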

CPython has many other functions with special-case optimizations and 
'fast paths' for common, simple cases. For instance, (some? all?) number 
operations are optimized for pairs of integers.  Do you call these 
'strange beasties'?

PyPy is faster than CPython, when it is, because it is even more 
adaptable to particular computations by creating new fast paths. The 
mechanism to create these 'strange beasties' might have been a headache 
for the writers, but when it works, which it now seems to, it is not for 
the users.

-- 
Terry Jan Reedy