Unicode 7

Rustom Mody rustompmody at gmail.com
Thu May 1 22:29:55 EDT 2014


On Friday, May 2, 2014 4:08:35 AM UTC+5:30, Terry Reedy wrote:
> On 5/1/2014 2:04 PM, Rustom Mody wrote:

> >>> Since it's Unicode-troll time, here's my contribution
> >>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

> I will not comment on the Unix-assumption part, but I think you go wrong 
> with this:  "Unicode is a Headache". The major headache is that unicode 
> and its very few encodings are not universally used. The headache is all 
> the non-unicode legacy encodings still being used. So you better title 
> this section 'Non-Unicode is a Headache'.

> The first sentence is this misleading tautology: "With ASCII, data is 
> ASCII whether its file, core, terminal, or network; ie "ABC" is 
> 65,66,67." Let me translate: "If all text is ASCII encoded, then text 
> data is ASCII, whether ..." But it was never the case that all text was 
> ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe 
> still uses the latter. Other mainframe makers used other encodings of 
> A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never 
> universal. You could have just as well said "With EBCDIC, data is 
> EBCDIC, whether ..."

> https://en.wikipedia.org/wiki/Ascii
> https://en.wikipedia.org/wiki/EBCDIC

> A crucial step in the spread of Ascii was its use for microcomputers, 
> including the IBM PC. The latter was considered a toy by the mainframe 
> guys. If they had known that PCs would partly take over the computing 
> world, they might have suggested or insisted that it use EBCDIC.

> "With unicode there are:
>      encodings"
> where 'encodings' is linked to
> https://en.wikipedia.org/wiki/Character_encodings_in_HTML

> If html 'always' used utf-8 (like xml), as has become common but not 
> universal, all of the problems with *non-unicode* character sets and 
> encodings would disappear. The pre-unicode declarations could then 
> disappear. More truthful: "without unicode there are 100s of encodings 
> and with unicode only 3 that we should worry about."

> "in-memory formats"

> These are not the concern of the using programmer as long as they do not 
> introduce bugs or limitations (as do all the languages stuck on UCS-2 
> and many using UTF-16, including old Python narrow builds). Using what 
> should generally be the universal transmission format, UTF-8, as the 
> internal format means either losing indexing and slicing, having those 
> operations slow from O(1) to O(len(string)), or adding an index table 
> that is not part of the unicode standard. Using UTF-32 avoids the above 
> but usually wastes space -- up to 75%.
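
To make the trade-off above concrete, here is a minimal Python sketch. The
helper names encoded_sizes and utf8_char_at are made up for illustration and
are not from any library. It compares byte counts across the three encodings
for pure-ASCII text, and shows why indexing into raw UTF-8 needs a scan from
the start.

```python
def encoded_sizes(text):
    """Return the byte length of text in three Unicode encodings."""
    # The -le variants are used so that no byte-order mark is prepended.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


def utf8_char_at(data, index):
    """Return the code point at position index in UTF-8 bytes (hypothetical helper).

    Code points have variable width, so the bytes are scanned from the
    start: O(len(data)) rather than O(1).
    """
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")


sizes = encoded_sizes("ABC" * 100)
wasted = 1 - sizes["utf-8"] / sizes["utf-32-le"]
print(sizes, "UTF-32 space wasted relative to UTF-8: {:.0%}".format(wasted))
print(utf8_char_at("a\u20acb".encode("utf-8"), 1))
```

For ASCII-only text UTF-32 spends four bytes where UTF-8 spends one, which is
where the "up to 75%" figure comes from.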

> "strange beasties like python's FSR"

> Have you really let yourself be poisoned by JMF's bizarre rants? The FSR 
> is an *internal optimization* that benefits most unicode operations that 
> people actually perform. It uses UTF-32 by default but adapts to the 
> strings users create by compressing the internal format. The compression 
> is trivial -- simply dropping leading null bytes common to all 
> characters -- so each character is still readable as is. The string 
> header records how many bytes are left.  Is the idea of algorithms that 
> adapt to inputs really strange to you?
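
The adaptive width described above can be observed from Python itself. This is
only a rough sketch: it assumes CPython 3.3 or later, and bytes_per_char is a
made-up helper, not a standard function. sys.getsizeof reports the size of the
whole object, so the difference between two lengths is used to cancel out the
header.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for a string made only of ch.

    Hypothetical helper: the fixed header cost cancels out in the
    difference between a string of 2*n and one of n characters.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),
    ("BMP", "\u20ac"),
    ("astral", "\U0001F600"),
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

Under that assumption the four widths should come out as 1, 1, 2 and 4 bytes.
Other implementations, or older narrow and wide builds, may report something
different.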

> Like good adaptive algorithms, the FSR is invisible to the user except 
> for reducing space or time or maybe both. Unicode operations are 
> otherwise the same as with previous wide builds. People who used to use 
> narrow-builds also benefit from bug elimination. The only 'headaches' 
> involved might have been those of the developers who optimized previous 
> wide builds.

> CPython has many other functions with special-case optimizations and 
> 'fast paths' for common, simple cases. For instance, (some? all?) number 
> operations are optimized for pairs of integers.  Do you call these 
> 'strange beasties'?

Here is an instance of someone who would like a certain optimization to be
dis-able-able

https://mail.python.org/pipermail/python-list/2014-February/667169.html

To the best of my knowledge it's nothing to do with unicode or with jmf.

Why, if optimizations are always desirable, do C compilers have
-O0, -O1, -O2, -O3 and zillions of more specific flags?

JFTR I have no issue with FSR.  What we have to hand to jmf - willingly
or otherwise - is that many more people have heard of FSR thanks to him. [I am one of them]

I don't even know whether jmf has a real
technical (as he calls it 'mathematical') issue or it's entirely political:

"Why should I pay more for a EURO sign than a $ sign?"

Well perhaps that is more related to the exchange rate than to python!
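
For what it's worth, the 'price' can be measured. A rough sketch, assuming
CPython 3.3 or later (the variable names are mine): append one character to an
otherwise ASCII string and compare what sys.getsizeof reports.

```python
import sys

base = "a" * 1000
with_dollar = base + "$"      # stays within ASCII
with_euro = base + "\u20ac"   # U+20AC EURO SIGN, outside Latin-1

# Both strings have the same length in characters; only the storage differs.
print(len(with_dollar), sys.getsizeof(with_dollar))
print(len(with_euro), sys.getsizeof(with_euro))
```

Under that assumption the second string should report a noticeably larger
size, since a single wide character widens the storage for the whole string.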


