Unicode 7
Rustom Mody
rustompmody at gmail.com
Thu May 1 22:29:55 EDT 2014
On Friday, May 2, 2014 4:08:35 AM UTC+5:30, Terry Reedy wrote:
> On 5/1/2014 2:04 PM, Rustom Mody wrote:
> >>> Since it's Unicode-troll time, here's my contribution
> >>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
> I will not comment on the Unix-assumption part, but I think you go wrong
> with this: "Unicode is a Headache". The major headache is that unicode
> and its very few encodings are not universally used. The headache is all
> the non-unicode legacy encodings still being used. So you better title
> this section 'Non-Unicode is a Headache'.
> The first sentence is this misleading tautology: "With ASCII, data is
> ASCII whether its file, core, terminal, or network; ie "ABC" is
> 65,66,67." Let me translate: "If all text is ASCII encoded, then text
> data is ASCII, whether ..." But it was never the case that all text was
> ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe
> still uses the latter. Other mainframe makers used other encodings of
> A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never
> universal. You could have just as well said "With EBCDIC, data is
> EBCDIC, whether ..."
> https://en.wikipedia.org/wiki/Ascii
> https://en.wikipedia.org/wiki/EBCDIC
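
A minimal sketch of the ASCII vs. EBCDIC point above, assuming Python's
"cp037" codec (one common EBCDIC code page) is an acceptable stand-in
for "EBCDIC" in general. The variable names are illustrative only.

```python
# The same three characters under two different encodings.
text = "ABC"

# ASCII: 'A', 'B', 'C' map to 65, 66, 67.
ascii_bytes = list(text.encode("ascii"))

# cp037 is one EBCDIC code page shipped with Python (an assumption here:
# other EBCDIC variants exist and differ in some positions).
ebcdic_bytes = list(text.encode("cp037"))

print(ascii_bytes)
print(ebcdic_bytes)

# Round-tripping works only if the decoder uses the encoding the writer used.
assert bytes(ebcdic_bytes).decode("cp037") == text
```

So "ABC is 65,66,67" holds only once everyone has already agreed on ASCII.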
> A crucial step in the spread of ASCII was its use for microcomputers,
> including the IBM PC. The latter was considered a toy by the mainframe
> guys. If they had known that PCs would partly take over the computing
> world, they might have suggested or insisted that it use EBCDIC.
> "With unicode there are:
> encodings"
> where 'encodings' is linked to
> https://en.wikipedia.org/wiki/Character_encodings_in_HTML
> If HTML 'always' used UTF-8 (like XML), as has become common but not
> universal, all of the problems with *non-unicode* character sets and
> encodings would disappear. The pre-unicode declarations could then
> disappear. More truthful: "without unicode there are 100s of encodings
> and with unicode only 3 that we should worry about."
> "in-memory formats"
> These are not the concern of the using programmer as long as they do not
> introduce bugs or limitations (as do all the languages stuck on UCS-2
> and many using UTF-16, including old Python narrow builds). Using what
> should generally be the universal transmission format, UTF-8, as the
> internal format means either losing indexing and slicing, having those
> operations slow from O(1) to O(len(string)), or adding an index table
> that is not part of the unicode standard. Using UTF-32 avoids the above
> but usually wastes space -- up to 75%.
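
A rough sketch of the trade-off described above. The helper names
encoded_sizes and utf8_offset are made up for illustration; they are not
part of any library. The first shows how many bytes one character takes
in each encoding, the second shows why indexing into raw UTF-8 needs a
linear scan.

```python
def encoded_sizes(s):
    """Bytes needed to store s in each encoding (no byte-order mark)."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


def utf8_offset(data, index):
    """Byte offset of code point number `index` in UTF-8 `data`.

    Code points have variable width, so the only way to find the n-th one
    without an extra index table is to walk the bytes from the start.
    """
    count = 0
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            if count == index:
                return pos
            count += 1
    raise IndexError(index)


for ch in ("$", "\u20ac", "\U0001F600"):
    print(repr(ch), encoded_sizes(ch))

data = "$\u20acx".encode("utf-8")
print(utf8_offset(data, 2))
```

For pure-ASCII text, UTF-32 spends 4 bytes where 1 would do, which is
where the "up to 75%" figure comes from.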
> "strange beasties like python's FSR"
> Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
> is an *internal optimization* that benefits most unicode operations that
> people actually perform. It uses UTF-32 by default but adapts to the
> strings users create by compressing the internal format. The compression
> is trivial -- simply dropping leading null bytes common to all
> characters -- so each character is still readable as is. The string
> header records how many bytes are left. Is the idea of algorithms that
> adapt to inputs really strange to you?
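
One way to observe the adaptive storage described above, assuming
CPython 3.3 or later. sys.getsizeof reports an implementation detail,
so treat the numbers as an observation about one interpreter and not as
a language guarantee. The helper bytes_per_char is hypothetical.

```python
import sys


def bytes_per_char(ch, n=10):
    """Estimate storage per character by growing a string by one character.

    Assumption: CPython 3.3+, where each string uses one fixed width
    chosen from its widest character.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


for ch in ("a", "\u20ac", "\U0001F600"):
    print(repr(ch), bytes_per_char(ch))

# Whatever the internal width, indexing and len() work on code points.
s = "a\u20ac\U0001F600"
assert len(s) == 3
assert s[2] == "\U0001F600"
```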
> Like good adaptive algorithms, the FSR is invisible to the user except
> for reducing space or time or maybe both. Unicode operations are
> otherwise the same as with previous wide builds. People who used to use
> narrow-builds also benefit from bug elimination. The only 'headaches'
> involved might have been those of the developers who optimized previous
> wide builds.
> CPython has many other functions with special-case optimizations and
> 'fast paths' for common, simple cases. For instance, (some? all?) number
> operations are optimized for pairs of integers. Do you call these
> 'strange beasties'?
Here is an instance of someone who would like a certain optimization to be
dis-able-able
https://mail.python.org/pipermail/python-list/2014-February/667169.html
To the best of my knowledge it's nothing to do with unicode or with jmf.
Why, if optimizations are always desirable, do C compilers have:
-O0 -O1 -O2 -O3 and zillions of more specific flags?
JFTR I have no issue with FSR. What we have to hand to jmf - willingly
or otherwise - is that many more people have heard of FSR thanks to him. [I am one of them]
I don't even know whether jmf has a real
technical (as he calls it 'mathematical') issue or it's entirely political:
"Why should I pay more for a EURO sign than a $ sign?"
Well perhaps that is more related to the exchange rate than to python!