Python 3.2 has some deadly infection

rurpy at yahoo.com
Sun Jun 8 00:34:23 EDT 2014


On 06/05/2014 05:02 PM, Steven D'Aprano wrote:
>[...]
> But Linux Unicode support is much better than Windows. Unicode support in 
> Windows is crippled by continued reliance on legacy code pages, and by 
> the assumption deep inside the Windows APIs that Unicode means "16 bit 
> characters". See, for example, the amount of space spent on fixing 
> Windows Unicode handling here:
> 
> http://www.utf8everywhere.org/

While not disagreeing with the general premise of that page, it 
has some problems that raise doubts in my mind about taking everything 
the author says at face value.

For example

  "Q: Why would the Asians give up on UTF-16 encoding, which saves 
      them 50% the memory per character?"
  [...] in fact UTF-8 is used just as often in those [Asian] countries. 

That is not my experience, at least for Japan.  See my comments in 
  https://mail.python.org/pipermail/python-ideas/2012-June/015429.html
where I show that utf8 files are a tiny minority of the text files 
found by Google.

He then gives a table with the size of the utf8 and utf16 encoded contents
(i.e. stripped of html markup) of an unnamed Japanese wikipedia page to 
show that even without a lot of (html-mandated) ascii, the space savings 
are not very much compared to the theoretical "50%" savings he stated:

  "             Dense text (Δ UTF-8)
   UTF-8   ...     222 KB (0%)
   UTF-16  ...     176 KB (−21%)"

Note that he calculates the space saving as (utf8-utf16)/utf8.
Yet by that metric the theoretical saving is *NOT* 50%, it is 33%.
For example 1000 Japanese characters will use 2000 bytes in utf16
and 3000 in utf8.
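
The arithmetic is easy to check.  The following is only a sketch: the
sample string (one hiragana character repeated 1000 times) and the two
function names are made up for illustration, and "utf-16-le" is used so
that no byte-order mark is counted.

```python
# Hypothetical sample text: 1000 copies of HIRAGANA LETTER A (U+3042).
text = "\u3042" * 1000

utf8_len = len(text.encode("utf-8"))       # 3 bytes per character here
utf16_len = len(text.encode("utf-16-le"))  # 2 bytes per character, no BOM


def saving_vs_utf8(utf8_bytes, utf16_bytes):
    """Saving relative to the utf8 size: (utf8-utf16)/utf8."""
    return (utf8_bytes - utf16_bytes) / utf8_bytes


def saving_vs_utf16(utf8_bytes, utf16_bytes):
    """Saving relative to the utf16 size: (utf8-utf16)/utf16."""
    return (utf8_bytes - utf16_bytes) / utf16_bytes


print(utf8_len, utf16_len)                                  # 3000 2000
print(round(100 * saving_vs_utf8(utf8_len, utf16_len)))     # 33
print(round(100 * saving_vs_utf16(utf8_len, utf16_len)))    # 50
```

So the "50%" only appears when the difference is divided by the utf16
size; divided by the utf8 size, as in the table, the ceiling is 33%.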

I did the same test using
  http://ja.wikipedia.org/wiki/%E7%B9%94%E7%94%B0%E4%BF%A1%E9%95%B7
I stripped html tags, javascript and redundant ascii whitespace characters.
The stripped utf-8 file was 164946 bytes, the utf-16 encoded version of
the same text was 117756.  That gives (using the (utf8-utf16)/utf16 metric 
he used to claim 50% idealized savings) 40%, which is quite a bit closer to 
the idealized 50% than his 21%.
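
For comparison, here are both sets of figures under both metrics.  This
is just a sketch of the arithmetic using the sizes quoted above; the
function name is made up for illustration.  Since the 21% in his table is
relative to the utf8 size and the 40% above is relative to the utf16
size, it seems fairer to look at each pair on the same denominator.

```python
def savings(utf8_size, utf16_size):
    """Return the utf16 saving as a percentage under both metrics.

    First value:  (utf8-utf16)/utf8, the metric used in the table.
    Second value: (utf8-utf16)/utf16, the metric behind the "50%" claim.
    """
    diff = utf8_size - utf16_size
    return (round(100 * diff / utf8_size, 1),
            round(100 * diff / utf16_size, 1))


# Sizes from the table on utf8everywhere.org (in KB).
print(savings(222, 176))        # (20.7, 26.1)

# Sizes of the stripped Japanese wikipedia page measured above (in bytes).
print(savings(164946, 117756))  # (28.6, 40.1)
```

On either metric the wikipedia page measured here shows a noticeably
larger saving than his table does: about 29% against 21% relative to
utf8, and about 40% against 26% relative to utf16.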

I would have more faith in his opinions about things I don't know
about (such as unicode programming on Windows) if his other info
were more trustworthy.  IOW, just because it's on the internet doesn't 
mean it's true.
