Unicode 7

Thu May 1 23:15:24 EDT 2014

On Thu, 01 May 2014 18:38:35 -0400, Terry Reedy wrote:

> "strange beasties like python's FSR"
> 
> Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
> is an *internal optimization* that benefits most unicode operations that
> people actually perform. It uses UTF-32 by default but adapts to the
> strings users create by compressing the internal format. The compression
> is trivial -- simply dropping leading null bytes common to all
> characters -- so each character is still readable as is.

For anyone who, like me, wasn't convinced that Unicode worked that way,
you can see for yourself that it does. You don't need Python 3.3; any
version of 3.x will work. In Python 2.7, it should work if you just
change the calls from "chr()" to "unichr()". The second loop skips the
surrogates (U+D800 to U+DFFF), which Python 3.4 and later refuse to
encode on their own:

py> for i in range(256):
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:3] == b'\0\0\0'  # three leading zero bytes
...     assert u[3:] == c.encode('latin-1')
...
py> for i in range(256, 0xFFFF+1):
...     if 0xD800 <= i <= 0xDFFF:
...         continue  # lone surrogates: not encodable in 3.4+
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:2] == b'\0\0'  # two leading zero bytes
...     assert u[2:] == c.encode('utf-16-be')
...
py> 
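
The same idea can be tried on whole strings rather than single
characters. What follows is only a rough sketch in Python 3 syntax:
compress() is a hypothetical helper written for this post, not anything
CPython exposes, and it assumes the width is picked from the largest
code point in the string, as Terry describes.

```python
def compress(s):
    """Hypothetical helper: drop the leading zero bytes shared by every character."""
    top = max(map(ord, s)) if s else 0
    # 1 byte covers up to U+00FF, 2 bytes up to U+FFFF, otherwise keep all 4.
    width = 1 if top < 0x100 else 2 if top < 0x10000 else 4
    raw = s.encode('utf-32-be')
    # Keep only the last `width` bytes of each 4-byte unit.
    kept = b''.join(raw[i + 4 - width:i + 4] for i in range(0, len(raw), 4))
    return width, kept

assert compress('spam') == (1, b'spam')
assert compress('caf\xe9') == (1, 'caf\xe9'.encode('latin-1'))
assert compress('\u20acuro') == (2, '\u20acuro'.encode('utf-16-be'))
assert compress('\U0001F40D') == (4, '\U0001F40D'.encode('utf-32-be'))
```

Note that a single wide character sets the width for the whole string,
which is why the last example keeps all four bytes.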

So Terry is correct: dropping the leading zero bytes and treating the
remainder as either Latin-1 (code points up to U+00FF) or UTF-16 (code
points up to U+FFFF) works fine, and potentially saves a lot of memory.
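
To get a feel for the saving, one can measure how much a string grows
per extra character. This is a sketch that assumes CPython 3.3 or
later, where sys.getsizeof() reports the real size of the string
object; bytes_per_char() is a name made up for this example, and other
implementations may well give different numbers.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Growth in object size per extra character; the fixed header cancels out."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

# ASCII, Latin-1, BMP and astral examples.
for ch in ('a', '\xe9', '\u20ac', '\U0001F40D'):
    print('U+%04X' % ord(ch), bytes_per_char(ch))
```

On a CPython with the FSR, the four lines should show 1, 1, 2 and 4
bytes per character respectively.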

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/