Unicode 7
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu May 1 23:15:24 EDT 2014
On Thu, 01 May 2014 18:38:35 -0400, Terry Reedy wrote:
> "strange beasties like python's FSR"
>
> Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
> is an *internal optimization* that benefits most unicode operations that
> people actually perform. It uses UTF-32 by default but adapts to the
> strings users create by compressing the internal format. The compression
> is trivial -- simply dropping leading null bytes common to all
> characters -- so each character is still readable as is.
If, like me, you weren't convinced that Unicode worked that way, you can
see for yourself that it does. You don't need Python 3.3; any version of
3.x will work. In Python 2.7, it should work if you just change the calls
from "chr()" to "unichr()":
py> for i in range(256):
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:3] == b'\0\0\0'
...     assert u[3:] == c.encode('latin-1')
...
py> for i in range(256, 0xFFFF+1):
...     if 0xD800 <= i <= 0xDFFF:
...         continue  # lone surrogates can't be encoded on Python 3.4+
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:2] == b'\0\0'
...     assert u[2:] == c.encode('utf-16-be')
...
py>
So Terry is correct: for code points up to U+FFFF, dropping the leading
zero bytes and treating the remainder as either Latin-1 or UTF-16 works
fine, and potentially saves a lot of memory.
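
To make the "drop the leading null bytes common to all characters" idea
concrete for a whole string, here is a rough sketch. The helper name
drop_leading_nulls() is my own invention for illustration, and this is
not how CPython actually implements the FSR; it only mimics the effect
by picking 1, 2 or 4 bytes per code point based on the highest code
point in the string. It assumes Python 3.4 or later, since it uses the
'surrogatepass' error handler with the UTF-32 codec.

```python
def drop_leading_nulls(s):
    """Return (width, data): s stored using `width` bytes per code point.

    Hypothetical helper for illustration only, not CPython's real code.
    """
    highest = max(map(ord, s)) if s else 0
    # Narrowest width that can hold every code point in the string.
    width = 1 if highest < 0x100 else 2 if highest < 0x10000 else 4
    # 'surrogatepass' lets lone surrogates through (Python 3.4+).
    full = s.encode('utf-32-be', 'surrogatepass')
    # Keep only the last `width` bytes of each 4-byte UTF-32 unit.
    data = b''.join(full[i + 4 - width:i + 4]
                    for i in range(0, len(full), 4))
    return width, data


for text in ('spam', 'sp\u0101m', 'sp\U0001F40Dm'):
    width, data = drop_leading_nulls(text)
    print(ascii(text), width, len(data))
```

Each string has four code points, so the stored lengths come out as 4,
8 and 16 bytes respectively: a single wide character forces the whole
string to the wider format, but an all-Latin-1 string needs only a
quarter of the UTF-32 size.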
--
Steven D'Aprano
http://import-that.dreamwidth.org/