[Python-Dev] How about braindead Unicode "compression"?

Tim Peters tim_one@email.msn.com
Sun, 24 Sep 2000 14:47:11 -0400


unicodedatabase.c has 64K lines of the form:

/* U+009a */ { 13, 0, 15, 0, 0 },

Each struct getting initialized there takes 8 bytes on most machines (4
unsigned chars + a char*).
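
In C terms, each record is something like the following (the struct and
field names here are my guesses, not necessarily what unicodedatabase.h
calls them; the only part that matters is the "4 unsigned chars + a
char*" layout):

    /*
     * Sketch of the record layout described above:  4 unsigned chars
     * plus a char*, i.e. 4 + 4 = 8 bytes on a typical 32-bit box.
     */
    typedef struct {
        unsigned char category;       /* guessed field name */
        unsigned char combining;      /* guessed field name */
        unsigned char bidirectional;  /* guessed field name */
        unsigned char mirrored;       /* guessed field name */
        const char *decomposition;    /* guessed field name */
    } _PyUnicode_DatabaseRecord;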

However, only 3,567 of those 64K structs are unique (54,919 of the 64K
are all 0's!).
So a braindead-easy mechanical "compression" scheme would simply be to
create one vector with the 3,567 unique structs, and replace the 64K record
constructors with 2-byte indices into that vector.  Data size goes down from

    64K * 8 bytes = 512KB

to

    3567 * 8 bytes + 64K * 2 bytes ~= 156KB

at once; the source-code transformation is easy to do via a Python program;
the compiler warnings on my platform (due to unicodedatabase.c's sheer size)
can go away; and one indirection is added to access (which remains utterly
uniform).
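
Spelled out, the transformed file would look roughly like this -- the
names, slot numbers, and sample entries are illustrative only, not the
generating script's actual output:

    /* The record struct sketched earlier, repeated so this fragment
       stands on its own. */
    typedef struct {
        unsigned char category, combining, bidirectional, mirrored;
        const char *decomposition;
    } _PyUnicode_DatabaseRecord;

    /* One copy of each distinct record -- 3,567 of them, ~28KB total.
       The two entries shown are illustrative. */
    static const _PyUnicode_DatabaseRecord _PyUnicode_Database_Records[3567] = {
        { 0, 0, 0, 0, 0 },    /* the all-zero record 54,919 code points share */
        { 13, 0, 15, 0, 0 },  /* e.g. the record for U+009A */
        /* ... the rest of the unique records ... */
    };

    /* One 2-byte index per code point, U+0000 .. U+FFFF (128KB);
       entry [0x009A] holds the slot of { 13, 0, 15, 0, 0 } above. */
    static const unsigned short _PyUnicode_Database_Index[65536] = {
        0 /* ... generated indices ... */
    };

    /* Lookup:  one new, perfectly uniform indirection. */
    #define _PyUnicode_GetRecord(ch) \
        (&_PyUnicode_Database_Records[_PyUnicode_Database_Index[(int)(ch)]])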

Previous objections to compression were, as far as I could tell, based on
fear of elaborate schemes that rendered the code unreadable and the access
code excruciating.  But if we can get more than a factor of 3 with little
work and one new uniform indirection, do people still object?

If nobody objects by the end of today, I intend to do it.