[Python-Dev] Unicode character names

Christian Tismer tismer@tismer.com
Fri, 24 Mar 2000 14:13:02 +0100


"M.-A. Lemburg" wrote:
> 
> "Andrew M. Kuchling" wrote:
> >
> > Paul Prescod writes:
> > >The new \N escape interpolates named characters within strings. For
> > >example, "Hi! \N{WHITE SMILING FACE}" evaluates to a string with a
> > >unicode smiley face at the end.
> >
> > Cute idea, and it certainly means you can avoid looking up Unicode
> > numbers.  (You can look up names instead. :) )  Note that this means the
> > Unicode database is no longer optional if this is done; it has to be
> > around at code-parsing time.  Python could import it automatically, as
> > exceptions.py is imported.  Christian's work on compressing
> > unicodedatabase.c is therefore really important.  (Is Perl5.6 actually
> > dragging around the Unicode database in the binary, or is it read out
> > of some external file or data structure?)
> 
> Sorry to disappoint you guys, but the Unicode name and comments
> are *not* included in the unicodedatabase.c file Christian
> is currently working on. The reason is simple: it would add
> huge amounts of string data to the file. So this is a no-no
> for the core distribution...

This is not settled yet; it is still an open question.
What I have so far for the non-textual data:
- 25 kb with dumb compression
- 15 kb with enhanced compression

What amounts of data am I talking about?
- The whole Unicode database text file has a size of 632 kb.
- With PkZip this goes down to 96 kb.

Now, I produced another text file with just the currently
used data in it, and it looks like this:
- The stripped Unicode text file has a size of 216 kb.
- PkZip melts this down to 40 kb.

Please compare that to my results above: I can do at least
twice as well. I hope I can compete on the text sections
as well (since text is something zip is *good* at),
but just let me try.
Let's target 60 kb for the whole crap, and I'd be very pleased.
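
To put the figures above side by side, here is a small piece of
arithmetic. It is only a sketch: the sizes are the ones quoted in this
message, the helper name ratio() is made up for illustration, and it
assumes that the 25 kb / 15 kb results cover the same data that PkZip
packs into 40 kb.

```python
# Sizes in kb, as quoted in this message.
full_text, full_zip = 632, 96          # whole database: raw vs. PkZip
stripped_text, stripped_zip = 216, 40  # currently used data: raw vs. PkZip
dumb, enhanced = 25, 15                # custom packing of the non-textual data


def ratio(packed, raw):
    """Return the packed size as a percentage of the raw size (hypothetical helper)."""
    return 100.0 * packed / raw


print("PkZip, full file:     %.1f%% of original" % ratio(full_zip, full_text))
print("PkZip, stripped file: %.1f%% of original" % ratio(stripped_zip, stripped_text))
print("dumb packing:         %.2fx smaller than PkZip" % (stripped_zip / dumb))
print("enhanced packing:     %.2fx smaller than PkZip" % (stripped_zip / enhanced))
```

By this count the "twice as well" figure applies to the enhanced
packing; the dumb one comes out somewhat below that.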

Then there is still the question of where to put the data.
Having one file in the dll and another one externally would
be an option. I could also imagine using a binary external
file all the time, with the maximum possible compression.
On loading, this structure would be partially expanded
to make lookups fast.
An advantage is that the compressed Unicode database
could become a stand-alone product. The size is in fact
so crazily small that I'd like to make this available
to any other language.
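
The "compressed external file, partially expanded on load" idea could
look roughly like the sketch below. Everything in it is an assumption
made for illustration: the record layout, the names pack_database() and
load_database(), the tiny category list, and the use of zlib as a
stand-in for whatever compression scheme ends up being used.

```python
import struct
import zlib

# Hypothetical record layout: code point (unsigned 32-bit) + category index (1 byte).
RECORD = struct.Struct("<IB")
CATEGORIES = ["Lu", "Ll", "Nd"]  # tiny stand-in for the real category list


def pack_database(records):
    """Pack (code point, category) pairs and compress them into one blob."""
    raw = b"".join(RECORD.pack(cp, CATEGORIES.index(cat)) for cp, cat in records)
    return zlib.compress(raw, 9)


def load_database(blob):
    """Decompress the blob once and expand it into a dict for fast lookups."""
    raw = zlib.decompress(blob)
    return {cp: CATEGORIES[idx] for cp, idx in RECORD.iter_unpack(raw)}


# In real use the blob would be read from the external file.
blob = pack_database([(0x41, "Lu"), (0x61, "Ll"), (0x31, "Nd")])
table = load_database(blob)
print(table[0x41])
```

The point of the split is that the file on disk stays small, while the
cost of expansion is paid once at load time rather than on every lookup.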

> Still, the above is easily possible by inventing a new
> encoding, say unicode-with-smileys, which then reads in
> a file containing the Unicode names and applies the necessary
> magic to decode/encode data as Paul described above.

That sounds reasonable. Compression makes sense here as well,
since the expanded data takes up quite a few kb relative
to what it is "worth" when compared to, say, the Python dll.
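
A minimal sketch of the decoding step quoted above, assuming the names
file has already been read into a table. The NAMES dict is a two-entry
stand-in for that file, and expand_names() is a made-up name, not an
existing codec.

```python
import re

# Hypothetical stand-in for the external names file: character name -> code point.
NAMES = {
    "WHITE SMILING FACE": 0x263A,
    "LATIN SMALL LETTER A": 0x0061,
}

# Matches a backslash, "N", and a name in braces; group 1 is the name.
_ESCAPE = re.compile(r"\\N\{([^}]+)\}")


def expand_names(text, names=NAMES):
    """Replace each named-character escape with the character it names."""
    def replace(match):
        name = match.group(1)
        if name not in names:
            raise KeyError("unknown character name: %r" % name)
        return chr(names[name])
    return _ESCAPE.sub(replace, text)


# The raw string keeps the escape as literal text for the function to expand.
print(expand_names(r"Hi! \N{WHITE SMILING FACE}"))
```

Since the names are only needed by this step, the table can live outside
the core and be loaded on first use.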

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home