[Python-Dev] Re: Unicode character names

M.-A. Lemburg mal@lemburg.com
Fri, 24 Mar 2000 09:52:36 +0100


Bill Tutt wrote:
> 
> MAL wrote:
> 
> >"Andrew M. Kuchling" wrote:
> >>
> >> Paul Prescod writes:
> >>>The new \N escape interpolates named characters within strings. For
> >>>example, "Hi! \N{WHITE SMILING FACE}" evaluates to a string with a
> >>>unicode smiley face at the end.
> >>
> >> Cute idea, and it certainly means you can avoid looking up Unicode
> >> numbers.  (You can look up names instead. :) )  Note that this means the
> >> Unicode database is no longer optional if this is done; it has to be
> >> around at code-parsing time.  Python could import it automatically, as
> >> exceptions.py is imported.  Christian's work on compressing
> >> unicodedatabase.c is therefore really important.  (Is Perl5.6 actually
> >> dragging around the Unicode database in the binary, or is it read out
> >> of some external file or data structure?)
> >
> > Sorry to disappoint you guys, but the Unicode names and comments
> > are *not* included in the unicodedatabase.c file Christian
> > is currently working on. The reason is simple: it would add
> > huge amounts of string data to the file. So this is a no-no
> > for the core distribution...
> >
> 
> Ok, now you're just being silly. It's possible to put the character names in
> a separate structure so that they don't automatically get paged in with the
> normal unicode character property data. If you never use it, it won't get
> paged in, it's that simple...

Sure, but it would still cause the interpreter binary or DLL
to increase in size considerably... that caused some major
noise a few days ago due to the fact that the unicodedata module
adds some 600kB to the interpreter -- even though it would
only get swapped in when needed (the interpreter itself doesn't
use it).
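
A minimal sketch of the "only pay for it when you use it" idea, written against a present-day Python in which the name table lives in the `unicodedata` extension module (an assumption; no such lookup API existed when this message was written). The helper name `char_from_name` is hypothetical. The import is deferred into the function, so the name data is loaded on the first lookup and not at interpreter startup:

```python
def char_from_name(name):
    """Return the character with the given Unicode name.

    The import is deferred so the name table is loaded on the
    first call instead of at interpreter startup.
    """
    import unicodedata  # loaded on demand, cached in sys.modules afterwards
    return unicodedata.lookup(name)


smiley = char_from_name("WHITE SMILING FACE")
print(hex(ord(smiley)))  # 0x263a
```

This defers loading only; it does not shrink what has to be shipped, which is the size concern raised above.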
 
> Looking up the Unicode code value from the Unicode character name smells
> like a good time to use gperf to generate a perfect hash function for the
> character names. Esp. for the Unicode 3.0 character namespace. Then you can
> just store the hashkey -> Unicode character mapping, and hardly ever need to
> page in the actual full character name string itself.

Great idea, but why not put this into a separate codec module?
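
The hash-key-to-character mapping suggested above could look roughly like the sketch below. It is not gperf output: it brute-forces a seed for a SHA-256-based hash until a small, illustrative set of names maps to distinct slots, then stores only the code points. All function names and the four-entry table are assumptions made for the example.

```python
import hashlib

# Illustrative subset; a real table would cover every named character.
NAMES = {
    "WHITE SMILING FACE": 0x263A,
    "BLACK HEART SUIT": 0x2665,
    "EURO SIGN": 0x20AC,
    "GREEK SMALL LETTER ALPHA": 0x03B1,
}


def slot_for(name, seed, size):
    """Hash a name to a slot index; the seed selects the hash function."""
    digest = hashlib.sha256(f"{seed}:{name}".encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % size


def build_table(names, max_seed=100_000):
    """Try seeds until every name lands in its own slot (a perfect hash)."""
    size = len(names)
    for seed in range(max_seed):
        slots = [None] * size
        for name, code_point in names.items():
            idx = slot_for(name, seed, size)
            if slots[idx] is not None:
                break  # collision: try the next seed
            slots[idx] = code_point
        else:
            return seed, slots
    raise ValueError("no collision-free seed found")


SEED, SLOTS = build_table(NAMES)


def code_point_for(name):
    """Look up a code point without consulting any stored name string."""
    return SLOTS[slot_for(name, SEED, len(SLOTS))]


print(hex(code_point_for("WHITE SMILING FACE")))  # 0x263a
```

Because no names are stored, a misspelled name still lands in some slot and returns a wrong character. Confirming a hit needs the name string itself or a checksum of it, which is the one case where the full name data would have to be touched.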
 
> I haven't looked at what the comment field contains, so I have no idea how
> useful that info is.

Probably not worth looking at...
 
> *waits while gperf crunches through the ~10,550 Unicode characters where
> this would be useful*

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/