[Patches] Unicode Character Name codec support

M.-A. Lemburg mal@lemburg.com
Fri, 12 May 2000 12:49:39 +0200


Bill Tutt wrote:
> 
> > I'm not sure about the copyright restrictions on
> > the UnicodeData.txt file -- I think it's better to leave
> > it out of the Python distribution (I noticed you didn't
> > mention it under "New files", so perhaps you've already
> > considered this).
> >
> 
> I don't know off the top of my head either. Anything not listed under New
> Files isn't strictly necessary for the code to work; it was just used in
> generating _ucn.c.

Ok. Perhaps putting a URL to the file into _ucn.c or the
generator script would do.
 
> > The perfect_hash tools sure look interesting, BTW. Wouldn't
> > they be a good candidate for the Tools/ subdir ?
> 
> Probably. At the moment perfect_hash.py is only useful with UnicodeData.txt
> as its input. That was just done to simplify my work and get _ucn.c
> working. :)
> The oddest thing about perfect_hash.py's UnicodeData-only way of thinking is
> that Unicode Character Names are case insensitive, as the definitions of f1
> and f2 will attest.
> 
> I was simply amazed at how fast perfect_hash.py combined with Python's
> string hashing function converged on a perfect hash table. (It did take a
> nice long while running perfect_hash until I found the magic random number
> seed that allowed a 1.79 multiple instead of a 1.9 multiple :) )

Looks like the algorithm used does a good job :-)
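
For readers who have not seen this kind of generator, here is a minimal
sketch of how a two-function perfect hash can be found by retrying random
salts until one works. It is not the code from the patch: the graph-based
assignment step, the salted hash, and the helper names (make_hash, assign_g,
build, lookup) are assumptions made for illustration. Only the names f1 and
f2, the case-insensitive treatment of names, and the idea of a table whose
size is a multiple of the number of keys come from the thread.

```python
import random


def make_hash(rng, size, max_len):
    """Return a salted, case-insensitive hash function into range(size)."""
    salt = [rng.randrange(1, size) for _ in range(max_len)]

    def f(key):
        # Upper-casing makes "euro sign" and "EURO SIGN" hash alike.
        return sum(s * ord(c) for s, c in zip(salt, key.upper())) % size

    return f


def assign_g(edges, size):
    """Fill g so (g[a] + g[b]) % size == index for every edge, else None."""
    adj = [[] for _ in range(size)]
    for a, b, index in edges:
        adj[a].append((b, index))
        adj[b].append((a, index))
    g = [None] * size
    for root in range(size):
        if g[root] is not None:
            continue
        g[root] = 0
        stack = [root]
        while stack:
            v = stack.pop()
            for w, index in adj[v]:
                if g[w] is None:
                    g[w] = (index - g[v]) % size
                    stack.append(w)
                elif (g[v] + g[w]) % size != index:
                    return None  # conflicting constraint: these salts fail
    return g


def build(keys, multiple=2.5, seed=0, max_tries=1000):
    """Try random salts until every key maps to its own index."""
    rng = random.Random(seed)
    size = int(len(keys) * multiple) + 1  # table size as a multiple of the key count
    max_len = max(len(k) for k in keys)
    for _ in range(max_tries):
        f1 = make_hash(rng, size, max_len)
        f2 = make_hash(rng, size, max_len)
        edges = [(f1(k), f2(k), i) for i, k in enumerate(keys)]
        g = assign_g(edges, size)
        if g is not None:
            return f1, f2, g
    raise RuntimeError("no perfect hash found; try another seed or multiple")


def lookup(key, keys, f1, f2, g):
    """Return the index of key in keys, or None if it is not a known key."""
    index = (g[f1(key)] + g[f2(key)]) % len(g)
    # The hash is only perfect for known keys, so confirm the match.
    if index < len(keys) and keys[index].upper() == key.upper():
        return index
    return None
```

A smaller multiple gives a smaller table but makes a working set of salts
harder to find, which is the trade-off behind hunting for a seed that
allows 1.79 instead of 1.9. The keys must be unique after upper-casing.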
 
> > I guess they would have to be tweaked a little to allow using
> > them without having to modify the internals like you did. The
> > Asian codecs could probably make some good use of these
> > utilities too.
> >
> 
> Yes, I'm sure it would. Feel free to take what I did with perfect_hash.py
> and run with it. :)
>
> > Would the perfhash.c module be usable for all hash modules
> > generated by perfect_hash.py ?
> >
> 
> Yes. perfhash.c's sole purpose in life is to calculate x's initial value in
> f1 and f2. It's applicable to any incoming dataset.

So those two modules would make a great tool set... I wish I
had more time to look into these.
 
> > The tables generated by perfect_hash.py could be too
> > large for some compilers (also it would probably be
> > a good idea providing the array size -- another source
> > of compiler warnings). The unicodedatabase module
> > had the same problem and I solved it by breaking the
> > tables into pages which are accessed through a small
> > utility function (see Modules/unicodedata*.c).
> >
> 
> Quite possibly, although IIRC the arrays that _ucn.c has are quite a bit
> smaller than those in the unicodedatabase module, so I'm reluctant to do
> that until someone actually complains. For the generic version of
> perfect_hash.py you were referring to, it'd be preferable to know what the
> acceptable sizes of the arrays actually are.

Ok. I did it the same way ;-)
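
The paging idea mentioned above can be sketched as follows. This is an
illustration only, not the code in Modules/unicodedata*.c: the page size,
the function names (paginate, get_entry, emit_c_page) and the C element
type are all assumptions.

```python
PAGE_SIZE = 4096  # hypothetical limit; pick whatever the compiler tolerates


def paginate(table, page_size=PAGE_SIZE):
    """Split one large table into a list of pages of at most page_size."""
    return [table[i:i + page_size] for i in range(0, len(table), page_size)]


def get_entry(pages, index, page_size=PAGE_SIZE):
    """Small accessor that hides the paging from callers."""
    page, offset = divmod(index, page_size)
    return pages[page][offset]


def emit_c_page(name, page):
    """Render one page as a C array with an explicit size."""
    body = ", ".join(str(v) for v in page)
    return "static const unsigned short %s[%d] = { %s };" % (
        name, len(page), body)
```

A generator would call emit_c_page once per page and emit a C accessor
that performs the same divide-and-index step as get_entry.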
 
> I put the code at the beginning of _ucn.c so that MSVC would find the code
> at line #s < 64k so that it would generate debugging information for the
> code.
> (MSVC stops emitting debug info after line 65,536. :( )
> 
> Any thoughts on replacing the unicode-escape stuff with this?

I'd rather not: unicode-escape is needed by the compiler
and that would mean having to link the hash table into the
interpreter.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/