[Patches] Unicode Character Name codec support

Bill Tutt billtut@microsoft.com
Fri, 12 May 2000 02:05:12 -0700


> From: M.-A. Lemburg [mailto:mal@lemburg.com]
> 
> 
> Bill Tutt wrote:
> > 
> > This is all based on AMK's perfect_hash work and MAL's unicode-escape
> > decoding, so hopefully a wet signature isn't necessary.
> > 
> > This unicode-named codec also handles normal unicode-escapes since codecs
> > are not easily and usefully stackable, and not stacking them is more
> > efficient in any event.
> > 
> > An alternative to this approach would be to stick the data in one .c file,
> > and move PyUnicode_DecodeNamedUnicodeEscape into the unicode-escape code.
> > 
> > Just as an informational matter, the hash table is 1.79 times bigger than
> > the # of unicode characters that have names.
> > 
> > Attached in the zip file are the requisite changes:
> > 
> > Files:
> > patch.txt:
> > Contains changes to the existing files in CVS for the following things:
> > Adds _ucn.c into the build gunk
> > Trivial patch to pcbuild.dsw not included on purpose since my Visual Studio
> > was making more changes than made me comfortable.
> > 
> > New files:
> > _ucn.c: Already generated file, should just drop into Modules\
> > _ucn.dsp: MSVC project file, drop into PCBuild, and insert into pcbuild.dsw,
> > and create a dependency on python16.
> > test_ucn.py: Drop in Lib\test
> > unicode_named.py: Codec file, should be dropped into Lib\encodings
> > 
> > The following files are provided for informational purposes and as a
> > mechanism to explain how this was generated:
> > Suggestions of how or if this should be included in Python's build process
> > are greatly appreciated.
> > 
> > perfect_hash.py: A tweaked copy of AMK's perfect_hash.py that sucks in
> > UnicodeData.txt and generates _ucn.c.
> > perfhash.c:      A helper module for perfect_hash.py. This just lets the
> > generated C code be more efficient than AMK's original code.
> > UnicodeData.txt: Input file for perfect_hash.py
> > 
> > Usage of perfect_hash.py:
> > perfect_hash.py UnicodeData.txt > _ucn.c
> 
> Great work, Bill :-)
> 
> Some questions:
> 
> I'm not sure about the copyright restrictions on
> the UnicodeData.txt file -- I think it's better to leave
> it out of the Python distribution (I noticed you didn't
> mention it under "New files", so perhaps you've already
> considered this).
> 

I don't know off the top of my head either. Anything not listed under New
Files isn't strictly necessary for the code to work; it was just used in
generating _ucn.c.

> The perfect_hash tools sure look interesting, BTW. Wouldn't
> they be a good candidate for the Tools/ subdir ? 

Probably. At the moment perfect_hash.py is only useful with UnicodeData.txt
as its input; that was just done to simplify my work and get _ucn.c
working. :)
The oddest thing about perfect_hash.py's UnicodeData-only way of thinking is
that Unicode Character Names are case insensitive, as the definitions of f1
and f2 will attest.
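
To illustrate the case-insensitivity point, here is a minimal sketch of a
string hash that folds the name to upper case before hashing. The function
name, the multiplier, and the seed parameter are all hypothetical; this is
not the actual definition of f1 or f2 in perfect_hash.py.

```python
def name_hash(name, seed, table_size):
    """Hypothetical case-insensitive hash; NOT the real f1/f2."""
    h = seed
    for ch in name.upper():          # fold case so lookups ignore it
        h = (h * 1000003) ^ ord(ch)  # simple multiply-and-xor mixing step
        h &= 0xFFFFFFFF              # keep the value in 32 bits
    return h % table_size
```

With a hash of this shape, "latin small letter a" and "LATIN SMALL LETTER A"
land in the same slot, so the generated table only needs one entry per name.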

I was simply amazed at how fast perfect_hash.py combined with Python's
string hashing function converged on a perfect hash table. (It did take a
nice long while running perfect_hash until I found the magic random number
seed that allowed a 1.79 multiple instead of a 1.9 multiple :) )
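
The seed hunt can be sketched roughly as follows. This assumes a graph-based
scheme in which two salted hash functions pick two vertices per key and a
table G is filled in so that (G[f1(key)] + G[f2(key)]) mod n gives the key's
index; a seed only works if the resulting graph has no cycles. Whether
perfect_hash.py works exactly this way is an assumption, and every name below
(salted_hash, try_seed, find_seed, lookup) is made up for the illustration.

```python
import random

def salted_hash(key, salt, n_vertices):
    # Hypothetical stand-in for f1/f2: a salted, case-folded weighted sum.
    total = 0
    for i, ch in enumerate(key.upper()):
        total += salt[i % len(salt)] * ord(ch)
    return total % n_vertices

def try_seed(keys, seed, multiple):
    """Return (G, salt1, salt2, n_vertices) or None if this seed fails."""
    n_vertices = int(len(keys) * multiple) + 1
    rng = random.Random(seed)
    salt1 = [rng.randrange(1, n_vertices) for _ in range(32)]
    salt2 = [rng.randrange(1, n_vertices) for _ in range(32)]
    adjacency = [[] for _ in range(n_vertices)]
    for index, key in enumerate(keys):
        a = salted_hash(key, salt1, n_vertices)
        b = salted_hash(key, salt2, n_vertices)
        if a == b:
            return None                      # self-loop: seed is unusable
        adjacency[a].append((b, index))
        adjacency[b].append((a, index))
    G = [None] * n_vertices
    for root in range(n_vertices):
        if G[root] is not None:
            continue
        G[root] = 0
        stack = [(root, -1)]                 # (vertex, edge used to reach it)
        while stack:
            v, via = stack.pop()
            for w, index in adjacency[v]:
                if index == via:
                    continue                 # do not walk back along the same edge
                if G[w] is not None:
                    return None              # cycle: seed is unusable
                G[w] = (index - G[v]) % len(keys)
                stack.append((w, index))
    return G, salt1, salt2, n_vertices

def find_seed(keys, multiple, max_tries=10000):
    # Smaller multiples give a smaller table but need more tries.
    for seed in range(max_tries):
        result = try_seed(keys, seed, multiple)
        if result is not None:
            return seed, result
    raise RuntimeError("no seed found; try a larger multiple")

def lookup(key, result, n_keys):
    G, salt1, salt2, n_vertices = result
    a = salted_hash(key, salt1, n_vertices)
    b = salted_hash(key, salt2, n_vertices)
    return (G[a] + G[b]) % n_keys
```

In this sketch the "multiple" is the ratio of graph vertices to keys, which is
one plausible reading of the 1.79 vs. 1.9 figures: the tighter the ratio, the
more seeds have to be tried before an acyclic graph turns up.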

> I guess they would have to be tweaked a little to allow using
> them without having to modify the internals like you did. The
> Asian codecs could probably make some good use of these
> utilities too.
> 

Yes, I'm sure they would. Feel free to take what I did with perfect_hash.py
and run with it. :)

> Would the perfhash.c module be usable for all hash modules
> generated by perfect_hash.py ?
> 

Yes. perfhash.c's sole purpose in life is to calculate x's initial value in
f1 and f2. It's applicable to any incoming dataset.
 
> The tables generated by perfect_hash.py could be too
> large for some compilers (also it would probably be
> a good idea providing the array size -- another source
> of compiler warnings). The unicodedatabase module
> had the same problem and I solved it by breaking the
> tables into pages which are accessed through a small
> utility function (see Modules/unicodedata*.c).
> 

Quite possibly, although IIRC the arrays in _ucn.c are quite a bit smaller
than those in the unicodedatabase module, so I'm reluctant to do that until
someone actually complains. For the generic version of perfect_hash.py you
were referring to, it'd be preferable to know what the acceptable sizes of
the arrays actually are.
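
For reference, the paging idea from the quoted text could look something like
the sketch below if a generator script emitted the tables. It splits one big
table into fixed-size pages, writes each page as a C array with an explicit
size, and adds a small lookup macro. The page size, the function names, and
the emitted C layout are all assumptions for illustration; this is not how
Modules/unicodedata*.c is actually laid out.

```python
PAGE_SIZE = 4096  # hypothetical; the real limit depends on the compiler

def split_into_pages(values, page_size=PAGE_SIZE):
    # Chop one flat table into consecutive fixed-size pages.
    return [values[i:i + page_size] for i in range(0, len(values), page_size)]

def paged_lookup(pages, index, page_size=PAGE_SIZE):
    # Same arithmetic the generated C macro performs.
    return pages[index // page_size][index % page_size]

def emit_c_pages(name, values, page_size=PAGE_SIZE):
    """Return C source text for a non-empty table split into sized pages."""
    pages = split_into_pages(values, page_size)
    lines = []
    for number, page in enumerate(pages):
        lines.append("static const int %s_page%d[%d] = {"
                     % (name, number, len(page)))
        lines.append("    " + ", ".join(str(v) for v in page))
        lines.append("};")
    lines.append("static const int *const %s_pages[%d] = {"
                 % (name, len(pages)))
    lines.append("    " + ", ".join("%s_page%d" % (name, n)
                                    for n in range(len(pages))))
    lines.append("};")
    lines.append("#define %s_LOOKUP(i) (%s_pages[(i) / %d][(i) %% %d])"
                 % (name.upper(), name, page_size, page_size))
    return "\n".join(lines)
```

Giving every array an explicit size also addresses the compiler-warning
concern, and the page size becomes a single knob to turn once the acceptable
array sizes are known.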

I put the code at the beginning of _ucn.c so that it sits at line #s < 64k,
where MSVC will still generate debugging information for it.
(MSVC stops emitting debug info after line 65,536. :( )

Any thoughts on replacing the unicode-escape stuff with this?

Bill