[Patches] Unicode Character Name codec support

M.-A. Lemburg mal@lemburg.com
Fri, 12 May 2000 10:19:47 +0200


Bill Tutt wrote:
> 
> This is all based on AMK's perfect_hash work and MAL's unicode-escape
> decoding, so hopefully a wet signature isn't necessary.
> 
> This unicode-named codec also handles normal unicode-escapes since codecs
> are not easily and usefully stackable, and not stacking them is more
> efficient in any event.
> 
> An alternative to this approach would be to stick the data in one .c file,
> and move PyUnicode_DecodeNamedUnicodeEscape into the unicode-escape code.
> 
> Just as an informational matter, the hash table is 1.79 times bigger than
> the # of unicode characters that have names.
> 
> Attached in the zip file are the requisite changes:
> 
> Files:
> patch.txt:
> Contains changes to the existing files in CVS for the following things:
> Adds _ucn.c into the build gunk
> Trivial patch to pcbuild.dsw not included on purpose since my Visual Studio
> was making more changes than made me comfortable.
> 
> New files:
> _ucn.c: Already generated file, should just drop into Modules\
> _ucn.dsp: MSVC project file, drop into PCBuild, and insert into pcbuild.dsw,
> and create a dependency on python16.
> test_ucn.py: Drop in Lib\test
> unicode_named.py: Codec file, should be dropped into Lib\encodings
> 
> The following files are provided for informational purposes and as a
> mechanism to explain how this was generated:
> Suggestions of how or if this should be included in Python's build process
> are greatly appreciated.
> 
> perfect_hash.py: A tweaked copy of AMK's perfect_hash.py that sucks in
> UnicodeData.txt and generates _ucn.c.
> perfhash.c:      A helper module for perfect_hash.py. This just lets the
> generated C code be more efficient than AMK's original code.
> UnicodeData.txt: Input file for perfect_hash.py
> 
> Usage of perfect_hash.py:
> perfect_hash.py UnicodeData.txt > _ucn.c

Great work, Bill :-)

Some questions:

I'm not sure about the copyright restrictions on
the UnicodeData.txt file -- I think it's better to leave
it out of the Python distribution (I noticed you didn't
mention it under "New files", so perhaps you've already
considered this).

The perfect_hash tools sure look interesting, BTW. Wouldn't
they be a good candidate for the Tools/ subdir ? 
I guess they would have to be tweaked a little to allow using
them without having to modify the internals like you did. The
Asian codecs could probably make some good use of these
utilities too.
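
For anyone unfamiliar with the idea, here is a rough sketch of what
a perfect hash over a fixed set of names looks like. It is only an
illustration under my own assumptions, not what perfect_hash.py
actually does: the hash function, the brute-force seed search, the
table size and all function names are made up for the example.

```python
def seeded_hash(key, seed):
    """FNV-1a style string hash, parameterised by a seed (illustrative only)."""
    h = seed & 0xFFFFFFFF
    for ch in key:
        h = ((h ^ ord(ch)) * 16777619) & 0xFFFFFFFF
    return h


def find_perfect_seed(keys, size, max_seed=100000):
    """Try seeds until every key lands in its own slot of a table of `size`."""
    for seed in range(1, max_seed):
        slots = {seeded_hash(k, seed) % size for k in keys}
        if len(slots) == len(keys):  # no two keys collided
            return seed
    raise ValueError("no collision-free seed found; try a larger table")


def build_table(names, size):
    """Return (seed, table) where table[slot] is (name, code point) or None."""
    seed = find_perfect_seed(list(names), size)
    table = [None] * size
    for name, code in names.items():
        table[seeded_hash(name, seed) % size] = (name, code)
    return seed, table


def lookup(name, seed, table):
    """One hash, one comparison: return the code point or raise KeyError."""
    entry = table[seeded_hash(name, seed) % len(table)]
    if entry is None or entry[0] != name:
        raise KeyError(name)
    return entry[1]


# A handful of real character names; the table is deliberately larger
# than the number of keys so that a collision-free seed is easy to find.
NAMES = {
    "SPACE": 0x0020,
    "DIGIT ZERO": 0x0030,
    "LATIN SMALL LETTER A": 0x0061,
    "GREEK SMALL LETTER ALPHA": 0x03B1,
    "EURO SIGN": 0x20AC,
}
SEED, TABLE = build_table(NAMES, 2 * len(NAMES) + 1)
```

The point is that a lookup never has to probe or chain: it costs one
hash and one string comparison, paid for with a table that has some
empty slots.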

Would the perfhash.c module be usable for all hash modules
generated by perfect_hash.py ?

The tables generated by perfect_hash.py could be too
large for some compilers (also it would probably be
a good idea to provide the array size -- another source
of compiler warnings). The unicodedatabase module
had the same problem and I solved it by breaking the
tables into pages which are accessed through a small
utility function (see Modules/unicodedata*.c).
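
The paging idea can be sketched roughly as follows. This is not the
code from Modules/unicodedata*.c, just an illustration in Python; the
page size and the function names are assumptions chosen for the
example.

```python
PAGE_SIZE = 4  # hypothetical; deliberately tiny so the example is easy to follow


def split_into_pages(table, page_size=PAGE_SIZE):
    """Break one large flat table into a list of smaller page-sized tables."""
    return [table[i:i + page_size] for i in range(0, len(table), page_size)]


def paged_lookup(pages, index, page_size=PAGE_SIZE):
    """Small utility that hides the paging: map a flat index to (page, offset)."""
    page, offset = divmod(index, page_size)
    return pages[page][offset]


# Ten entries become three pages: two full ones and a shorter last one.
flat_table = list(range(100, 110))
pages = split_into_pages(flat_table)
```

Callers keep using the flat index and only go through the utility
function, so no single array ever has to be larger than one page.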

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/