[Python-Dev] Unicode charnames impl.

Christian Tismer tismer@tismer.com
Sat, 25 Mar 2000 14:35:50 +0100


"Andrew M. Kuchling" wrote:
...
> 3) How can we store all those names?  The resulting dictionary makes a
> 361K .py file; Python dumps core trying to parse it.  (Another bug...)

This is simply not the place to use a dictionary.
You don't need fast lookup from names to codes,
but something that supports incremental search.
This would enable PythonWin to show a pop-up list after
you typed the first letters.
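
A minimal sketch of that kind of incremental search, assuming the names
are simply kept in a sorted Python list. The sample NAMES and the
complete() helper are made up for illustration and are not part of any
existing module:

```python
import bisect

# Hypothetical sample; a real table would hold every character name.
NAMES = sorted([
    "LATIN CAPITAL LETTER A",
    "LATIN CAPITAL LETTER B",
    "LATIN SMALL LETTER A",
    "GREEK SMALL LETTER ALPHA",
    "GREEK SMALL LETTER BETA",
    "DIGIT ONE",
])

def complete(prefix, names=NAMES):
    """Return every name that starts with prefix, found by two bisections."""
    lo = bisect.bisect_left(names, prefix)
    # "\uffff" sorts after every character that occurs in a name, so
    # prefix + "\uffff" is an upper bound for all names with this prefix.
    hi = bisect.bisect_left(names, prefix + "\uffff")
    return names[lo:hi]
```

An editor could call complete() again after each keystroke; every call
costs two binary searches over the sorted list rather than a scan.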

I'm working on a common substring analysis that turns
each entry into 3 to 5 small integers.
These are then encoded in an order-preserving way. That means
the resulting code table is still lexically ordered, and
the names can be looked up via bisection.
It will take me some more time to get that working, but it will not
be larger than 60k, or I drop it.
Also note that all the names use uppercase letters and space
only. That is an opportunity to use simple context encoding and
get by with just 4 bits per character most of the time.
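
To illustrate what an order-preserving encoding plus bisection could
look like, here is a rough sketch. It is not the substring analysis
described above: as a simpler stand-in it splits names into whole words
and numbers the words in sorted order. It assumes names are separated
by single spaces and that space sorts before every other character
used in a name. The helpers build(), lookup() and decode() and the
SAMPLE list are hypothetical.

```python
import bisect

# Hypothetical sample; a real table would be built from the full name list.
SAMPLE = [
    "LATIN SMALL LETTER A",
    "LATIN CAPITAL LETTER A",
    "GREEK SMALL LETTER ALPHA",
    "LATIN CAPITAL LETTER AE",
    "GREEK CAPITAL LETTER ALPHA",
]

def build(names):
    """Encode each name as a tuple of small integers (word indices)."""
    # Numbering the words in sorted order makes tuple comparison agree
    # with string comparison of the original names.
    words = sorted({w for name in names for w in name.split(" ")})
    index = {w: i for i, w in enumerate(words)}
    table = sorted(tuple(index[w] for w in name.split(" ")) for name in names)
    return words, index, table

def lookup(name, index, table):
    """Return the position of name in the encoded table, or -1 if absent."""
    try:
        key = tuple(index[w] for w in name.split(" "))
    except KeyError:
        return -1  # contains a word that occurs in no known name
    pos = bisect.bisect_left(table, key)
    return pos if pos < len(table) and table[pos] == key else -1

def decode(entry, words):
    """Turn an encoded entry back into the original name."""
    return " ".join(words[i] for i in entry)
```

Because the encoded table sorts in the same order as the plain names,
it can be searched without decoding any entry first.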

...
> I've also added a script that parses the names out of the NameList.txt
> file at ftp://ftp.unicode.org/Public/UNIDATA/.

Is there any reason why you didn't use the UnicodeData.txt file?
I mean, do I cover everything if I continue to use that?

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home