[Python-Dev] Unicode patches checked in

M.-A. Lemburg mal@lemburg.com
Wed, 15 Mar 2000 18:26:15 +0100


Christian Tismer wrote:
> 
> Fredrik Lundh wrote:
> >
> > CT:
> > > How do I build a dist that doesn't need to change a lot of
> > > stuff in the user's installation?
> >
> > somewhere in this thread, Guido wrote:
> >
> > > BTW, I added a tag "pre-unicode" to the CVS tree to the revisions
> > > before the Unicode changes were made.
> >
> > maybe you could base SLP on that one?
> 
> I have no idea how this works. Would this mean that I cannot
> get patches which come after the Unicode changes?
> 
> Meanwhile, I've looked into the sources. It is easy for me
> to get rid of the problem by supplying my own unicodedata.c,
> where I replace all functions with stubs raising an
> "unimplemented" exception.

No need (see my other posting): simply disable the module
altogether... this shouldn't hurt any part of the interpreter,
since the module is a user-land-only module.

> Furthermore, I wondered about the data format. Is the Unicode
> database used in your package as well? So far, I see
> only references from unicodedata.c, and that means the data
> structure can be massively improved.
> At the moment, that baby is 64k entries long, with four bytes
> and an optional string per entry.
> This is a big waste. The strings almost all consist of some
> distinct <xxx> prefixes, followed by a list of small hex words.
> All of this is stored as strings, which probably accounts for
> 80 percent of the space.

I have made no attempt to optimize the structure (mostly due
to lack of time)... the current implementation is really not
much more than a literal transcription of the UnicodeData.txt
file available at the unicode.org site.
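
For reference, the records in that file look roughly like this;
the decomposition Christian mentions is the sixth semicolon-
separated field, with the second line showing a <compat> mapping:

    00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
    FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;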

If you want to, I can mail you the marshalled Python dict version of
that database to play with.
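
Loading it then only takes a few lines (the file name here is
made up):

    import marshal

    # Hypothetical file name; the dict maps code points to the
    # per-character records from UnicodeData.txt.
    f = open("unicodedata.mar", "rb")
    db = marshal.load(f)
    f.close()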
 
> The only function that uses the "decomposition" field (i.e.
> the string) is unicodedata_decomposition. It does nothing
> more than wrap it in a PyObject.
> We can do a little better here. I guess I can bring it down
> to a third of this space without much effort, just by using
> - binary encoding of the <xxx> tags as an enumeration
> - binary encoding of the hex entries
> - omission of the spaces
> Instead of 64k structures which contain pointers anyway,
> I can use a 64k array of offsets into one packed table.
> 
> The unicodedata access functions would change *slightly*,
> just building some hex strings and so on. I guess this
> is not a time-critical section?

It may be, if these functions end up being used in codecs, so
you should pay attention to speed too...
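
Just to make sure we mean the same thing, here is a rough Python
sketch of the packing scheme (all names invented; the real thing
would of course be a static C table generated by a script):

    import struct

    # Per decomposition, instead of a string, store:
    #   1 byte  tag number (index into TAGS, 0 = canonical)
    #   1 byte  number of code points
    #   2 bytes per code point, big-endian
    # plus a 64k array of offsets into one packed table, where
    # offset 0 means "no decomposition".

    TAGS = ["", "<compat>", "<font>", "<noBreak>", "<initial>",
            "<medial>", "<final>", "<isolated>", "<circle>",
            "<super>", "<sub>", "<vertical>", "<wide>",
            "<narrow>", "<small>", "<square>", "<fraction>"]

    def pack_decomposition(decomp):
        # decomp is the raw field, e.g. '<compat> 0066 0069'
        if decomp[:1] == "<":
            tag, rest = decomp.split("> ", 1)
            tagno = TAGS.index(tag + ">")
        else:
            tagno, rest = 0, decomp
        points = [int(x, 16) for x in rest.split()]
        packed = struct.pack("BB", tagno, len(points))
        for p in points:
            packed += struct.pack(">H", p)
        return packed

    def unpack_decomposition(packed):
        # Rebuild the string form which the Python API hands out.
        tagno, n = struct.unpack("BB", packed[:2])
        points = struct.unpack(">%dH" % n, packed[2:2 + 2 * n])
        words = ["%04X" % p for p in points]
        if tagno:
            words.insert(0, TAGS[tagno])
        return " ".join(words)

The unpacking side costs one small string-building step per
lookup, which should be cheap enough even for codec use.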
 
> Should I try this evening? :-)

Sure :-) go ahead...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/