[Python-Dev] Unicode Implementation Snapshot 2000-02-18

M.-A. Lemburg mal@lemburg.com
Fri, 18 Feb 2000 00:39:17 +0100


Hi everybody,

I've just uploaded a new snapshot to the secret URL.

New in this snapshot is a generic character mapping codec
which can decode and encode a large number of code pages
used on PCs and Macs. 

I used a Unicode mapping file
parser to automatically generate the codecs from the
mapping files available at http://www.unicode.org/
and then included all those codecs whose generated Python
source (with comments) is smaller than 10k.
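
To give an idea of what such a generated codec boils down to, here is
a minimal sketch in present-day Python. It is not the parser used for
the snapshot. The sample text, SAMPLE_MAPPING and parse_mapping() are
made up for illustration, and the sample only assumes the usual layout
of the unicode.org mapping files (byte value, code point, comment).
The codecs.charmap_decode(), codecs.charmap_encode() and
codecs.make_encoding_map() helpers are the ones in today's standard
library.

```python
import codecs

# Hypothetical three-line excerpt in the layout of a unicode.org
# mapping file: <byte value> <Unicode code point> #<character name>
SAMPLE_MAPPING = """\
# Illustrative sample only, not a complete code page
0x41\t0x0041\t#LATIN CAPITAL LETTER A
0x80\t0x00C7\t#LATIN CAPITAL LETTER C WITH CEDILLA
0x81\t0x00FC\t#LATIN SMALL LETTER U WITH DIAERESIS
"""


def parse_mapping(text):
    """Return a {byte value: code point} dict for the mapping text."""
    decoding_map = {}
    for line in text.splitlines():
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # byte value without a Unicode assignment
        decoding_map[int(fields[0], 16)] = int(fields[1], 16)
    return decoding_map


decoding_map = parse_mapping(SAMPLE_MAPPING)
# Invert the table so that it can be used for encoding as well.
encoding_map = codecs.make_encoding_map(decoding_map)

text, consumed = codecs.charmap_decode(b"A\x80\x81", "strict", decoding_map)
data, _ = codecs.charmap_encode(text, "strict", encoding_map)
```

A generated codec module would essentially store the two dictionaries
as data and wrap these two calls.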

These codecs are thus available and need some serious
testing:

                       cp855.py               iso_8859_6.py
                       cp856.py               iso_8859_7.py
ascii.py               cp857.py               iso_8859_8.py
charmap.py             cp860.py               iso_8859_9.py
cp1006.py              cp861.py               koi8_r.py
cp1250.py              cp862.py               latin_1.py
cp1251.py              cp863.py               mac_cyrillic.py
cp1252.py              cp864.py               mac_greek.py
cp1253.py              cp865.py               mac_iceland.py
cp1254.py              cp866.py               mac_latin2.py
cp1255.py              cp869.py               mac_roman.py
cp1256.py              cp874.py               mac_turkish.py
cp1257.py              iso_8859_10.py         raw_unicode_escape.py
cp1258.py              iso_8859_13.py         unicode_escape.py
cp424.py               iso_8859_14.py         unicode_internal.py
cp437.py               iso_8859_15.py         utf_16.py
cp737.py               iso_8859_2.py          utf_16_be.py
cp775.py               iso_8859_3.py          utf_16_le.py
cp850.py               iso_8859_4.py          utf_8.py
cp852.py               iso_8859_5.py

All these codecs are stored in the encodings package of
the standard lib and are directly usable via the unicode(input,
encoding) and u"abc".encode(encoding) APIs.
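
For testers, a round trip through one of the codecs could look roughly
like the sketch below. It is written for present-day Python, where the
spelling is an assumption on my part: bytes.decode(encoding) takes the
place of unicode(input, encoding), and plain str objects take the
place of u"..." literals. The sample bytes are arbitrary.

```python
# Decode code page bytes into text, then encode the text back.
raw = b"\x80\x81"                  # two bytes in the upper half of cp437
text = raw.decode("cp437")         # bytes -> text
roundtrip = text.encode("cp437")   # text -> bytes

# The same two calls work for any codec in the list above.
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"
koi = russian.encode("koi8_r")
back = koi.decode("koi8_r")
```

A codec that fails such a round trip for any assigned byte value is
worth reporting.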

I would like some feedback on which of these code pages are
really in common use... we could then move the less common
ones into a separate package.

Also, I'm wondering whether we should rename the cpXXX.py files
to cp_XXX.py, or just add aliases for them to the
encodings/aliases.py file. The naming scheme
usually follows a letters_numbers_etc pattern, but for code pages
names like the above are quite common.
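
The alias route could be sketched as follows, again in present-day
Python. EXTRA_ALIASES and resolve() are hypothetical names used only
for this illustration, and the cp_XXX entries are examples of what
might be added. The encodings.aliases.aliases dictionary is the
standard library's alias table and is only read here, not modified.

```python
import encodings.aliases

# Hypothetical extra spellings mapped to the existing module names.
EXTRA_ALIASES = {
    "cp_1252": "cp1252",
    "cp_437": "cp437",
}


def resolve(name):
    """Map a user-supplied encoding name to a codec module name."""
    # Normalize case and separators before looking the name up.
    key = name.lower().replace("-", "_").replace(" ", "_")
    if key in EXTRA_ALIASES:
        return EXTRA_ALIASES[key]
    # Fall back to the standard alias table, then to the name itself.
    return encodings.aliases.aliases.get(key, key)


module_a = resolve("CP-1252")
module_b = resolve("cp_437")
module_c = resolve("windows-1252")
module_d = resolve("koi8_r")
```

With aliases, the module files keep their current names and both
spellings resolve to the same codec.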

Another feature of the patch is that it has some optimizations
for short Unicode strings. Unfortunately, the implementation
still has some bugs, so it is currently disabled. To reenable
it, edit the file Objects/unicodeobject.c and set e.g.

#define STAYALIVE_SIZE_LIMIT       5

This will cause Unicode objects whose size is below or equal
to this limit to stay alive even when they are on the free
list.

Note that this is the last patch for the coming week. I'll be
offline until 2000-02-28 and then hope to make some serious progress
on documenting the different parts (most docs are still buried
in the C and header files and the unicode proposal which is
included in the file Misc/unicode.txt).

Now it's up to you to give the code the final swirl... :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/