[Python-Dev] Unicode codec names

M.-A. Lemburg mal@lemburg.com
Mon, 28 Feb 2000 23:02:17 +0100


Hi everybody,

As you may have noticed, the latest Unicode snapshot
contains a large number of new codecs. Most of them are
based on a generic mapping codec which makes adding
new codecs a very simple (even automated) task.
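
To illustrate the idea of a mapping-driven codec, here is a minimal sketch.
It is not the code in charmap.py: the helper make_codec and the toy
table are hypothetical names made up for this example, and the sketch is
written for a current Python 3 interpreter. The point is only that a codec
can be derived entirely from a byte-to-code-point dictionary, so adding a
new code page amounts to supplying a new table.

```python
def make_codec(decoding_map):
    """Build (encode, decode) functions from a byte -> code point dict.

    Hypothetical helper for illustration; not part of any library.
    """
    # Invert the table once so encoding is also a plain dictionary lookup.
    encoding_map = {cp: byte for byte, cp in decoding_map.items()}

    def decode(data):
        return "".join(chr(decoding_map[b]) for b in data)

    def encode(text):
        return bytes(encoding_map[ord(ch)] for ch in text)

    return encode, decode


# Toy table (an assumption, not a real code page): ASCII maps to itself,
# and byte 0x80 maps to U+20AC.
toy_map = {b: b for b in range(128)}
toy_map[0x80] = 0x20AC

encode, decode = make_codec(toy_map)
```

With such a helper, decode(b"a\x80") and encode("a\u20ac") round-trip
through the table, and a real code page would differ only in the contents
of the dictionary.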

I've gotten some feedback on the compatibility between
the JPython Unicode implementation (actually the underlying
Java one) and the new CPython code. Finn Bock mentioned
that Java uses a slightly different naming scheme and
also has some differences in the code-page-to-Unicode
mappings.

* Could someone provide a list of all default code pages
and other encodings that Java supports? It would be
ideal to provide the same set for CPython, IMHO.

So far I've got these encodings:

                       cp852.py               iso_8859_5.py
                       cp855.py               iso_8859_6.py
ascii.py               cp856.py               iso_8859_7.py
charmap.py             cp857.py               iso_8859_8.py
cp037.py               cp860.py               iso_8859_9.py
cp1006.py              cp861.py               koi8_r.py
cp1250.py              cp862.py               latin_1.py
cp1251.py              cp863.py               mac_cyrillic.py
cp1252.py              cp864.py               mac_greek.py
cp1253.py              cp865.py               mac_iceland.py
cp1254.py              cp866.py               mac_latin2.py
cp1255.py              cp869.py               mac_roman.py
cp1256.py              cp874.py               mac_turkish.py
cp1257.py              iso_8859_10.py         raw_unicode_escape.py
cp1258.py              iso_8859_13.py         unicode_escape.py
cp424.py               iso_8859_14.py         unicode_internal.py
cp437.py               iso_8859_15.py         utf_16.py
cp737.py               iso_8859_2.py          utf_16_be.py
cp775.py               iso_8859_3.py          utf_16_le.py
cp850.py               iso_8859_4.py          utf_8.py

Encoding names map to these module names in the following
way:

1. convert all hyphens to underscores
2. convert all chars to lowercase
3. apply an alias dictionary to the resulting name

Thus u"abc".encode('KOI8-R') and u"abc".encode('koi8_r')
will result in the same codec being used.
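
The three steps above can be sketched as follows. This is only an
illustration of the lookup rule, not the actual search function: the
name codec_module_name and the contents of ALIASES are assumptions
made up for the example.

```python
# Hypothetical alias table for illustration; the real alias dictionary
# is maintained separately and contains different entries.
ALIASES = {
    "latin1": "latin_1",
    "iso_8859_1": "latin_1",
    "utf8": "utf_8",
}


def codec_module_name(encoding, aliases=ALIASES):
    """Map an encoding name to a codec module name."""
    name = encoding.replace("-", "_")  # 1. hyphens to underscores
    name = name.lower()                # 2. lowercase everything
    return aliases.get(name, name)     # 3. apply the alias dictionary


print(codec_module_name("KOI8-R"))
print(codec_module_name("koi8_r"))
print(codec_module_name("ISO-8859-1"))
```

Both spellings of KOI8-R end up at the same module name, which is why the
two encode() calls above pick the same codec.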

* There's also another issue: code pages with names cpXXXX
come from two sources: IBM and MS. Unfortunately, some of
these pages don't match even though they carry the same name.

Could someone verify whether the included maps work on
Windows, DOS and Mac platforms as intended? (Finn reported
some divergence between the Java view of things and the
maps I generated from the tables on the ftp.unicode.org site.)
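
One possible way to check for such divergences is to compare two
mappings byte by byte. The sketch below compares two codecs available in
the same interpreter; the helper names decode_table and diff_codecs
are hypothetical, it assumes a current Python 3 interpreter, and it only
handles single-byte code pages. To compare against Java or a given
platform, one side would have to be replaced by a table exported from
there, which is not shown.

```python
def decode_table(encoding):
    """Return a list of 256 characters, one per byte value."""
    # errors="replace" yields U+FFFD for bytes the code page leaves
    # undefined, so undefined slots still show up in the comparison.
    return [bytes([b]).decode(encoding, errors="replace") for b in range(256)]


def diff_codecs(name_a, name_b):
    """List (byte, char_a, char_b) for every byte the two codecs map differently."""
    table_a = decode_table(name_a)
    table_b = decode_table(name_b)
    return [
        (b, table_a[b], table_b[b])
        for b in range(256)
        if table_a[b] != table_b[b]
    ]


# Example: report where latin_1 and cp1252 disagree.
for byte, a, b in diff_codecs("latin_1", "cp1252"):
    print("0x%02X: U+%04X vs U+%04X" % (byte, ord(a), ord(b)))
```

An empty result means the two maps agree on all 256 byte values; any
entries point at the bytes that need a closer look.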

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/