[Python-Dev] Some thoughts on the codecs...

Andy Robinson andy@robanal.demon.co.uk
Mon, 15 Nov 1999 21:41:21 GMT


On Mon, 15 Nov 1999 20:20:55 +0100, you wrote:

>These are all great ideas, but I think they unnecessarily
>complicate the proposal.

However, to claim that Python is properly internationalized, we will
need a large number of multi-byte encodings to be available.  It's a
large amount of work, it must be provably correct, and someone's going
to have to do it.  So if anyone with more C expertise than me - not
hard :-) - is interested

I'm not suggesting putting my points in the Unicode proposal - in
fact, I'm very happy we have a proposal which allows for extension,
and lets us work on the encodings separately (and later).

>Since Codecs can be registered at runtime, there is quite
>some potential there for extension writers coding their
>own fast codecs. E.g. one could use mxTextTools as codec
>engine working at C speeds.

Exactly my thoughts, although I was thinking of a more slimmed down
and specialized one.  The right tool might be usable for things like
compression algorithms too.  Separate project to the Unicode stuff,
but if anyone is interested, talk to me.
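
As a rough illustration of the runtime registration mentioned above,
here is a minimal sketch. It assumes the codecs module as it exists in
present-day Python 3 (codecs.register and codecs.CodecInfo), which is
not necessarily the interface in the proposal under discussion. The
codec name "demo_ascii_q" and the demo_* functions are hypothetical,
made up for this example only.

```python
import codecs

CODEC_NAME = "demo_ascii_q"  # hypothetical name, not a real codec


def demo_encode(text, errors="strict"):
    # Encoder contract: return (bytes, number of input characters consumed).
    # This toy codec replaces every non-ASCII character with "?".
    data = bytes(ord(ch) if ord(ch) < 128 else ord("?") for ch in text)
    return data, len(text)


def demo_decode(data, errors="strict"):
    # Decoder contract: return (str, number of input bytes consumed).
    raw = bytes(data)
    text = "".join(chr(b) if b < 128 else "?" for b in raw)
    return text, len(raw)


def demo_search(name):
    # A search function receives the requested codec name and returns
    # a CodecInfo for names it handles, or None for everything else.
    if name == CODEC_NAME:
        return codecs.CodecInfo(
            name=CODEC_NAME, encode=demo_encode, decode=demo_decode
        )
    return None


# Register at runtime; after this the codec is usable by name.
codecs.register(demo_search)

encoded = "naïve".encode(CODEC_NAME)
decoded = b"abc".decode(CODEC_NAME)
```

An extension module could plug in the same way, with the encode and
decode callables implemented in C rather than Python.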

>I would propose to only add some very basic encodings to
>the standard distribution, e.g. the ones mentioned under
>Standard Codecs in the proposal:
>
>  'utf-8':		8-bit variable length encoding
>  'utf-16':		16-bit variable length encoding (little/big endian)
>  'utf-16-le':		utf-16 but explicitly little endian
>  'utf-16-be':		utf-16 but explicitly big endian
>  'ascii':		7-bit ASCII codepage
>  'latin-1':		Latin-1 codepage
>  'html-entities':	Latin-1 + HTML entities;
>			see htmlentitydefs.py from the standard Python Lib
>  'jis' (a popular version XXX):
>			Japanese character encoding
>  'unicode-escape':	See Unicode Constructors for a definition
>  'native':		Dump of the Internal Format used by Python
>
Leave JISXXX and the CJK stuff out.  If you get into Japanese, you
really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there
are lots of options about how to do it.  The other ones are
algorithmic and can be small and fast and fit into the core.
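
To make "algorithmic" concrete, here is a sketch of a UTF-8 encoder
written from the standard bit layout. It needs no mapping tables, only
shifts and masks, whereas ShiftJIS, EUC-JP and JIS need large lookup
tables. The function names are made up for this example, and this is
not how any real codec is implemented.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point (an int) as UTF-8 bytes."""
    # Reject values outside the Unicode range and the surrogate block.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % cp)
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def utf8_encode(text):
    """Encode a whole string by encoding each code point in turn."""
    return b"".join(utf8_encode_code_point(ord(ch)) for ch in text)
```

The result can be checked against the built-in codec, for example by
comparing utf8_encode(s) with s.encode("utf-8") for a few strings.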

Ditto with HTML, and maybe even escaped-unicode too.

In summary, the current discussion is clearly doing the right things,
but is only covering a small percentage of what needs to be done to
internationalize Python fully.

- Andy