[Python-Dev] Some thoughts on the codecs...

M.-A. Lemburg mal@lemburg.com
Tue, 16 Nov 1999 13:54:51 +0100


Fredrik Lundh wrote:
> 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8': 8-bit variable length encoding
> >   'utf-16': 16-bit variable length encoding (little/big endian)
> >   'utf-16-le': utf-16 but explicitly little endian
> >   'utf-16-be': utf-16 but explicitly big endian
> >   'ascii': 7-bit ASCII codepage
> >   'latin-1': Latin-1 codepage
> >   'html-entities': Latin-1 + HTML entities;
> >                    see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> >                    Japanese character encoding
> >   'unicode-escape': See Unicode Constructors for a definition
> >   'native': Dump of the Internal Format used by Python
> 
> since this is already very close, maybe we could adopt
> the naming guidelines from XML:
> 
>     In an encoding declaration, the values "UTF-8", "UTF-16",
>     "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
>     for the various encodings and transformations of
>     Unicode/ISO/IEC 10646, the values "ISO-8859-1",
>     "ISO-8859-2", ... "ISO-8859-9" should be used for the parts
>     of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS",
>     and "EUC-JP" should be used for the various encoded
>     forms of JIS X-0208-1997.
> 
>     XML processors may recognize other encodings; it is
>     recommended that character encodings registered
>     (as charsets) with the Internet Assigned Numbers
>     Authority [IANA], other than those just listed,
>     should be referred to using their registered names.
> 
>     Note that these registered names are defined to be
>     case-insensitive, so processors wishing to match
>     against them should do so in a case-insensitive way.
> 
> (ie "iso-8859-1" instead of "latin-1", etc -- at least as
> aliases...).

From the proposal:
"""
General Remarks:
----------------

· Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

  Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
  written as 'utf-16'.
"""
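
As a rough sketch of how that rule might look in Python, with
"iso-8859-1" accepted as an alias along the lines you suggest. The
function name and the alias table are hypothetical and not part of
the proposal:

```python
# Hypothetical alias table; the single entry is illustrative only.
ALIASES = {
    "iso-8859-1": "latin-1",
}

def normalize_encoding_name(name):
    """Return a lower-case, hyphenated encoding name with aliases resolved."""
    # Lower-case the input and turn each run of spaces into one hyphen,
    # so that 'UTF 16' and 'utf-16' compare equal.
    key = "-".join(name.lower().split())
    # Map a known alias onto the name used in the proposal's list.
    return ALIASES.get(key, key)
```

With this, 'UTF 16' and 'utf-16' name the same codec, and
'ISO-8859-1' resolves to 'latin-1'.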

Is there a naming scheme definition for these encoding names?
(The quote you gave above doesn't really sound like a definition
to me.)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/