[ python-Bugs-1249749 ] Encodings and aliases do not match runtime

SourceForge.net noreply at sourceforge.net
Thu Aug 11 07:56:42 CEST 2005


Bugs item #1249749, was opened at 2005-08-01 20:23
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1249749&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Documentation
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: liturgist (liturgist)
Assigned to: Nobody/Anonymous (nobody)
Summary: Encodings and aliases do not match runtime

Initial Comment:
The 2.4.1 documentation has a list of standard encodings
in section 4.9.2.  However, this list does not seem to match
what the runtime returns.  Below is code to dump out the
encodings and their aliases.  Please tell me if anything
is incorrect.

In some cases, there are many more valid aliases than
listed in the documentation.  See 'cp037' as an example.

I see that the identifiers are intended to be case
insensitive.  I would prefer to see the documentation
provide the identifiers as they will appear in
encodings.aliases.aliases.  The only alias containing
any upper case letters appears to be 'csHPRoman8' (an
alias of 'hp_roman8').
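
For example, all of the following lookups resolve to the
same codec regardless of case and punctuation (a minimal
check only, in the same Python 2 style as the script below;
it assumes codecs.lookup() is the right probe):

import codecs

# Names are normalized (lower-cased, '-' and ' ' mapped to '_')
# before the alias table is consulted, so these all reach the
# same codec.
for name in ('latin_1', 'Latin-1', 'LATIN-1', 'iso-8859-1'):
    codecs.lookup(name)    # raises LookupError for unknown names
print 'all lookups succeeded'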

$ cat encodingaliases.py
#!/usr/bin/env python
import sys
import encodings

def main():
    enchash = {}

    for enc in encodings.aliases.aliases.values():
        enchash[enc] = []
    for encalias in encodings.aliases.aliases.keys():
        enchash[encodings.aliases.aliases[encalias]].append(encalias)

    elist = enchash.keys()
    elist.sort()
    for enc in elist:
        print enc, enchash[enc]

if __name__ == '__main__':
    main()
    sys.exit(0)
13:12 pwatson [ ruth.knightsbridge.com:/home/pwatson/src/python ] 366
$ ./encodingaliases.py
ascii ['iso_ir_6', 'ansi_x3_4_1968', 'ibm367',
'iso646_us', 'us', 'cp367', '646', 'us_ascii',
'csascii', 'ansi_x3.4_1986', 'iso_646.irv_1991',
'ansi_x3.4_1968']
base64_codec ['base_64', 'base64']
big5 ['csbig5', 'big5_tw']
big5hkscs ['hkscs', 'big5_hkscs']
bz2_codec ['bz2']
cp037 ['ebcdic_cp_wt', 'ebcdic_cp_us', 'ebcdic_cp_nl',
'037', 'ibm039', 'ibm037', 'csibm037', 'ebcdic_cp_ca']
cp1026 ['csibm1026', 'ibm1026', '1026']
cp1140 ['1140', 'ibm1140']
cp1250 ['1250', 'windows_1250']
cp1251 ['1251', 'windows_1251']
cp1252 ['windows_1252', '1252']
cp1253 ['1253', 'windows_1253']
cp1254 ['1254', 'windows_1254']
cp1255 ['1255', 'windows_1255']
cp1256 ['1256', 'windows_1256']
cp1257 ['1257', 'windows_1257']
cp1258 ['1258', 'windows_1258']
cp424 ['ebcdic_cp_he', 'ibm424', '424', 'csibm424']
cp437 ['ibm437', '437', 'cspc8codepage437']
cp500 ['csibm500', 'ibm500', '500', 'ebcdic_cp_ch',
'ebcdic_cp_be']
cp775 ['cspc775baltic', '775', 'ibm775']
cp850 ['ibm850', 'cspc850multilingual', '850']
cp852 ['ibm852', '852', 'cspcp852']
cp855 ['csibm855', 'ibm855', '855']
cp857 ['csibm857', 'ibm857', '857']
cp860 ['csibm860', 'ibm860', '860']
cp861 ['csibm861', 'cp_is', 'ibm861', '861']
cp862 ['cspc862latinhebrew', 'ibm862', '862']
cp863 ['csibm863', 'ibm863', '863']
cp864 ['csibm864', 'ibm864', '864']
cp865 ['csibm865', 'ibm865', '865']
cp866 ['csibm866', 'ibm866', '866']
cp869 ['csibm869', 'ibm869', '869', 'cp_gr']
cp932 ['mskanji', '932', 'ms932', 'ms_kanji']
cp949 ['uhc', 'ms949', '949']
cp950 ['ms950', '950']
euc_jis_2004 ['eucjis2004', 'jisx0213', 'euc_jis2004']
euc_jisx0213 ['eucjisx0213']
euc_jp ['eucjp', 'ujis', 'u_jis']
euc_kr ['ksc5601', 'korean', 'euckr', 'ksx1001',
'ks_c_5601', 'ks_c_5601_1987', 'ks_x_1001']
gb18030 ['gb18030_2000']
gb2312 ['chinese', 'euc_cn', 'csiso58gb231280',
'iso_ir_58', 'euccn', 'eucgb2312_cn', 'gb2312_1980',
'gb2312_80']
gbk ['cp936', 'ms936', '936']
hex_codec ['hex']
hp_roman8 ['csHPRoman8', 'r8', 'roman8']
hz ['hzgb', 'hz_gb_2312', 'hz_gb']
iso2022_jp ['iso2022jp', 'iso_2022_jp', 'csiso2022jp']
iso2022_jp_1 ['iso_2022_jp_1', 'iso2022jp_1']
iso2022_jp_2 ['iso_2022_jp_2', 'iso2022jp_2']
iso2022_jp_2004 ['iso_2022_jp_2004', 'iso2022jp_2004']
iso2022_jp_3 ['iso_2022_jp_3', 'iso2022jp_3']
iso2022_jp_ext ['iso2022jp_ext', 'iso_2022_jp_ext']
iso2022_kr ['iso_2022_kr', 'iso2022kr', 'csiso2022kr']
iso8859_10 ['csisolatin6', 'l6', 'iso_8859_10_1992',
'iso_ir_157', 'iso_8859_10', 'latin6']
iso8859_11 ['iso_8859_11', 'thai', 'iso_8859_11_2001']
iso8859_13 ['iso_8859_13']
iso8859_14 ['iso_celtic', 'iso_ir_199', 'l8',
'iso_8859_14_1998', 'iso_8859_14', 'latin8']
iso8859_15 ['iso_8859_15']
iso8859_16 ['iso_8859_16_2001', 'l10', 'iso_ir_226',
'latin10', 'iso_8859_16']
iso8859_2 ['l2', 'csisolatin2', 'iso_ir_101',
'iso_8859_2', 'iso_8859_2_1987', 'latin2']
iso8859_3 ['iso_8859_3_1988', 'l3', 'iso_ir_109',
'csisolatin3', 'iso_8859_3', 'latin3']
iso8859_4 ['csisolatin4', 'l4', 'iso_ir_110',
'iso_8859_4', 'iso_8859_4_1988', 'latin4']
iso8859_5 ['iso_8859_5_1988', 'iso_8859_5', 'cyrillic',
'csisolatincyrillic', 'iso_ir_144']
iso8859_6 ['iso_8859_6_1987', 'iso_ir_127',
'csisolatinarabic', 'asmo_708', 'iso_8859_6',
'ecma_114', 'arabic']
iso8859_7 ['ecma_118', 'greek8', 'iso_8859_7',
'iso_ir_126', 'elot_928', 'iso_8859_7_1987',
'csisolatingreek', 'greek']
iso8859_8 ['iso_8859_8_1988', 'iso_ir_138',
'iso_8859_8', 'csisolatinhebrew', 'hebrew']
iso8859_9 ['l5', 'iso_8859_9_1989', 'iso_8859_9',
'csisolatin5', 'latin5', 'iso_ir_148']
johab ['cp1361', 'ms1361']
koi8_r ['cskoi8r']
latin_1 ['iso8859', 'csisolatin1', 'latin', 'l1',
'iso_ir_100', 'ibm819', 'cp819', 'iso_8859_1',
'latin1', 'iso_8859_1_1987', '8859']
mac_cyrillic ['maccyrillic']
mac_greek ['macgreek']
mac_iceland ['maciceland']
mac_latin2 ['maccentraleurope', 'maclatin2']
mac_roman ['macroman']
mac_turkish ['macturkish']
mbcs ['dbcs']
ptcp154 ['cp154', 'cyrillic-asian', 'csptcp154', 'pt154']
quopri_codec ['quopri', 'quoted_printable',
'quotedprintable']
rot_13 ['rot13']
shift_jis ['s_jis', 'sjis', 'shiftjis', 'csshiftjis']
shift_jis_2004 ['shiftjis2004', 's_jis_2004', 'sjis_2004']
shift_jisx0213 ['shiftjisx0213', 'sjisx0213', 's_jisx0213']
tactis ['tis260']
tis_620 ['tis620', 'tis_620_2529_1', 'tis_620_2529_0',
'iso_ir_166', 'tis_620_0']
utf_16 ['utf16', 'u16']
utf_16_be ['utf_16be', 'unicodebigunmarked']
utf_16_le ['utf_16le', 'unicodelittleunmarked']
utf_7 ['u7', 'utf7']
utf_8 ['u8', 'utf', 'utf8_ucs4', 'utf8_ucs2', 'utf8']
uu_codec ['uu']
zlib_codec ['zlib', 'zip']


----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2005-08-11 07:56

Message:
Logged In: YES 
user_id=21627

I think the presence of iso8859_1.py is a bug, resulting
from the automatic generation of these files. The file should
be deleted; iso8859-1 should be handled through the alias to
latin-1. Thanks for pointing that out.
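
(Roughly, the alias route would look like this; the added
dictionary entry is only a sketch of the proposed fix, not
current behaviour:)

import encodings.aliases

# 'iso_8859_1' already points at the latin_1 codec via the alias
# table; once the generated iso8859_1.py module is dropped, the
# bare spelling could be covered by one more (hypothetical) entry.
print encodings.aliases.aliases['iso_8859_1']        # -> latin_1
encodings.aliases.aliases['iso8859_1'] = 'latin_1'   # proposed alias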

----------------------------------------------------------------------

Comment By: liturgist (liturgist)
Date: 2005-08-11 04:54

Message:
Logged In: YES 
user_id=197677

For example: there appears to be a codec for iso8859-1, but
it has no alias in the encodings.aliases.aliases list and it
is not in the current documentation.

What is the relationship of iso8859_1 to latin_1?  Should
iso8859_1 be considered a base codec?  When should iso8859_1
be used rather than latin_1?
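
A rough way to compare the two by hand, assuming both names
resolve to working codecs on this build (a sketch only):

# Decode every possible byte value with each codec and compare.
data = ''.join(map(chr, range(256)))
print 'identical decoding:', data.decode('iso8859_1') == data.decode('latin_1')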

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2005-08-10 23:59

Message:
Logged In: YES 
user_id=21627

I do see a problem with generating these tables
automatically. It suggests to the reader that the aliases
are all equally relevant. However, I bet few people have
ever heard of or used, say, 'cspc850multilingual'.

As for the actual patch: Please don't generate HTML.
Instead, TeX should be generated, as this is the primary
source. Also please add a patch to the current TeX file,
updating it appropriately.

----------------------------------------------------------------------

Comment By: liturgist (liturgist)
Date: 2005-08-10 23:29

Message:
Logged In: YES 
user_id=197677

The script attached generates two HTML tables in files
specified on the command line.

    usage:  encodingaliases.py <language-oriented-codecs-html-file>
                               <non-language-oriented-codecs-html-file>

The script uses a static list of codecs because the
language descriptions are not available from the Python
runtime.  Codecs found in the encodings.aliases.aliases
mapping but not in that list are still added, but will be
described as "unknown" encodings.
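
Roughly, the merge looks like this (a sketch; the small
'descriptions' table and variable names here are illustrative,
the attached script carries the full list taken from the 2.4.1
documentation):

import encodings.aliases

# Hypothetical static description table; anything the runtime
# knows about but the table does not is reported as "unknown".
descriptions = {'ascii': 'English', 'utf_8': 'all languages'}

byalias = {}
for alias, codec in encodings.aliases.aliases.items():
    byalias.setdefault(codec, []).append(alias)

for codec in sorted(byalias.keys()):
    print codec, descriptions.get(codec, 'unknown'), byalias[codec]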

The "bijectiveType" was, like the language descriptions,
taken from the current (2.4.1) documentation.

It would be much better for the descriptions and "bijective"
type settings to come from the runtime.  The problem is one
of maintenance.  Without these available for introspection
in the runtime, a new encoding with no alias will never be
picked up by the script at all, and when it does appear with
an alias it can only be described as "unknown."

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-08-06 14:49

Message:
Logged In: YES 
user_id=38388

Martin, I don't see any problem with putting the complete
list of aliases into the documentation.

liturgist, don't worry about hard-coding things into the
script. The extra information Martin gave in the table is
not likely to become part of the standard lib, because
there's not a lot you can do with it programmatically.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2005-08-06 14:41

Message:
Logged In: YES 
user_id=21627

I would not like to see the documentation contain a complete
list of all aliases. The documentation points out that these
are "a few common aliases", i.e. I selected aliases that
people are likely to encounter and are encouraged to use.

I don't think it is useful to produce the table from the
code. If you want to know everything in aliases, just look
at aliases directly.

----------------------------------------------------------------------

Comment By: liturgist (liturgist)
Date: 2005-08-05 19:53

Message:
Logged In: YES 
user_id=197677

I would very much like to produce the doc table from code. 
However, I have a few questions.

It seems that encodings.aliases.aliases covers all encodings
and not necessarily just those supported on a particular
machine, e.g. mbcs on UNIX, or embedded systems that might
exclude some large character sets to save space.  Is this
correct?  If so, will it remain that way?

To find out if an encoding is supported on the current
machine, the code should handle the exception generated when
codecs.lookup() fails.  Right?
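
Presumably something along these lines would work (a sketch;
the variable names are illustrative, and 'mbcs' is just the
codec most likely to be missing off Windows):

import codecs
import encodings.aliases

# Probe every codec named in the alias table; a LookupError
# means the codec is not available in this particular build.
available = []
missing = []
for codec in set(encodings.aliases.aliases.values()):
    try:
        codecs.lookup(codec)
        available.append(codec)
    except LookupError:
        missing.append(codec)
print 'available:', len(available)
print 'missing:  ', missing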

To generate the table, I need to produce the "Languages"
field.  This information does not seem to be available from
the Python runtime.  I would much rather see this
information, including a localized version of the string,
come from the runtime than hardcode it into the script.
Is that a possibility?  Would it be a better approach?

The non-language oriented encodings such as base_64 and
rot_13 do not seem to have anything that distinguishes them
from the human-language encodings.  How can these be
separated out without hardcoding?

Likewise, the non-language encodings have an "Operand type"
field which would need to be generated.  My feeling is,
again, that this should come from the Python runtime and not
be hardcoded into the doc generation script.  Any suggestions?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-08-04 16:47

Message:
Logged In: YES 
user_id=38388

Doc patches are welcome - perhaps you could enhance your
script to have the doc table generated from the available
codecs and aliases?!

Thanks.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1249749&group_id=5470

