[Python-ideas] [issue33865] [EASY] Missing code page aliases: "unknown encoding: 874"

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Sun Jun 17 08:02:02 EDT 2018


Folks.  There are standards.  "1252" *is not* an alias for
"windows-1252" according to the IANA, while "866" *is* an alias for
"IBM866" according to the same authority.  Most 3-digit "IBMxxx" ARE
aliased to both "cpxxx" and just "xxx", but not all.  None of
"IBM874", "874", or "cp874" exists according to the IANA.

https://www.iana.org/assignments/character-sets/character-sets.xhtml

For the reasons Steven gave, I would say omit the digits-only aliases,
but if we must use them because "there's a standard" (or backward
compatibility), we should stick to those defined by standard, and only
those.
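
To see how many digits-only aliases are actually at stake, one can
inspect the alias table that ships with CPython. A minimal sketch,
assuming the table is the `aliases` dict in the stdlib module
`encodings.aliases`; the helper name `digit_only_aliases` is made up
for illustration, and the output varies by Python version:

```python
import encodings.aliases


def digit_only_aliases():
    """Return {alias: codec module name} for aliases made only of digits."""
    table = encodings.aliases.aliases
    return {alias: target for alias, target in table.items() if alias.isdigit()}


if __name__ == "__main__":
    # Sort numerically so "037" comes before "1252".
    found = digit_only_aliases()
    for alias in sorted(found, key=int):
        print(f"{alias:>6} -> {found[alias]}")
```

Each alias printed this way could then be checked by hand against the
IANA registry.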

If we're following other standards that I'm unaware of, fine, but
let's cite them rather than randomly introduce a plethora of aliases
because they "look like" an existing (and unfortunate) standard.

There's also some other weirdness with "windows-874", see below.  We
(somebody) should check other "windows-xxx" character sets to make
sure they're not misnamed "cpxxx".

Steven D'Aprano writes:
 > > It is easy to test it. Encoding/decoding with '874' should give the 
 > > same result as with 'cp874'.
 > 
 > I know it is too late to remove that feature, but why do we support 
 > digit-only IDs for encodings? They can be ambiguous. If Wikipedia is 
 > correct, cp874 (also known as ibm874) and Windows-874 (also known as 
 > cp1162) are different:

According to the IANA, they're not necessarily ambiguous.  Here is
the entry for IBM866:

IBM866  2086  IBM NLDG Volume 2               cp866
              (SE09-8002-03) August 1994      866
              [Rick_Pond]                     csIBM866

where the entries in column 4 show the registered aliases.  There are
at least a dozen IBMxxx character sets with 'xxx' aliases.
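
The equivalence test proposed in the quoted message can be sketched
as follows. It assumes both names refer to single-byte code pages, so
comparing the 256 one-byte decodings is sufficient; `same_mapping` is
a hypothetical helper, not an existing API.

```python
def same_mapping(name_a, name_b):
    """True if both names decode every single byte 0x00-0xFF identically.

    Only meaningful for single-byte encodings (an assumption of this sketch).
    Raises LookupError if either name is unknown to the interpreter.
    """
    for value in range(256):
        raw = bytes([value])
        # errors="replace" yields U+FFFD for undefined bytes instead of
        # raising, so holes in a code page are compared as well.
        if raw.decode(name_a, errors="replace") != raw.decode(name_b, errors="replace"):
            return False
    return True


if __name__ == "__main__":
    print(same_mapping("866", "cp866"))
    print(same_mapping("cp866", "cp1252"))
```

Note that this only shows that two names agree inside Python; it says
nothing about whether the alias is registered with the IANA.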

I don't understand what's with "cp874", though.  We can surely take
that one back, although we'd better hurry if it's in 3.7rc.  We might
want to add "windows-874" (which doesn't seem to be present in Python
3.6), since that's the standard character set name per IANA.
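
A quick way to check which of these names a given interpreter accepts
is to ask `codecs.lookup`. This is only a sketch: `resolve` is a
hypothetical helper, and what it reports for "874" or "windows-874"
depends on the Python version being run.

```python
import codecs


def resolve(name):
    """Return the codec name Python maps `name` to, or None if unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


if __name__ == "__main__":
    names = ("866", "cp866", "IBM866", "874", "cp874",
             "windows-874", "1252", "windows-1252")
    for name in names:
        print(f"{name!r:>16} -> {resolve(name)}")
```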

The confusion between cp874 and windows-874 may arise because in
VENDORS/MICSFT/WINDOWS the mapping file is named CP874.TXT (all the
code pages there follow that naming).

 > https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_874
 > 
 > https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_1162

I don't know where Wikipedia's information comes from, but it's not
the IANA.


-- 
Associate Professor              Division of Policy and Planning Science
http://turnbull.sk.tsukuba.ac.jp/     Faculty of Systems and Information
Email: turnbull at sk.tsukuba.ac.jp                   University of Tsukuba
Tel: 029-853-5175                 Tennodai 1-1-1, Tsukuba 305-8573 JAPAN

