[Python-Dev] Ill-defined encoding for CP875?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sat, 12 May 2001 22:12:39 +0200


> But I don't know whether the ambiguity in cp875 is a bug or an
> undocumented feature

The official (as in "as official as it gets") mapping between CP 875
and Unicode is at

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP875.TXT

This is also the file which served as an input to generate cp875.py.

Character U+001A, which is what all of these characters map to, is
indeed named "SUBSTITUTE", apparently following the definition in

http://www.its.bldrdoc.gov/fs-1037/dir-035/_5170.htm

# substitute character (SUB): A control character that is used in the
# place of a character that is recognized to be invalid or in error or
# that cannot be represented on a given device.

That would suggest that these characters in EBCDIC 875 do not have
equivalents in Unicode. However,

http://www.kostis.net/charsets/ebc875.htm

suggests that the characters in question (3F, DC, E1, EC, ED, FC, and
FD) have no character meaning at all.
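
To see the ambiguity directly, here is a rough sketch that groups the
bytes of a single-byte codec by the character they decode to. The helper
names (bytes_by_target, ambiguous_targets) are made up for illustration,
and the sketch assumes a current Python 3 whose cp875 codec is still
generated from the CP875.TXT file above.

```python
from collections import defaultdict


def bytes_by_target(codec):
    """Map each decoded character to the list of bytes that produce it."""
    targets = defaultdict(list)
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is undefined in this codec
        targets[ch].append(b)
    return targets


def ambiguous_targets(codec):
    """Return only the characters that more than one byte decodes to."""
    return {ch: bs for ch, bs in bytes_by_target(codec).items() if len(bs) > 1}


if __name__ == "__main__":
    # List every character in cp875 that has more than one source byte.
    for ch, bs in ambiguous_targets("cp875").items():
        print("U+%04X <- %s" % (ord(ch), " ".join("%02X" % b for b in bs)))
```

If the codec matches the mapping file, the only ambiguous target
reported should be U+001A.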

It seems that IBM's ICU library also maps U+001A to character 3F, see

http://oss.software.ibm.com/developerworks/opensource/cvs/icu/data/ibm-875_P100-2000.ucm?rev=1.1&content-type=text/x-cvsweb-markup

It appears, from looking at

http://www.natural-innovations.com/boo/asciiebcdic.html

that byte 3F *is* the substitution character in EBCDIC. So it is a bug
in the CP875 codec to encode Unicode SUBSTITUTE as an arbitrary one of
the EBCDIC bytes that decode to SUBSTITUTE; I think cp875 should be
corrected to always map U+001A to 3F. That is not something the
generator can currently do, though.
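
As a sketch of what such a correction could look like (this is not how
the generator works today, and build_encoding_map with its "preferred"
argument is purely hypothetical), one could invert the decoding map
naively and then pin selected code points to a chosen byte:

```python
def build_encoding_map(decoding_map, preferred=None):
    """Invert a byte -> code point map, then apply explicit overrides.

    `preferred` maps a code point to the byte that must win when several
    bytes decode to that code point.
    """
    encoding_map = {}
    for byte, code_point in decoding_map.items():
        # Naive inversion: a later byte silently overwrites an earlier one.
        encoding_map[code_point] = byte
    for code_point, byte in (preferred or {}).items():
        if decoding_map.get(byte) != code_point:
            raise ValueError(
                "byte 0x%02X does not decode to U+%04X" % (byte, code_point)
            )
        encoding_map[code_point] = byte
    return encoding_map


# Toy excerpt of the CP875 table: the seven SUBSTITUTE bytes discussed in
# this thread, plus 0x40 (SPACE) as an unambiguous entry.
toy_decoding_map = {b: 0x1A for b in (0x3F, 0xDC, 0xE1, 0xEC, 0xED, 0xFC, 0xFD)}
toy_decoding_map[0x40] = 0x20

encoding_map = build_encoding_map(toy_decoding_map, preferred={0x1A: 0x3F})
```

Without the override, the naive inversion of this toy table ends up
encoding U+001A as FD, simply because FD comes last; with it, U+001A is
always encoded as 3F.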

So I think we can take one of two approaches:

1. admit that CP 875 is not round-trippable, and exclude it from the
   test (although when looking at the first 128 characters only, it
   is round-trippable).
2. remove the SUBSTITUTE mappings from CP875, acknowledging that
   apparently these characters have no meaning in that code page.
   Unfortunately, I could not find any official IBM documentation
   page that lists the characters supported in each of the EBCDIC
   code pages.
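
For reference, the two notions of "round-trippable" mentioned in the
first option can be checked roughly as follows. This is only a sketch
with made-up function names, not the actual test in the test suite; it
assumes a current Python 3 and a single-byte codec.

```python
def byte_round_trip_failures(codec, byte_values=range(256)):
    """Return the bytes b for which decode-then-encode does not give b back."""
    failures = []
    for b in byte_values:
        original = bytes([b])
        try:
            if original.decode(codec).encode(codec) != original:
                failures.append(b)
        except UnicodeError:
            failures.append(b)  # undefined in one of the two directions
    return failures


def char_round_trip_failures(codec, code_points=range(128)):
    """Return the code points c for which encode-then-decode does not give c back."""
    failures = []
    for c in code_points:
        original = chr(c)
        try:
            if original.encode(codec).decode(codec) != original:
                failures.append(c)
        except UnicodeError:
            failures.append(c)  # not representable in this codec
    return failures


if __name__ == "__main__":
    print("bytes failing:", ["%02X" % b for b in byte_round_trip_failures("cp875")])
    print("chars failing:", ["U+%04X" % c for c in char_round_trip_failures("cp875")])
```

The first function walks all 256 bytes, which is where the SUBSTITUTE
bytes get in the way; the second walks only the first 128 Unicode
characters, which is the restricted case from the first option.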

The second seems to be more correct to me, although it is a deviation
from the Unicode consortium publications.

Regards,
Martin