[Python-Dev] RE: Ill-defined encoding for CP875?

Tim Peters tim.one@home.com
Sat, 12 May 2001 17:48:38 -0400


[Martin v. Loewis, whose encyclopedic knowledge of encoding details
 still isn't enough to get a clear answer (it's like somebody asking
 me for a simple answer to a floating point question <wink>)]

> ...
> So I think we can take one of two approaches:
>
> 1. admit that CP 875 is not round-trippable, and exclude it from the
>    test (although when looking at the first 128 characters only, it
>    is round-trippable).

As I noted later, 875 is already excluded from the roundtrip test across
range(128, 256).  What it's failing is the roundtrip test across range(128):
after unicode("?", "cp875") produces u'\x1a', the following .encode('cp875')
has no way to know which range the original input came from.  So it's not
really round-trippable across range(128) either unless more info is given to
.encode().
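
The ambiguity can be sketched with a toy table. This is not the real
cp875.py: only the 0x3f -> U+001A entry comes from the discussion above, the
0xe1 entry is a hypothetical stand-in for "some byte in range(128, 256)" that
also maps to SUBSTITUTE, and the names DECODING_MAP, build_encoding_map and
roundtrips are made up for illustration. It is written for current Python
rather than the 2001 interpreter.

```python
# Toy decoding table: byte value -> Unicode character.
DECODING_MAP = {
    0x3F: "\x1a",  # SUBSTITUTE, the byte the test reaches via "?"
    0xE1: "\x1a",  # hypothetical upper-range byte, also SUBSTITUTE
    0xC1: "A",     # an unambiguous entry, for contrast
}


def build_encoding_map(decoding_map):
    """Invert the table; a later duplicate silently overwrites an earlier one."""
    encoding_map = {}
    for byte, char in decoding_map.items():
        encoding_map[char] = byte
    return encoding_map


ENCODING_MAP = build_encoding_map(DECODING_MAP)


def roundtrips(byte):
    """True if decoding and then encoding gives back the same byte."""
    return ENCODING_MAP[DECODING_MAP[byte]] == byte


# At most one of the bytes sharing U+001A can survive the round trip.
failures = [b for b in sorted(DECODING_MAP) if not roundtrips(b)]
print([hex(b) for b in failures])
```

Whichever duplicate the inverse table happens to keep, the other one fails,
so any tested range that includes a losing byte cannot pass.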

> 2. remove the SUBSTITUTE mappings from CP875, acknowledging that
>    apparently these characters have no meaning in that code page.
>    Unfortunately, I could not find any official IBM documentation
>    page that lists the characters supported in each of the EBCDIC
>    code pages.
>
> The second seems to be more correct to me, although it is a deviation
> from the Unicode consortium publications.

Until you and MAL agree on the best thing to do (I have no opinion:  my only
exposure to Unicode in daily programming life remains the Python test suite),
I'm going to opt for #1:  as cp875.py stands today, it's simply a fact that
it's not round-trippable across any range including 0x3f.