[Python-Dev] Python 2.5.1 ported to z/OS and EBCDIC

Lauri Alanko lealanko at ssh.com
Mon Oct 22 14:45:56 CEST 2007


Hello.

Based on Jean-Yves Mengant's work on previous versions, I have ported
Python 2.5.1 to z/OS. A patch against current svn head is attached to
<http://bugs.python.org/issue1298>. The same patch should work with very
little changes also against pristine 2.5.1 sources. (The only failing
hunk is for Modules/makesetup, and it is quite trivial.)

I have no opinion on whether the patch should eventually be incorporated
into the main distribution. The port was motivated by internal reasons,
and I'm merely offering it as a community service to anyone else who
might be interested. If Jean-Yves wishes to distribute it from his
z/OS-page, that is fine with me. In general, anyone can do what they
want with the patch, but please give credit.

I'll describe some of the porting issues below.


CHARACTER SETS
==============

The biggest difficulty with z/OS is of course the character set. There
are lots of ASCII dependencies in the Python code, and z/OS uses CP1047,
an EBCDIC variant, which is utterly incompatible with ASCII.

There are two possible approaches in this situation. One is to keep on
using ASCII as the execution character set (and also as the default
encoding of string objects), and to add conversion support everywhere
we do text-based I/O, so that communication with the external world
still happens in EBCDIC. This would have been feasible, since the z/OS C
compiler does support ASCII as the execution character set. (The source
character set would still remain EBCDIC, though. If you've ever wondered
why the C standard makes a distinction between these, here's a prime
example of a situation where they're different.)

However, I decided against this approach. The I/O conversions would have
been deeply magical, and would have required the classic "text mode vs.
binary mode" crap, which would be rather confusing.

Instead, I followed Jean-Yves' example and kept Python a "native"
EBCDIC application: 8-bit data is treated by default as EBCDIC
everywhere. This only required fixing various ASCII-specific bits in the
code, e.g. stuff like this (in PyString_DecodeEscape):

-		else if (c < ' ' || c >= 0x7f)
+		else if (!isprint((unsigned char) c))

Of course, this now allows unescaped printing of characters that are
printable in the platform's encoding even if they wouldn't be printable
in ASCII. I'm not sure whether this is desirable. It would be simple to
restrict this so that only characters in the ASCII _character set_ are
displayed verbatim.

A result of making strings EBCDIC-native is that it breaks any code that
depends on string literals being in ASCII. This probably applies to most
network protocol implementations written in Python. On the other hand,
making string literals use ASCII would break code that does ordinary
text processing on local files. Damned if you do, damned if you don't.
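
For instance, a protocol implementation that needs literal ASCII octets
on the wire has to ask for them explicitly. One portable spelling (my
suggestion, not something the patch enforces):

# Portable on both ASCII and EBCDIC builds: go through a unicode
# literal and request the ASCII octets explicitly.
request = u'GET / HTTP/1.0\r\n\r\n'.encode('ascii')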

The real issue is that strings in Python are rather underspecified.
String objects are really just octet sequences without any _inherent_
textual interpretation for them. This is apparent from the fact that
strings are what are read from and written to binary files, and also
what unicode strings are encoded to and decoded from. However, Python
syntax allows specifying an octet sequence with a _character_ sequence
(i.e. a string literal), and the relationship between the source
characters and the resulting octets has been left implicit. So
programmers aren't really encouraged to think about character set issues
and the end result is code that only works on a platform that uses ASCII
everywhere.

Python already has the property that the meaning of a source file
depends on its encoding: if I write a string literal with some latin-1
characters, the resulting octet sequence depends on whether my source
was encoded in latin-1 or utf-8. I'm not sure if this is a good idea,
but my approach with the z/OS port continues the tradition: when your
source is in EBCDIC, the string literals get encoded in EBCDIC.
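
For example, a byte string literal containing the letter a-umlaut keeps
the raw source octets, so the same program text means different data
depending on how the file was saved:

# In a file declared "# -*- coding: latin-1 -*-":
#     map(ord, 'ä')   ==>  [228]
# In the same file saved as utf-8 (with the coding line updated):
#     map(ord, 'ä')   ==>  [195, 164]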

All this just shows that treating plain octet sequences as "strings"
simply won't work in the long run. You have to have separate type for
_textual_ data (i.e. Unicode strings, in Python), and encode and decode
between those and octet sequences using some _explicit_ encoding. Of
course, all non-English-speaking people have been keenly aware of this
already for ages. The relative universality of ASCII is an exception
amongst encodings rather than the norm. It's only reasonable that
English text should require the same attention to encodings as all the
other languages do.


UNICODE
-------

The biggest hurdle by far (at least LoC-wise) in the port was Unicode.
The code assumed that the execution character set was not only ASCII,
but ISO-8859-1, since there was a lot of casting back and forth between
Py_UNICODE and char. I added the following conversion operations to
unicodeobject.h:

#ifdef Py_CHARSET_ASCII
# define Py_UNICODE_FROM_CHAR(c) ((Py_UNICODE)(unsigned char)(c))
# define Py_UNICODE_AS_CHAR(u) ((u) < 0x80 ? (char)(unsigned char)(u) : '\0')
#else
# define Py_UNICODE_FROM_CHAR(c) _PyUnicode_FromChar(c)
# define Py_UNICODE_AS_CHAR(u) _PyUnicode_AsChar(u)
#endif

The Py_UNICODE_AS_CHAR operation maps a unicode character into a char in
the execution character set's encoding, or to '\0' if it's not
representable.

On a non-ASCII platform, I used the simplest trick of all:

/* Map from ASCII codes to the platform's execution character set, or to
   '\0' if the corresponding character is not known. */
static const char unicode_ascii_table[128] =
    "\0\0\0\0\0\0\0\a\b\t\n\v\f\r\0\0"
    "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    " !\"#$%&'()*+,-./0123456789:;<=>?"
    "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"
    "`abcdefghijklmnopqrstuvwxyz{|}~\0";

(This is reasonably portable, as all the printable ASCII characters
except `, @ and $ are required by C to be present in any source or
execution character set, and of those, Python requires all but $.)

This, and the corresponding reverse index, are good enough for all
purposes in the Python core: converting unicode string literals into
unicode objects, detecting special escape characters, and calculating
digit values. It doesn't allow writing string or unicode literals that
directly contain characters that don't exist in ASCII, though. But since
such code wouldn't be portable across character sets anyway, this isn't
much of a problem.
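
For illustration, here is how such a reverse index can be derived from
the forward table (a Python sketch of the idea; the actual C code in the
patch is its own rendering):

ascii_to_native = (
    "\0\0\0\0\0\0\0\a\b\t\n\v\f\r\0\0"
    "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    " !\"#$%&'()*+,-./0123456789:;<=>?"
    "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"
    "`abcdefghijklmnopqrstuvwxyz{|}~\0")

native_to_ascii = {}
for code, ch in enumerate(ascii_to_native):
    if ch != '\0':
        native_to_ascii[ch] = code   # native char -> ASCII code point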

I also added a Lib/encodings/cp1047.py that does proper recoding outside
the core. It was generated from jdk-1.5.0/CP1047.TXT (from
<http://haible.de/bruno/charsets/conversion-tables/CP1047.html>). This
mapping seems to correspond best to the actual conventions I have seen
on a z/OS machine.
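
With that codec installed, the recoding can be exercised from any
platform. For example:

>>> u'hello'.encode('cp1047')
'\x88\x85\x93\x93\x96'
>>> '\x88\x85\x93\x93\x96'.decode('cp1047')
u'hello'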

Now, strings and unicode seem to work together fairly well, even though
the results may be a bit surprising to anyone used to ASCII and its
extensions:

>>> ord('a')
129

Here 129 is the EBCDIC value of the letter 'a'. The unicode literal
u'a', like all textual input, is itself represented in EBCDIC:

>>> map(ord,"u'a'")
[164, 125, 129, 125]

But when such a literal is parsed, the resulting unicode object has the 
correct value for the corresponding unicode character:

>>> ord(u'a')
97

And, of course, when this unicode literal is printed back or its repr is
taken, it is again encoded to EBCDIC so it shows correctly:

>>> map(ord,repr(u'a'))
[164, 125, 129, 125]

This seems to me to be the Right Thing. Now, as long as no exotic
characters are used directly in the source, the source can be translated
between ASCII and EBCDIC so that strings and unicode strings retain
their correct semantic character values, even though the encoding of the
literals themselves is different. String objects have a
platform-dependent encoding, but unicode objects behave the same
everywhere.

One problem with this approach is that it is completely incompatible
with Python's UTF-8 support. The parser assumes that utf-8 (or latin-1)
is a superset of the platform's native encoding, and this of course
isn't true with EBCDIC.

A consequence is that the z/OS port cannot support eval of unicode
strings:

>>> eval('2+2')
4
>>> eval(u'2+2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    
    ^
SyntaxError: invalid syntax

This is because evaluation of unicode strings is implemented internally
by first encoding the unicode string as utf-8 and then trying to parse
the result, which of course fails on an EBCDIC platform.
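
The mismatch is easy to see by comparing the octets the utf-8 step
produces with the octets the EBCDIC parser would actually need (runnable
anywhere with the cp1047 codec from the patch):

>>> map(ord, u'2+2'.encode('utf-8'))
[50, 43, 50]
>>> map(ord, u'2+2'.encode('cp1047'))
[242, 78, 242]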

This seems like a rather complicated and limited way of going about it.
It would be much cleaner and more portable to first decode input into
unicode by various means, and then to parse the unicode. Then unicode
strings would be the ones that don't need any special processing. But
this would require heavy changes to Python's parsing machinery, and I
tried to keep my changes as minimal as possible for now.


PICKLING
--------

One more character set issue arose with pickling. The pickle protocols
are a bit schizophrenic in the sense that they can't quite decide
whether to be textual or binary protocols. A textual protocol should be
readable, and recodable across platforms so that semantic character
values are preserved, whereas a binary protocol should be based on
specific octet values whose readability is not an issue.

The original pickle protocol 0 can be seen either as a textual protocol
(all the pickles are readable), or a binary protocol (when characters
get mapped to their corresponding octet values in ASCII). The other
protocol versions, though extensions of protocol 0, are clearly binary,
since the pickled data is at least partially specified as specific octet
values.

Now, on an EBCDIC platform, it's impossible for protocol 0 to be
textual while remaining compatible with the other protocols. This is
because e.g. the following opcodes get the same octet value if we let
'a' be textual (i.e. encoded in the host platform's encoding):

APPEND          = 'a'   # append stack top to list below it
NEWOBJ          = '\x81'  # build object by applying cls.__new__ to argtuple
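
The clash is easy to verify with the cp1047 codec: the octet that a
textual 'a' gets on z/OS is exactly the NEWOBJ opcode:

>>> u'a'.encode('cp1047')
'\x81'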

In the end, for now, I made protocol 0 textual, and disabled support for
protocol versions > 0 on non-ASCII platforms. This seems like the safest
choice. It's certainly possible to add support for the binary protocols
and make them explicitly use ASCII, but that again would require
non-trivial changes.
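
For comparison, on an ASCII build protocol 0 output is plain readable
text, while protocol 2 embeds raw octet values such as 0x80 (expected
output shown; the exact memo opcodes may differ slightly between pickle
and cPickle):

>>> import pickle
>>> pickle.dumps([1, 2], 0)
'(lp0\nI1\naI2\na.'
>>> pickle.dumps([1, 2], 2)
'\x80\x02]q\x00(K\x01K\x02e.'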

Incidentally, modified_EncodeRawUnicodeEscape in cPickle.c seems to be
out of sync with the one in unicodeobject.c, in that it lacks support
for Py_UNICODE_WIDE. Also, both versions generate a latin-1 string as
output, which doesn't seem portable enough. My patch recodes characters
in ASCII to the execution character set, and escapes everything else,
even characters in the U+0080-U+00FF range. (Strictly speaking, all the
latin-1 characters happen to be representable in CP1047, but that is not
something I think it's good to depend on.)
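
The latin-1 flavor of the stock codec is easy to demonstrate on an
ASCII build: characters below U+0100 pass through as raw octets, and
only the rest get escaped:

>>> u'\u00e4'.encode('raw_unicode_escape')
'\xe4'
>>> u'\u0101'.encode('raw_unicode_escape')
'\\u0101'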


INTEGER PARSING
---------------

There were quite a number of places where (hex) digits were parsed
nonportably. I added the following to longobject.h, and used that:

PyAPI_FUNC(int) _PyLong_DigitValue(char c);

This resulted in some nice cleanups. From PyString_DecodeEscape:

-				unsigned int x = 0;
-				c = Py_CHARMASK(*s);
-				s++;
-				if (isdigit(c))
-					x = c - '0';
-				else if (islower(c))
-					x = 10 + c - 'a';
-				else
-					x = 10 + c - 'A';
-				x = x << 4;
-				c = Py_CHARMASK(*s);
-				s++;
-				if (isdigit(c))
-					x += c - '0';
-				else if (islower(c))
-					x += 10 + c - 'a';
-				else
-					x += 10 + c - 'A';
-				*p++ = x;
+				int xh = _PyLong_DigitValue(*s++);
+				int xl = _PyLong_DigitValue(*s++);
+				*p++ = Py_CHARMASK(xh * 16 + xl);
 				break;


OTHER ISSUES
============

Most of the other changes are boring build-technical issues and tweaks
to make things compile on z/OS's very spartan support for Unix-like
facilities. I hard-coded various #ifdef __MVS__ bits here and there to
make things compile. I guess these things should properly be checked by
configure, but I'm not very good at autoconf magic, and besides, running
configure takes _ages_ on the machine I'm using, so I wasn't inclined to
tweak the scripts any more than I had to.

The dynamic loading support in dynload_mvs.c is verbatim from Jean-Yves'
modifications. I just cleaned it up a little.

I have only tested this with --enable-shared (which does what
--with-zdll did in Jean-Yves' version, i.e. enables shared libraries).
Without shared libraries the building of extensions may well fail
because of some linkage tweaks in Lib/distutils/unixccompiler.py. I hope
there is some way of deciding what to do depending on whether shared
libraries are enabled or not.

One nasty difficulty was that the makefile implicitly assumes that
shared libraries are named libpython2.x.dll only on Windows. However,
they have that name on z/OS, too. I resolved this with a simple "case
$(MACHDEP)" in the rule for building the library, but hopefully someone
can come up with a prettier solution.

Various wrappers for external libraries are untested. It is certainly
possible to install zlib, libbz2, openssl and various other nifty
libraries on z/OS and see whether the Python wrappers work, but that is
an undertaking I will pass on, at least for now.

Quite a number of tests fail simply because they assume that strings are
encoded in ASCII. For instance, Lib/test/test_calendar.py fails because
the expected result is:

result_2004_html = """
<?xml version="1.0" encoding="ascii"?>
... """

And the real result begins with:

<?xml version="1.0" encoding="cp1047"?> ...

There were so many of these kinds of failures that there may be some
_actual_ problems amongst them that I've overlooked.

That is about all. Comments are welcome. I'd be especially interested in
hearing if my patch works on any other machine besides the one I was
using. :)

-- 
Lauri Alanko                                           Software Engineer
SSH Communications Security Corp                Mobile: +358-40-864-3037 
Valimotie 17, FI-00380, Helsinki, Finland          Tel: +358-20-500-7000
http://www.ssh.com/                                Fax: +358-20-500-7001

