[Python-Dev] Encoding of 8-bit strings and Python source code

M.-A. Lemburg mal@lemburg.com
Tue, 25 Apr 2000 11:43:46 +0200



After the discussion about #pragmas two weeks ago and some
interesting ideas in the direction of source code encodings
and ways to implement them, I would like to restart the
discussion about encodings in source code and runtime
auto-conversions.

Fredrik recently posted patches to the patches list which
loosen the currently hard-coded default encoding used throughout
the Unicode design and add a layer of abstraction which would
make it easily possible to change the default encoding at some
later point. While making things more abstract is certainly
a wise thing to do, I am not sure whether this particular
case fits into the design decisions made a few months ago.

Here's a short summary of what was discussed recently:

1. Fredrik posted the idea of changing the default encoding
from UTF-8 to Latin-1 (he calls this 8-bit Unicode, which
points to the motivation behind it: 8-bit strings should
behave like 8-bit Unicode). His recent patches work in
this direction.

2. Fredrik also posted an interesting idea which enables
writing Python source code in any supported encoding by
having the Python tokenizer read Py_UNICODE data instead
of char data. A preprocessor would take care of converting
the input to Py_UNICODE; the parser would ensure that
8-bit string data gets converted back to char data (using
e.g. UTF-8 or Latin-1 for the encoding).

3. Regarding the addition of pragmas to allow specifying
the source code encoding, several possibilities were
mentioned:
- addition of a keyword "pragma" to define pragma dictionaries
- usage of "global" as the basis for this
- adding a new keyword "decl" which also allows defining other
  things such as type information
- XML like syntax embedded into Python comments

Some comments:

Ad 1. UTF-8 is used as the basis in many other languages such
as Tcl or Perl. It is not an intuitive way of
writing strings and causes problems due to one character
spanning 1-6 bytes. Still, the world seems to be moving
in this direction, so going the same way can't be all
wrong... Note that stream IO can be recoded in a way
which allows Python to print and read e.g. Latin-1
(see below). The general idea behind the fixed default
encoding design was to give all the power to the user,
since she eventually knows best which encoding to
use or expect.
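
To make the variable-width point concrete, here is a minimal sketch
that reports how many bytes a few characters occupy in UTF-8 and,
where possible, in Latin-1. It assumes a current Python 3 interpreter
rather than the code base under discussion; the helper name
encoded_width and the sample characters are made up for illustration.
Note that today's UTF-8 definition stops at 4 bytes per character,
while the 1-6 range quoted above reflects the original definition.

```python
def encoded_width(char, encoding):
    """Return the number of bytes char occupies in encoding, or None
    if the encoding cannot represent the character."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None

# Hypothetical samples: ASCII, Latin-1 range, rest of the BMP, beyond the BMP.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

if __name__ == "__main__":
    for char in SAMPLES:
        print("U+%04X utf-8=%s latin-1=%s" % (
            ord(char),
            encoded_width(char, "utf-8"),
            encoded_width(char, "latin-1"),
        ))
```

Latin-1 always needs exactly one byte but covers only the first 256
code points; UTF-8 covers everything at the price of a width that
depends on the character.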

Ad 2. I like this idea because it enables writing Unicode-
aware programs *in* Unicode... the only problem which remains
is again the encoding to use for the classic 8-bit strings.

Ad 3. For 2. to work, the encoding declaration would have to
appear close to the top of the file. The preprocessor would have
to be BOM aware to tell whether UTF-16 or some ASCII
extension is used by the file.
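
A rough sketch of what such a BOM check could look like, assuming a
current Python 3 interpreter. The function names sniff_source_encoding
and decode_source, and the choice of Latin-1 as the fallback "ASCII
extension", are assumptions made for this illustration and are not
part of any proposal in this thread.

```python
import codecs

# (BOM bytes, encoding name) pairs. UTF-16 is what the text mentions; the
# UTF-8 signature is an extra assumption of this sketch. UTF-32 is ignored.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_source_encoding(raw, default="latin-1"):
    """Return (encoding, payload) for raw source bytes.

    If raw starts with a known BOM, the matching encoding is returned and
    the BOM is stripped from the payload; otherwise default is assumed.
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding, raw[len(bom):]
    return default, raw

def decode_source(raw, default="latin-1"):
    """Decode raw source bytes to a Unicode string for the tokenizer."""
    encoding, payload = sniff_source_encoding(raw, default)
    return payload.decode(encoding)
```

A file without a BOM still needs some other hint, such as a pragma
near the top of the file, to tell one ASCII extension from another.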

Guido asked me for some code which demonstrates Latin-1
recoding using the existing mechanisms. I've attached
a simple script to this mail. It has not been tested much yet,
so please give it a try.

You can also change it to use any other encoding you like.
Together with the Japanese codecs provided by Tamito Kajiyama
(http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/tmp/japanese-codecs.tar.gz)
you should be able to type Shift-JIS at the raw_input()
or interactive prompt, have it stored as UTF-8 and then
printed back as Shift-JIS, provided you add a recoder
similar to the attached one for Latin-1 to your
PYTHONSTARTUP or site.py script.
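
The round trip described above can be sketched as follows, assuming a
current Python 3 interpreter, where a Shift-JIS codec ships with the
standard library and codecs.EncodedFile builds a StreamRecoder. The
helper name make_recoder and the in-memory streams standing in for a
terminal are made up for this illustration.

```python
import codecs
import io

def make_recoder(stream, external_encoding, internal_encoding="utf-8"):
    """Wrap a byte stream so that callers read and write internal_encoding
    bytes while the stream itself carries external_encoding bytes."""
    return codecs.EncodedFile(
        stream,
        data_encoding=internal_encoding,
        file_encoding=external_encoding,
    )

# Writing: UTF-8 goes in, Shift-JIS comes out on the underlying stream.
terminal_out = io.BytesIO()
make_recoder(terminal_out, "shift_jis").write("\u3042".encode("utf-8"))

# Reading: Shift-JIS on the underlying stream is handed back as UTF-8.
terminal_in = io.BytesIO("\u3042".encode("shift_jis"))
typed = make_recoder(terminal_in, "shift_jis").read()
```

Passing "latin-1" instead of "shift_jis" gives the Latin-1 case.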

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
Attachment: latin1io.py

""" Redirect sys.std[in|out|err] to have them use Latin-1 as
    encoding.

    Marc-Andre Lemburg, 2000-04-25.

"""#"

import codecs,sys,types

class Latin1IO(codecs.StreamRecoder):

    """ Latin-1 Recoder.

        Translates streams encoded in Latin-1 to UTF-8. The Python
        interface will return UTF-8 encoded strings and will accept
        both Unicode and UTF-8 encoded strings as input.

    """
    def __init__(self,stream,errors='strict'):

        """ Creates a Latin1IO instance.

            stream must be a file-like object.

            Error handling is done in the same way as defined for the
            codecs.StreamWriter/Readers.

        """
        self.stream = stream
        self.errors = errors

        # Stream backend should translate Unicode <-> Latin-1
        (Reader,Writer) = codecs.lookup('latin-1')[2:4]
        self.reader = Reader(stream, errors)
        self.writer = Writer(stream, errors)

        # Interface frontend should translate UTF-8 <-> Unicode
        (encode,decode) = codecs.lookup('utf-8')[0:2]
        self.encode = encode
        self.decode = decode

    def write(self,data):

        if type(data) is not types.UnicodeType:
            data, bytesdecoded = self.decode(data, self.errors)
        return self.writer.write(data)

    def writelines(self,list):

        # Join the lines first; an 8-bit result is taken to be UTF-8
        # and is decoded to Unicode before being written as Latin-1
        data = ''.join(list)
        if type(data) is not types.UnicodeType:
            data, bytesdecoded = self.decode(data, self.errors)
        return self.writer.write(data)

if __name__ == '__main__':
    # Redirect all standard IO streams
    sys.stdin = Latin1IO(sys.stdin)
    sys.stdout = Latin1IO(sys.stdout)
    sys.stderr = Latin1IO(sys.stderr)
