[Python-Dev] Internationalization Toolkit

M.-A. Lemburg mal@lemburg.com
Thu, 11 Nov 1999 15:47:49 +0100


Guido van Rossum wrote:
> 
> > Let me tell you why you would want to have an encoding
> > which can be set:
> >
> > (1) Say I am on a Japanese Windows box, I have a
> > string called 'address' and I do 'print address'.  If
> > I see utf8, I see garbage.  If I see Shift-JIS, I see
> > the correct Japanese address.  At this point in time,
> > utf8 is an interchange format but 99% of the world's
> > data is in various native encodings.
> >
> > Analogous problems occur on input.
> >
> > (2) I'm using htmlgen, which 'prints' objects to
> > standard output.  My web site is supposed to be
> > encoded in Shift-JIS (or EUC, or Big 5 for Taiwan,
> > etc.)  Yes, browsers CAN detect and display UTF8 but
> > you just don't find UTF8 sites in the real world - and
> > most users just don't know about the encoding menu,
> > and will get pissed off if they have to reach for it.
> >
> > Ditto for streaming output in some protocol.
> >
> > Java solves this (and we could too by hacking stdout)
> > using Writer classes which are created as wrappers
> > around an output stream and can take an encoding, but
> > you lose the flexibility to 'just print'.
> >
> > I think being able to change encoding would be useful.
> >  What I do not want is to auto-detect it from the
> > operating system when Python boots - that would be a
> > portability nightmare.
> 
> You almost convinced me there, but I think this can still be done
> without changing the default encoding: simply reopen stdout with a
> different encoding.  This is how Java does it.  I/O streams with an
> encoding specified at open() are a very powerful feature.  You can
> hide this in your $PYTHONSTARTUP.

True, and it probably covers all cases where setting the
default encoding to something other than UTF-8 makes sense.

I guess you've convinced me there ;-)

The current proposal has wrappers around streams for this purpose:

For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
    short-hand for unicodec.stream(open(<filename>,<mode>),<encname>) which
    also assures that <mode> contains the 'b' character when needed.
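
To make the wrapper idea a bit more concrete, here is a rough sketch of
how stream() and the file() short-hand might behave. It is not the
proposed unicodec API: it assumes the codecs module found in later
Python releases as a stand-in, and the helper names stream and file are
only borrowed from the proposal text for illustration.

```python
import codecs
import io


def stream(fileobj, encname):
    """Hypothetical stand-in for the proposed unicodec.stream().

    Wraps a binary file-like object so that read() returns decoded
    text and write() encodes text before passing it on.
    """
    info = codecs.lookup(encname)
    return codecs.StreamReaderWriter(fileobj, info.streamreader,
                                     info.streamwriter)


def file(filename, mode, encname):
    """Hypothetical short-hand: open in binary mode, then wrap."""
    if 'b' not in mode:
        mode = mode + 'b'   # the wrapper needs raw bytes underneath
    return stream(open(filename, mode), encname)


# Any file-like object works; an in-memory buffer stands in for a file.
raw = io.BytesIO('\u6771\u4eac'.encode('utf-16'))
ufile = stream(raw, 'utf-16')
u = ufile.read()
ufile.close()
```

The short-hand in this sketch always forces binary mode, which is one
possible reading of "when needed"; a real implementation might be more
selective.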

Reopening stdin and stdout with a different encoding can then be done using:

  import sys, unicodec
  sys.stdin = unicodec.stream(sys.stdin, 'jis')
  sys.stdout = unicodec.stream(sys.stdout, 'jis')
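
A minimal sketch of what rewrapping an output stream would buy you,
again assuming the codecs module from later Python releases rather than
unicodec. The reopen() helper and the 'shift_jis' codec name are
assumptions made for the example, and an in-memory buffer stands in for
the real stdout so the encoded bytes can be inspected.

```python
import codecs
import io


def reopen(out, encname):
    """Hypothetical helper: return a writer that encodes text to out.

    If out is a text stream exposing a binary .buffer (as sys.stdout
    does in later Python versions), write to that buffer instead.
    """
    binary = getattr(out, 'buffer', out)
    return codecs.getwriter(encname)(binary)


fake_stdout = io.BytesIO()              # stand-in for sys.stdout
out = reopen(fake_stdout, 'shift_jis')  # assumed codec name

address = '\u6771\u4eac'
print(address, file=out)                # "just print", encoded on the way out

encoded = fake_stdout.getvalue()
```

In a real session one would assign the result to sys.stdout, for
example from the $PYTHONSTARTUP file, so that a plain print picks up
the encoding.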

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/