I like Unicode more than I used to...

Tue Feb 25 05:43:55 EST 2003

In message <200302250800.18315.aleaxit at yahoo.com>, Alex Martelli wrote:
> > > Yes, on a site-wide basis: see the file site.py in your site-packages
> > > directory.  You can either change the blocks guarded by "if 0:" in
> > > that file itself, or add a sitecustomize.py file in the same
> > > directory that calls sys.setdefaultencoding('utf-8') with the
> > > same effect.

You must just have mistyped...
The file site.py is not in your site-packages directory.
It is in the parent directory of your site-packages directory.
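
For what it's worth, here is a small sketch for checking this on your
own installation.  The helper name site_locations is only illustrative,
and the layout is an assumption: some distributions rename or relocate
site-packages, so the second path may not exist on your system.

```python
import os
import site

def site_locations():
    """Return (directory holding site.py, expected site-packages path)."""
    # site.__file__ points at the site module that the interpreter loaded.
    site_dir = os.path.dirname(os.path.abspath(site.__file__))
    # Assumption: site-packages sits directly below that directory.
    return site_dir, os.path.join(site_dir, 'site-packages')

site_dir, packages_dir = site_locations()
print(site_dir)
print(packages_dir)
```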

> > The file "site.py" is in "../site-packages" while you can put your
> > "sitecustomize.py" in "site-packages".
> 
> Not sure I understand what you mean.  What's the leading .. in
> that "../site-packages" you're claiming site.py is in?!

The leading '..' means the parent directory, you know.

> > I wonder you said it from your actual experiences if you don't mind me
> > saying so.  My experiences tell me another way.
> 
> It's from (modest) actual experience: typically programs assuming 'latin-1'
> (well I _am_ from Western Europe after all) because of reliance on site.py,
> failing to run on sites (e.g. in Eastern Europe) that had not tampered
> with site.py or had other iso-8859-* variants set in site.py.

I see.  I'm afraid your experience is rather limited.  In the
current Python, latin-1 is the only encoding in which you can
write Unicode literals directly.

> > For example, what if I write my program as follows?
> >
> >    u = s.decode('euc-jp')
> >    ...(some work)...
> >    print u.encode('euc-jp')
> >
> > Perhaps you cannot use it.  What if I write it as follows?
> 
> Depending on what is s, i.e. where it comes from, this should work
> anywhere.  

No, regrettably the 'euc-jp' (Extended Unix Code-Japan) codec is
not provided by default :-(
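
You can check this before relying on a codec.  A minimal sketch, using
the standard codecs.lookup(), which raises LookupError for an unknown
name; the helper name have_codec is my own invention here:

```python
import codecs

def have_codec(name):
    """Return True if a codec is registered under the given name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True

# Report availability for the two encodings discussed in this thread.
for name in ('latin-1', 'euc-jp'):
    print('%s: %s' % (name, have_codec(name)))
```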

> Of course, this depends on the contents of s BEING
> encoded as euc-jp in the first place.  If you don't know when you
> write your code what encoding may be in use, the right approach
> is of course:
>     u = s.decode(encoding)
> and so on, where encoding is a module global variable that you
> set on startup (e.g. from a configuration file, environment, ...) in
> a way that's compatible with the way s is sourced/built up.

You just repeat what I have described more elaborately, don't you?

> > The next to best way is to prepare an ad hoc way to customize the
> > encoding.  For example, start your program as follows:
> >
> >    ENCODING = 'euc-jp'   # Please replace 'euc-jp' by your encoding
> >    def dec(s): return s.decode(ENCODING)
> >    def enc(u): return u.encode(ENCODING)
> >    ...
> >    u = dec(s)
> >    ...(some work)...
> >    print enc(u)

And this approach often suffers from a lack of consistency across
different programs and libraries.

By the way, are you likely to work with euc-jp data every day?
I'd think the 'default' is there to keep daily work simple.
Thus
> > Relying
> > on the default is a fairly good practice for many applications.

Anyway, relying on the default is the same as writing, say,

    u = s.decode(encoding)

where encoding = sys.getdefaultencoding().  There is actually no
essential difference.  So keep it simple for daily
work.

Of course, if the code is specific to particular encodings,
then explicitness is preferable.
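
To illustrate, here is a sketch of a helper that is implicit by default
and explicit on request.  The name to_unicode is hypothetical, and s is
assumed to be a byte string:

```python
import sys

def to_unicode(s, encoding=None):
    """Decode byte string s, using the site default unless told otherwise."""
    if encoding is None:
        # Implicit case: whatever default this installation is set to.
        encoding = sys.getdefaultencoding()
    return s.decode(encoding)

# Everyday data: rely on the default.
u1 = to_unicode(b'abc')
# Encoding-specific code: name the encoding explicitly.
u2 = to_unicode(b'abc', 'latin-1')
```

The explicit form does not depend on how the site is configured, which is
the point when the data is known to be in one particular encoding.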

> It's
> important that the codec name is set in a way that's coordinated with
> the way you build or source your strings, therefore setting that name
> as a literal in your code is only appropriate when the strings' source is
> also from literals in your code (a frequent practice, but hardly one that
> is conducive to good i18n...).

Yes.  
In Python 2.2.*, Unicode literals are taken as latin-1.  For
any other encoding, a literal must be decoded with an explicit
encoding name to become Unicode:

   u = "(suppose this is japanese)".decode('euc-jp')

In this case you should not rely on the default if you are
going to distribute the code elsewhere.  For conciseness,
you can write:

   def e2u(s): return s.decode('euc-jp')
   ...
   u = e2u("(suppose this is japanese)")

In Python 2.3a2, you can also write:

   # -*- coding: euc-jp -*-
   ...
   u = u"(suppose this is japanese)"

But note that if you have hard-coded messages in your language, they
are often unreadable in other countries (because of font settings etc.).
You may argue that there is no use in being explicit in such a case...

And if you write your code to work internationally,
implicitness is preferable.  With the default set in site.py (or
sitecustomize.py), it works on the everyday data of each language.
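
As a rough sketch of the sitecustomize.py route mentioned earlier in the
thread: site.py removes sys.setdefaultencoding once start-up is over, so
the call is guarded.  The function name set_site_default and the choice
of 'utf-8' are only examples; use your site's own encoding.

```python
# sitecustomize.py (sketch)
import sys

def set_site_default(name):
    """Try to set the site-wide default encoding; return True on success."""
    setter = getattr(sys, 'setdefaultencoding', None)
    if setter is None:
        # Not available: start-up has finished, or this interpreter
        # does not offer the function at all.
        return False
    setter(name)
    return True

changed = set_site_default('utf-8')
```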

Thus,
be explicit in country-specific code
  though it is sometimes useless to be so;
be implicit in universal code.

-- SUZUKI Hisao