I like Unicode more than I used to...

Alex Martelli aleaxit at yahoo.com
Tue Feb 25 02:00:18 EST 2003


On Tuesday 25 February 2003 04:47 am, SUZUKI Hisao wrote:
> In message <DHo6a.224849$0v.6336159 at news1.tin.it>, Alex Martelli wrote:
> > gabor wrote:
> >    ...
> >
> > > is there a way to specify in python some kind of default-encoding? i
> > > mean can i somehow tell him that when printing unicode strings, i
> > > always want to use utf-8, so that .encode('utf-8') isn't necessary?
> >
> > Yes, on a site-wide basis: see the file site.py in your site-packages
> > directory.  You can either change the blocks guarded by "if 0:" in
> > that file itself, or add a sitecustomize.py file in the same
> > directory that calls sys.setdefaultencoding('utf-8') with the
> > same effect.
>
> The file "site.py" is in "../site-packages" while you can put your
> "sitecustomize.py" in "site-packages".

Not sure I understand what you mean.  What's the leading .. in
that "../site-packages" you're claiming site.py is in?!


> > Of course, if you do that you're likely to write programs that
> > will only run on your site (or other similarly customized) and
> > not be suitable for general distribution to other sites.  But
> > if that's OK with you, Python lets you do it.
>
> I wonder whether you said that from actual experience, if you don't
> mind my saying so.  My experience suggests otherwise.

It's from (modest) actual experience: typically, programs that assumed
'latin-1' (well, I _am_ from Western Europe after all) because they relied
on site.py, and then failed to run on sites (e.g. in Eastern Europe) that
had not tampered with site.py or had other iso-8859-* variants set in it.
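
A small sketch of that failure mode, in Python 3 syntax (an assumption
for illustration; the snippets in this thread are Python 2).  The sample
bytes are made up: the point is only that one byte string reads as
different text under two iso-8859-* variants.

```python
# Python 3 syntax for illustration; the sample bytes are hypothetical.
raw = b"ma\xf1ana"  # bytes written by a program that assumed Latin-1

as_latin1 = raw.decode("latin-1")     # reads back as the writer intended
as_latin2 = raw.decode("iso-8859-2")  # same bytes, decoded at another site

# ascii() keeps the output printable on any terminal encoding.
print(ascii(as_latin1), ascii(as_latin2))
print(as_latin1 == as_latin2)
```

Neither decode raises an error, so a mismatch like this tends to show up
as silently wrong text rather than as a traceback.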


> For example, what if I write my program as follows?
>
>    u = s.decode('euc-jp')
>    ...(some work)...
>    print u.encode('euc-jp')
>
> Perhaps you cannot use it.  What if I write it as follows?

Depending on what s is, i.e. where it comes from, this should work
anywhere.  Of course, this depends on the contents of s BEING
encoded as euc-jp in the first place.  If you don't know when you
write your code what encoding may be in use, the right approach
is of course:
    u = s.decode(encoding)
and so on, where encoding is a module global variable that you
set on startup (e.g. from a configuration file, environment, ...) in
a way that's compatible with the way s is sourced/built up.
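
A minimal sketch of that arrangement, in Python 3 syntax (so s is a
bytes object).  The environment variable name MYAPP_ENCODING, the
'utf-8' fallback and the helper names are all hypothetical, chosen
only for illustration.

```python
import codecs
import os


def pick_encoding(environ=os.environ, default="utf-8"):
    """Return the codec name to use, failing early if it is unknown.

    MYAPP_ENCODING and the 'utf-8' default are illustrative assumptions.
    """
    name = environ.get("MYAPP_ENCODING", default)
    codecs.lookup(name)  # raises LookupError for an unknown codec name
    return name


# Module global, set once on startup.
encoding = pick_encoding()


def process(s):
    """Decode, do some work on the text, and encode with the same codec."""
    u = s.decode(encoding)
    u = u.upper()  # stand-in for "(some work)"
    return u.encode(encoding)
```

Validating the name at startup means a typo in the setting fails
immediately, not at the first decode deep inside the program.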


>    u = s.decode()
>    ...(some work)...
>    print u.encode()
>
> Perhaps you can use it for your data now, if you set your "site.py"
> appropriately.

This makes the (too-strong) hypothesis that ALL programs I run
on my site build/source all their "s"'s the same way.


> It is important that you do not guess what encoding your users choose
> if you are going to distribute your program to other sites.  Relying
> on the default is a fairly good practice for many applications.  (Of

Here, clearly, is where we disagree.

> course, you should test your program under the 'ascii' environment at
> least before you distribute it.)

This will fortunately break many incorrect programs, but, alas, not all.
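
One way to read "not all", sketched in Python 3 syntax: the test only
catches the problem if non-ASCII data actually flows through the code
under test.  DEFAULT below is a stand-in for the site-wide default, and
roundtrip is a hypothetical helper, not anything from the thread.

```python
# Stand-in for the interpreter-wide default encoding under test.
DEFAULT = "ascii"


def roundtrip(raw):
    """Decode and re-encode with the default, as s.decode() would."""
    return raw.decode(DEFAULT).encode(DEFAULT)


# ASCII-only test data passes, so the reliance on the default goes unnoticed.
print(roundtrip(b"hello"))

# The same code breaks only once non-ASCII bytes reach it.
try:
    roundtrip(b"ma\xf1ana")
except UnicodeDecodeError:
    print("non-ASCII input breaks the program")
```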


> The next-best way is to provide an ad hoc way to customize the
> encoding.  For example, start your program as follows:
>
>    ENCODING = 'euc-jp'   # Please replace 'euc-jp' by your encoding
>    def dec(s): return s.decode(ENCODING)
>    def enc(u): return u.encode(ENCODING)
>    ...
>    u = dec(s)
>    ...(some work)...
>    print enc(u)
>
> It is sometimes necessary if you allow users to use ASCII-incompatible
> encodings.  The drawback is that it is 'ad hoc' and generally
> inconsistent across programs.

Which is why it's better to set the codec name by other means, just as
other controlling parameters of your program are set -- configuration
files, environment variables, i18n mechanisms (gettext and the like), and
so on.  It's important that the codec name is set in a way that's
coordinated with the way you build or source your strings; therefore,
setting that name as a literal in your code is only appropriate when the
strings themselves also come from literals in your code (a frequent
practice, but hardly one that is conducive to good i18n...).
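
A sketch of the configuration-file variant, in Python 3 syntax.  The
section and option names, the sample file contents and the 'ascii'
fallback are hypothetical; the idea is that the data source and its
codec are configured side by side, so they stay coordinated.

```python
import configparser

# Hypothetical configuration: the input file and its codec live together.
SAMPLE_CONFIG = """
[input]
path = data.txt
encoding = euc-jp
"""


def read_input_settings(text):
    """Return (path, encoding) from configuration text.

    Falls back to 'ascii' when no encoding is given (an assumption here).
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["input"]
    return section["path"], section.get("encoding", fallback="ascii")


path, encoding = read_input_settings(SAMPLE_CONFIG)
print(path, encoding)
```

The program would then read the bytes from path and decode them with
encoding, so that no codec name appears as a literal in the code.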


Alex





