I like Unicode more than I used to...

Alex Martelli aleaxit at yahoo.com
Tue Feb 25 06:13:52 EST 2003


On Tuesday 25 February 2003 11:43 am, SUZUKI Hisao wrote:
> In message <200302250800.18315.aleaxit at yahoo.com>, Alex Martelli wrote:
> > > > Yes, on a site-wide basis: see the file site.py in your site-packages
> > > > directory.  You can either change the blocks guarded by "if 0:" in
> > > > that file itself, or add a sitecustomize.py file in the same
> > > > directory that calls sys.setdefaultencoding('utf-8') with the
> > > > same effect.
>
> You must just have mistyped...
> The file site.py is not in your site-packages directory.
> It is in the parent directory of your site-packages directory.

Yes, you're right.


> > > The file "site.py" is in "../site-packages" while you can put your
> > > "sitecustomize.py" in "site-packages".
> >
> > Not sure I understand what you mean.  What's the leading .. in
> > that "../site-packages" you're claiming site.py is in?!
>
> The leading '..' means the parent directory, you know.

Yes, and ../site-packages means "the directory site-packages that
is a ``sibling'' of this one" (i.e., which has the same parent).  In no
OS I know does "../site-packages" mean "the parent directory of
site-packages" -- so you meant site-packages/.. and you mistyped
too, I assume.
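
A minimal sketch of the distinction, using posixpath. The
/usr/lib/python2.2 prefix is only an assumed example location, not
taken from any particular installation:

```python
import posixpath

# Assumed, illustrative location of site-packages (varies by installation).
site_packages = "/usr/lib/python2.2/site-packages"

# "site-packages/.." is the PARENT of site-packages (where site.py lives).
parent = posixpath.normpath(posixpath.join(site_packages, ".."))

# "../site-packages", resolved from inside site-packages, names the
# directory called site-packages under the same parent: here, itself.
sibling = posixpath.normpath(posixpath.join(site_packages, "../site-packages"))

print(parent)
print(sibling)
```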


> > > I wonder you said it from your actual experiences if you don't mind me
> > > saying so.  My experiences tell me another way.
> >
> > It's from (modest) actual experience: typically programs assuming
> > 'latin-1' (well I _am_ from Western Europe after all) because of reliance
> > on site.py, failing to run on sites (e.g. in Eastern Europe) that had not
> > tampered with site.py or had other iso-8859-* variants set in site.py.
>
> I see.  I'm afraid you just have a limited experience.  The
> latin-1 encoding is the only encoding in which you can write
> Unicode literals directly in the current Python.

Assuming you mean 2.2 (2.3 has an explicit control of encoding, no?),
who said anything about "writing Unicode literals directly" as opposed
to writing a string literal (encoded in whatever way) and then calling
decode on it?


> > > For example, what if I write my program as follows?
> > >
> > >    u = s.decode('euc-jp')
> > >    ...(some work)...
> > >    print u.encode('euc-jp')
> > >
> > > Perhaps you cannot use it.  What if I write it as follows?
> >
> > Depending on what is s, i.e. where it comes from, this should work
> > anywhere.
>
> No, 'euc-jp' (Extended Unix Code-Japan) codec is not provided by
> default regrettably :-(

You're right -- it works with my installation of Python 2.3a2 on Linux, 
but not with that of 2.2.2, nor with 2.3a2 on Windows-98.
So, we need to correct "anywhere" into "anywhere with a sufficiently
advanced release of Python (or the needed add-ons installed)".

By the same token, it wouldn't work on ANY older Python that did not
support the decode method of string objects, for example.  So, such
kinds of qualifications are always needed.
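
One way to make that qualification checkable at run time, rather than
assuming a codec is installed, might look like the sketch below. The
helper name codec_available is made up for this illustration; only
codecs.lookup and LookupError come from the standard library:

```python
import codecs

def codec_available(name):
    """Return True if this Python installation can look up the named codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        # Raised when no codec is registered under that name.
        return False
    return True

# The result for 'euc-jp' depends on the release and installed add-ons.
print(codec_available("euc-jp"))
```

A program could call such a check at startup and report a clear error
instead of failing later in the middle of its work.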


> > Of course, this depends on the contents of s BEING
> > encoded as euc-jp in the first place.  If you don't know when you
> > write your code what encoding may be in use, the right approach
> > is of course:
> >     u = s.decode(encoding)
> > and so on, where encoding is a module global variable that you
> > set on startup (e.g. from a configuration file, environment, ...) in
> > a way that's compatible with the way s is sourced/built up.
>
> You just repeat what I have described more elaborately, don't you?

I saw nothing about configuration files &c in your mail.


> By the way, are you likely to work with euc-jp data everyday?

The encoding I had to use last time I prepared an app for use in 
Japan was SJIS, not EUC -- but, what difference does it make?

> I'd think the 'default' is there to keep the daily work simple.

As long as you're aware, exactly as I said right at the start, that
this is likely to give problems when you prepare packages to be
distributed to other sites.

> Thus
>
> > > Relying
> > > on the default is a fairly good practice for many applications.

If you keep repeating yourself, I guess so can I:

As long as you're aware, exactly as I said right at the start, that
this is likely to give problems when you prepare packages to be
distributed to other sites.

If you want, I can copy and paste this a few more times.


> Anyway relying on the default is the same as you write, say,
>
>     u = s.decode(encoding)
>
> where encoding = sys.getdefaultencoding().  There is no
> essential difference actually.  So keep it simple for daily

There is a huge difference: you, as author of packages or
even applications, do not and cannot control what encoding
is set on a sitewide basis at other sites to which you may
want to distribute your packages or applications.

When your package or application uses an explicit codec
name, as in the above example, then you DO control how
that name is determined.

Thus, using an explicit codec name (and providing appropriate
means to set it, of course) makes your packages, or
applications, less problematic for distribution to other sites
that you do not control.

Obviously, if you want, you can perfectly well _default_ to
    encoding = sys.getdefaultencoding()
if the configuration file, i18n mechanisms, or yet other means
do not indicate that the user needs or wants to use
some other codec this time.  But forcing all user sites to
setdefaultencoding to the codec they need to use in order
to utilize YOUR application or package is unwise and is
likely to give you problems down the road.
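
The approach described above can be sketched roughly as follows. This
is an illustration under stated assumptions, not a definitive
implementation: the environment variable name MYAPP_ENCODING and the
helper names are hypothetical, and the code is written for a current
Python, where the data to decode is a bytes object rather than the
plain string of 2.2/2.3:

```python
import sys

# Hypothetical environment variable name, used only for this illustration.
ENCODING_VAR = "MYAPP_ENCODING"

def resolve_encoding(environ):
    """Pick the codec name: explicit setting first, site default last."""
    return environ.get(ENCODING_VAR) or sys.getdefaultencoding()

def to_unicode(raw, encoding):
    """Decode raw bytes with an explicitly chosen codec."""
    return raw.decode(encoding)

# Module-global codec name, set once at startup.
encoding = resolve_encoding({ENCODING_VAR: "latin-1"})

u = to_unicode(b"caf\xe9", encoding)
# ...(some work)...
print(u.encode(encoding) == b"caf\xe9")
```

A real application would pass os.environ (or values read from its
configuration file) to resolve_encoding; a plain dict is used here so
that the choice of codec stays visibly under the application's control.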


> Of course, if the code is specific to some particular encodings
> then the explicitness is preferable.

There will always be "specificity", depending on the way the
strings are sourced or built.  I've snipped the rest of your mail,
on which we agree, about literals using specified encodings in
your sources (by whatever means).  But the crux of our abiding
disagreement is: I do not think relying on the default encoding
on sites you do not control is ever appropriate.  And there is
no case of "universality" which would make it appropriate.


Alex





