Unicode string output

M.-A. Lemburg mal at lemburg.com
Fri Jan 26 06:22:50 EST 2001


Boudewijn Rempt wrote:
> 
> On Thu, 25 Jan 2001, M.-A. Lemburg wrote:
> >
> > Well, you can manage multi-user settings easily by providing
> > a special sitecustomize.py module, so I guess that's a non-issue.
> >
> 
> Perhaps - but it is often the application, not the user that
> knows about encodings. I can easily imagine my small Chinese
> editor to work in Big5 for someone from Taiwan, but in Unicode
> for me - an application-level default, and if we use the same
> computer, a user-level default.
> 
> > The problem with having a per application setting for the
> > default encoding is that this would cause code to be written which
> > may rely on one specific default encoding. It is much better
> > to write such assumption into the code itself than to rely on
> > default settings.
> >
> 
> But setting an application-level default _is_ putting the assumption
> in the code, instead of in the system. Now you get apps that count
> on the system default being us-ascii...

We chose US-ASCII after endless discussions on python-dev. ASCII
is compatible with most of the existing encodings out there,
so it can provide a common ground for automagical conversions
such as the ones happening behind the scenes in Python. 
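
As a rough illustration of that compatibility (a sketch in
present-day Python syntax, where 8-bit strings are bytes objects;
the sample text and the list of encodings are arbitrary choices,
not anything taken from the interpreter): pure ASCII text yields
the same bytes under many common encodings.

```python
# Sample text restricted to plain ASCII letters and punctuation.
text = "Hello, world"

# A few widely used ASCII-compatible encodings (an arbitrary selection).
encodings = ["ascii", "latin-1", "utf-8", "big5", "koi8-r"]

# Encode the same text once per encoding.
encoded = {name: text.encode(name) for name in encodings}

# True if every encoding produced the same byte sequence.
all_same = len(set(encoded.values())) == 1
print(all_same)
```

Encodings such as UTF-16 are not compatible in this sense, which
is why this holds for most encodings rather than all of them.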

We had also considered two alternatives: setting the default
encoding automatically from the locale settings of the machine
running the application, or choosing Latin-1 (the first 256
Unicode code points) as the default.
 
In the end, our BDFL chose ASCII as the standard. The code for the
other two possibilities is still there, though (see site.py and
locale.py for details).

What I meant by putting the logic into the application is that
it is usually better to make assumptions explicit in the code
you write, rather than to rely on some assumption about the
system configuration (and the site*.py modules should be
considered system settings).

I have always argued for explicit conversions from Unicode
to an 8-bit string encoding. The reason is simple: relying on
magic to do the right thing is convenient, but it will fail badly
if the magic changes due to an altered system configuration.

This is fairly easy to do in Python: just write a helper
which does the conversion and use it for all Unicode<->string
conversions.
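
A minimal sketch of such a helper, written in present-day Python
syntax (Unicode strings are str, 8-bit strings are bytes). The
names APP_ENCODING, to_bytes and to_text are hypothetical, and
the choice of UTF-8 is only an assumption for the example:

```python
# Hypothetical application-wide choice; the name and value are assumptions.
APP_ENCODING = "utf-8"


def to_bytes(text, encoding=APP_ENCODING, errors="strict"):
    """Convert a Unicode string to an 8-bit string with an explicit encoding."""
    if isinstance(text, bytes):
        return text  # already an 8-bit string, pass through unchanged
    return text.encode(encoding, errors)


def to_text(data, encoding=APP_ENCODING, errors="strict"):
    """Convert an 8-bit string to a Unicode string with an explicit encoding."""
    if isinstance(data, str):
        return data  # already Unicode, pass through unchanged
    return data.decode(encoding, errors)


# The encoding is visible at the call site instead of hidden in site*.py.
raw = to_bytes("Gr\u00fc\u00dfe", "latin-1")
print(to_text(raw, "latin-1") == "Gr\u00fc\u00dfe")
```

With every conversion routed through one place, a change to the
system default cannot silently alter what the application does.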

> > I agree though, that we will need to put some more work into
> > making streams and other interfaces more Unicode aware and to
> > simplify those interfaces. In theory you could e.g. redirect
> > stdin and stdout through codecs, but in practice this currently
> > doesn't work well together with "print" due to the complicated
> > machinery in the Python interpreter.
> >
> 
> One other example is PyQt - the conversion between QString (Unicode
> in itself) and Python strings goes via the default codec, meaning
> us-ascii for most installations. Again, technical details seem
> to make it impossible or at least difficult to alter that, but setting
> an application level default makes it easy to work with.

This is a problem with PyQt, not with the Python Unicode
implementation. Extensions which apply these conversions can
easily choose different conversions to their liking. I have made
the Unicode API in Python very flexible in this direction for
exactly this reason.

Please write to the PyQt maintainers about this problem.

> > The same goes for making the Python standard lib more Unicode
> > aware. There's still a lot to do ... we could need some sponsors :)
> >
> 
> I'd help if I could, but currently I'm a bit overworked as it is,
> with completing Kura, writing something publicable on Jython, and
> of course, coding Java for a living. On the whole, though, I'm pretty
> content with the Python Unicode support - and I've never had a
> problem with my sitecustomize.py file and the saved copy of
> setdefaultencoding. But then, I don't switch encodings during the
> lifetime of the application.

... and that's what we had in mind when making the API available
in site.py only. 

Perhaps we should think about easing the restriction a bit
to make the setting a one-shot operation: you could then call
sys.setdefaultencoding() at most once in any run of a Python
application and then only before using Unicode in the application.
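
To make the idea concrete, here is a rough sketch of one-shot
semantics in present-day Python syntax. It does not show how
sys.setdefaultencoding() works or would work; the class
OneShotDefaultEncoding and its methods are hypothetical, and
treating the first read as "using Unicode" is an assumption made
for the example:

```python
import codecs


class OneShotDefaultEncoding:
    """Hypothetical setting that may be changed at most once, before first use."""

    def __init__(self, initial="ascii"):
        self._encoding = initial
        self._locked = False

    def set(self, encoding):
        """Change the default; allowed once and only before the first get()."""
        if self._locked:
            raise RuntimeError("default encoding is already fixed")
        codecs.lookup(encoding)  # raises LookupError for unknown encodings
        self._encoding = encoding
        self._locked = True

    def get(self):
        """Return the default; reading it freezes the setting (an assumption)."""
        self._locked = True
        return self._encoding


default = OneShotDefaultEncoding()
default.set("latin-1")   # first call succeeds
print(default.get())
```

A second call to set(), or a first call made after get(), raises
RuntimeError in this sketch.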

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



