[Python-Dev] Python in Unicode context

Tue Aug 3 21:35:16 CEST 2004

[Martin von Löwis]

> François Pinard wrote:

> > maybe some kind of `module.__coding__' next to `module.__file__',
> >saving the coding effectively used while compilation was going on.

> That would be possible to implement.  Feel free to create a patch.

I might try, and it would be my first Python patch.  But please, please
tell me if the idea is not welcome, as my free time is rather short and
I already have a lot of things waiting for me! :-).
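
As a rough illustration only: the sketch below is not the proposed
`module.__coding__' attribute, and it assumes a later Python 3, where
`tokenize.detect_encoding' exists.  The helper name `source_coding' is
hypothetical.  It merely shows that the coding declared in a source can
be recovered after the fact, which is the information such an attribute
would save.

```python
import io
import tokenize


def source_coding(source_bytes):
    """Return the encoding announced by the coding cookie of a source.

    Hypothetical helper, for illustration: it re-reads the source bytes
    instead of remembering what the compiler actually used.
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


with_cookie = b"# -*- coding: iso-8859-15 -*-\nname = 'x'\n"
without_cookie = b"name = 'x'\n"

print(source_coding(with_cookie))
print(source_coding(without_cookie))   # falls back to the default coding
```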

> >I wonder if some other cookie, next to the `coding:' cookie, could
> >not be used to declare that all strings _in this module only_ should
> >be interpreted as Unicode by default, but without the need of
> >resorting to `u' prefix all over.

> [...] if you know a syntax which you like, propose a patch.  Be
> prepared to also write a PEP defending that syntax.

Surely there is no particular syntax that I like enough to defend.
Anything reasonable would do as far as I am concerned, so I might
propose a reasonable patch without involving myself in a crusade.
Yet I may try to assemble and edit together the ideas of others, if it
serves a purpose.

> >Right now, my feeling is that Python asks a bit too much of a
> >programmer, in terms of commitment, if we only consider the editing
> >work required on sources to use it, or not.

> Not sure what you are referring here to.

There is currently a lot of effort involved in Python so that Unicode
strings and usual strings inter-operate correctly and automatically,
also hiding as much as is reasonable from the unwilling user whether
characters are wide or narrow: s/he uses about the same code no matter
what.  The way Python does it is rather lovely, in fact. :-)

I'm going to transform a flurry of Latin-1 Python scripts to UTF-8, but
not all of them, as I'm not going to impose Unicode in our team where
it is not wanted.  For French, German and many others, we have been
lucky enough to have one codepoint per character in Unicode, so we can
hope that programs assuming that S[N] addresses the N'th (0-based)
character of string S will work the same way regardless of whether
strings are narrow or wide.  However, and I shall have the honesty to
state it, this is *not* respectful of the general Unicode spirit: the
Python implementation allows for independently addressable surrogate
halves, combining zero-width diacritics, normal _and_ decomposed forms,
directional marks, linguistic marks and various other such complexities.
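
Two of these complexities can be sketched briefly.  This assumes a later
Python 3, where `str' is always Unicode; the surrogate pair is shown
through a UTF-16 encoding, since that is the unit a narrow build indexes.

```python
import unicodedata

composed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

# Both display as the same character, yet lengths and indexing differ.
print(len(composed), len(decomposed))
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# A character outside the Basic Multilingual Plane needs two 16-bit units
# (a surrogate pair) in UTF-16, each half being separately addressable.
clef = "\U0001D11E"        # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2
print(utf16_units)
```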

But in our case, where applications already work in Latin-1, abusing our
Unicode luck, UTF-8 may _not_ be used as is: we ought to use Unicode or
wide strings as well, to preserve S[N] addressability.  So changing
source encodings may be intimately tied to going Unicode whenever UTF-8
(or any other variable-length encoding) gets into the picture.
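
A minimal sketch of the S[N] problem, under the assumption that Python 3
`bytes' stands in for a narrow string and `str' for a Unicode string.
The sample text is made up, and is written with escapes so that the
example does not itself depend on the encoding of this message.

```python
text = "d\xe9j\xe0 vu"                # 7 characters, two of them accented

latin1 = text.encode("iso-8859-1")    # narrow string, one byte per character
utf8 = text.encode("utf-8")           # narrow string, accents take two bytes

# Latin-1: the N'th byte is the N'th character.
print(len(latin1), latin1[1:2] == b"\xe9")

# UTF-8 kept narrow: the N'th byte is no longer the N'th character.
print(len(utf8), utf8[1:2] == b"\xc3")   # only the first half of the accent

# Unicode string: S[N] addressability is preserved.
print(len(text), text[1] == "\xe9")
```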

> You do have the choice of source encodings, and, in fact, "Unicode"
> is not a valid source encoding.  "UTF-8" is [...]

Guess that I know! :-) :-)

> [...] from a Python point of view, there is absolutely no difference
> between [UTF-8] and, say, "ISO-8859-15".  Choice of source encoding
> is different from the choice of string literals.  You can use Unicode
> strings, or byte strings, or mix them.  It really is your choice.

I hope that my explanation above helps in seeing that source encoding
and choice of string literals are not as independent as one may think.
What I surely can _not_ afford is to see bugs appear in programs
merely because I changed the source encoding.  Going from ISO 8859-1 to
ISO 8859-15 for a Python source is probably fairly safe, because there
is no need for switching the narrowness of strings.  Going from ISO
8859-1 to UTF-8 is very unsafe, and editing all literal strings from
narrow to wide, using `u' prefixes, becomes almost unavoidable.
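
The difference between the two re-encodings can be sketched as follows,
again assuming Python 3 `bytes' for the bytes of a narrow literal as they
sit in the source file.  The helper `reencode' is hypothetical, and the
sample word uses only characters that both ISO 8859 sets share.

```python
def reencode(data, old, new):
    """Convert the bytes of a source from one encoding to another."""
    return data.decode(old).encode(new)


# Bytes of a narrow literal as stored in an ISO 8859-1 source.
source = "caf\xe9".encode("iso-8859-1")

as_latin9 = reencode(source, "iso-8859-1", "iso-8859-15")
as_utf8 = reencode(source, "iso-8859-1", "utf-8")

# ISO 8859-15 leaves this narrow literal byte for byte identical.
print(as_latin9 == source, len(as_latin9))

# UTF-8 changes its length, so code indexing the narrow string may break.
print(len(as_utf8))

# Seen as Unicode, the literal is the same four characters either way.
print(len(source.decode("iso-8859-1")), len(as_utf8.decode("utf-8")))
```

Re-encoding back with reencode(as_utf8, "utf-8", "iso-8859-1") gives the
original bytes again, so the file itself is reversible; it is the narrow
strings inside the program that change meaning.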

There ought to be a way to maintain a single Python source that would
work dependably through re-encoding of the source, without uselessly
relying on wide strings when there is no need for them.  That is,
without marking all literal strings as being Unicode.  Changing encoding
from ISO 8859-1 to UTF-8 should not be a one-way, no-return ticket.

Of course, it is very normal that sources may have to be adapted for the
possibility of a Unicode context.  There should be some good style and
habits for writing re-encodable programs.  Hence this exchange of thoughts.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard