[Python-Dev] default encoding for 8-bit string literals (was Unicode and comparisons)

M.-A. Lemburg mal@lemburg.com
Wed, 05 Apr 2000 13:28:58 +0200


Peter Funk wrote:
> 
> Hi!
> 
> [me]:
> > > From my POV (using ISO Latin-1 all the time) it would be
> > > "intuitive"(TM) to assume ISO Latin-1 when interpreting u'äöü' in a
> > > Python source file so that (u'äöü' == 'äöü') == 1.  This is what I see
> > > on *my* screen, whether there is a 'u' in front of the string or not.
> 
> M.-A. Lemburg:
> > u"äöü" is being interpreted as Latin-1. The problem is the
> > string 'äöü' to the right: during coercion this string is
> > being interpreted as UTF-8 and this causes the failure.
> >
> > You could say: ok, all my strings use Latin-1, but that would
> > introduce other problems... esp. when you take different
> > modules with different encoding assumptions and try to
> > integrate them into an application.
> 
> Okay.  This wouldn't occur here but we have to deal with this possibility.
> 
> > > In dist/src/Misc/unicode.txt you wrote:
> > >
> > > > Note that you should provide some hint to the encoding you used to
> > > > > write your programs as a pragma line in one of the first few comment lines
> > > > of the source file (e.g. '# source file encoding: latin-1').
> 
> [me]:
> > > The upcoming 1.6 documentation should probably clarify whether
> > > the interpreter pays attention to "pragma"s or not.
> > > This is otherwise misleading.
> >
> > This "pragma" is nothing more than a hint for the source code
> > reader to switch his viewing encoding. The interpreter doesn't
> > treat the file differently. In fact, Python source code is
> > > supposed to be 7-bit ASCII!
> 
> Sigh.  In our company we use 'german' as our master language so
> we have string literals containing iso-8859-1 umlauts all over the place.
> Okay as long as we don't mix them with Unicode objects, this doesn't
> hurt anybody.
> 
> What I would love to see, would be a well defined way to tell the
> interpreter to use 'latin-1' as default encoding instead of 'UTF-8'
> when dealing with string literals from our modules.
> 
> The tokenizer in Python 1.6 already contains smart logic to get the
> size of TABs right (pasting from tokenizer.c):
> 
>         /* Skip comment, while looking for tab-setting magic */
>         if (c == '#') {
>                 static char *tabforms[] = {
>                         "tab-width:",           /* Emacs */
>                         ":tabstop=",            /* vim, full form */
>                         ":ts=",                 /* vim, abbreviated form */
>                         "set tabsize=",         /* will vi never die? */
>                 /* more templates can be added here to support other editors */
>                 };
> ..
> 
> It wouldn't be too hard to add something there to recognize
> other "pragma" comments like for example:
>         #content-transfer-encoding: iso-8859-1
> But what to do with it?  Maybe adding a default encoding to every string
> object?  Is this bloat?  Just an idea.

As I have already indicated above, this would only solve
the problem of string literals in Python source code.
It would not, however, solve the problem with strings in general,
since these can be built dynamically or from user input.
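
To make this concrete, here is a minimal sketch. It assumes a modern
Python 3 interpreter, where bytes and str stand in for the 8-bit
strings and Unicode objects discussed here; try_decode and
runtime_bytes are hypothetical names used only for illustration.
Bytes that arrive at run time carry no encoding information, so a
per-file pragma cannot say how to interpret them:

```python
# Assumption: Python 3 bytes/str stand in for the 8-bit strings and
# Unicode objects discussed here; the names below are illustrative only.

def try_decode(data, encoding):
    """Return the decoded text, or None if data is invalid in encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# The three umlauts as Latin-1 bytes, e.g. read from a file or a socket
# at run time, where no source-file pragma could describe them.
runtime_bytes = b"\xe4\xf6\xfc"

as_latin1 = try_decode(runtime_bytes, "latin-1")  # three characters
as_utf8 = try_decode(runtime_bytes, "utf-8")      # None: not valid UTF-8
```

The same bytes are fine under one assumption and invalid under the
other, which is the failure seen in the comparison above.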

The only way I can see for a #pragma to work here is by
auto-converting all static strings in the source code to Unicode,
and that would probably break more code than it fixes. Even
worse, writing 'abc' in such a program would essentially
mean the same thing as u'abc'.

I'd suggest turning your Latin-1 strings into Unicode...
this will hurt at first, but in the long run, you win.
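
One possible shape for that conversion is sketched below, again
assuming a modern Python 3 interpreter rather than 1.6. The helper
to_unicode and its latin-1 default are assumptions made for this
example, not an existing API: the idea is only to decode Latin-1 data
once, at the boundary, and to work with Unicode everywhere else.

```python
# Assumption: a hypothetical helper, not part of any library. It decodes
# 8-bit data explicitly and passes already-decoded text through unchanged.

def to_unicode(value, encoding="latin-1"):
    """Decode bytes with the given encoding; return str values as-is."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw = b"Gr\xfc\xdfe"          # a German greeting stored as Latin-1 bytes
text = to_unicode(raw)        # decoded once, at the boundary
same = to_unicode(text)       # already Unicode, so nothing changes
```

After this step all comparisons happen between Unicode values, so no
implicit coercion (and no encoding guess) is involved.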

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/