[Python-Dev] default encoding for 8-bit string literals (was Unicode and comparisons)

Wed, 5 Apr 2000 12:42:37 +0200 (MEST)

Hi!

[me]:
> > From my POV (using ISO Latin-1 all the time) it would be
> > "intuitive"(TM) to assume ISO Latin-1 when interpreting u'äöü' in a
> > Python source file so that (u'äöü' == 'äöü') == 1.  This is what I see
> > on *my* screen, whether there is a 'u' in front of the string or not.

M.-A. Lemburg:
> u"äöü" is being interpreted as Latin-1. The problem is the
> string 'äöü' to the right: during coercion this string is
> being interpreted as UTF-8 and this causes the failure.
> 
> You could say: ok, all my strings use Latin-1, but that would
> introduce other problems... esp. when you take different
> modules with different encoding assumptions and try to
> integrate them into an application.

Okay.  This wouldn't occur here, but we have to deal with this possibility.
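
To make the failure concrete, here is a minimal sketch in present-day
Python, where byte strings and text are separate types.  That is an
assumption of the sketch: the 1.6 interpreter discussed here coerces
implicitly instead.  The names `raw`, `as_latin1` and `utf8_ok` are made
up for illustration.

```python
# The three umlauts as they are stored in a Latin-1 encoded source file
# (a-umlaut, o-umlaut, u-umlaut).
raw = b"\xe4\xf6\xfc"

# Interpreting the bytes as Latin-1 recovers the intended text.
as_latin1 = raw.decode("latin-1")

# Interpreting the same bytes as UTF-8 fails: 0xe4 announces a
# multi-byte sequence, but 0xf6 is not a valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Both sides of the comparison hold the same bytes; only the encoding
assumed during coercion differs.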

> > In dist/src/Misc/unicode.txt you wrote:
> >
> > > Note that you should provide some hint to the encoding you used to
> > > write your programs as a pragma line in one of the first few comment lines
> > > of the source file (e.g. '# source file encoding: latin-1').

[me]:
> > The upcoming 1.6 documentation should probably clarify whether
> > the interpreter pays attention to "pragma"s or not.
> > This is otherwise misleading.
> 
> This "pragma" is nothing more than a hint for the source code
> reader to switch his viewing encoding. The interpreter doesn't
> treat the file differently. In fact, Python source code is
> supposed to be 7-bit ASCII!

Sigh.  In our company we use German as our master language, so
we have string literals containing ISO-8859-1 umlauts all over the place.
As long as we don't mix them with Unicode objects, this doesn't
hurt anybody.

What I would love to see is a well-defined way to tell the
interpreter to use 'latin-1' as default encoding instead of 'UTF-8'
when dealing with string literals from our modules.

The tokenizer in Python 1.6 already contains smart logic to get the
size of TABs right (pasting from tokenizer.c):

        /* Skip comment, while looking for tab-setting magic */
        if (c == '#') {
                static char *tabforms[] = {
                        "tab-width:",           /* Emacs */
                        ":tabstop=",            /* vim, full form */
                        ":ts=",                 /* vim, abbreviated form */
                        "set tabsize=",         /* will vi never die? */
                /* more templates can be added here to support other editors */
                };
..

It wouldn't be too hard to add something there to recognize
other "pragma" comments, for example:
	#content-transfer-encoding: iso-8859-1
But what to do with it?  Maybe add a default encoding to every string
object?  Is this bloat?  Just an idea.
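
A rough sketch of the recognition step, written in Python rather than in
tokenizer.c for brevity.  Everything in it is hypothetical: the pragma
spelling is just the one suggested above, and `PRAGMA`,
`source_encoding`, `decode_source` and the `max_lines` limit are names
and choices invented for this example, not anything the interpreter
provides.

```python
import re

# Hypothetical pragma syntax, modelled on the comment suggested above.
PRAGMA = re.compile(rb"^\s*#\s*content-transfer-encoding:\s*([-\w.]+)")


def source_encoding(source, default="ascii", max_lines=2):
    """Return the encoding named in a pragma comment near the top of
    `source` (a bytes object), or `default` if none is found."""
    for line in source.splitlines()[:max_lines]:
        match = PRAGMA.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


def decode_source(source):
    """Decode the raw bytes of a module using its declared encoding."""
    return source.decode(source_encoding(source))
```

With this, a module starting with the pragma line above would be decoded
as ISO-8859-1, while a module without it falls back to 7-bit ASCII.
This handles the encoding once per file and so avoids storing an
encoding on every string object.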

Regards, Peter