[Python-Dev] default encoding for 8-bit string literals (was Unicode and comparisons)
Peter Funk
pf@artcom-gmbh.de
Wed, 5 Apr 2000 12:42:37 +0200 (MEST)
Hi!
[me]:
> > From my POV (using ISO Latin-1 all the time) it would be
> > "intuitive"(TM) to assume ISO Latin-1 when interpreting u'äöü' in a
> > Python source file so that (u'äöü' == 'äöü') == 1. This is what I see
> > on *my* screen, whether there is a 'u' in front of the string or not.
M.-A. Lemburg:
> u"äöü" is being interpreted as Latin-1. The problem is the
> string 'äöü' to the right: during coercion this string is
> being interpreted as UTF-8 and this causes the failure.
>
> You could say: ok, all my strings use Latin-1, but that would
> introduce other problems... esp. when you take different
> modules with different encoding assumptions and try to
> integrate them into an application.
Okay. This wouldn't occur here, but we have to deal with this possibility.
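To make the failure concrete, here is a minimal sketch. It is written in present-day Python 3 notation, which is an assumption on my part: a bytes object stands in for the 8-bit string literal, and the helper name decode_or_none is made up for illustration. It only shows why the same three bytes decode fine as Latin-1 but are rejected as UTF-8; it is not what the 1.6 coercion code does internally.

```python
# The bytes of 'äöü' as stored in an ISO Latin-1 encoded source file.
latin1_bytes = b"\xe4\xf6\xfc"


def decode_or_none(data, encoding):
    """Hypothetical helper: decode data, or return None if it is invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Latin-1 maps every byte to exactly one character, so this always succeeds.
as_latin1 = decode_or_none(latin1_bytes, "latin-1")

# In UTF-8, 0xE4 announces a three-byte sequence, but 0xF6 is not a valid
# continuation byte, so decoding fails.
as_utf8 = decode_or_none(latin1_bytes, "utf-8")

print(as_latin1 == "\u00e4\u00f6\u00fc")
print(as_utf8)
```

So the comparison can only work if both sides agree on Latin-1, which is exactly the per-module assumption that causes trouble once modules are mixed.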
> > In dist/src/Misc/unicode.txt you wrote:
> >
> > > Note that you should provide some hint to the encoding you used to
> > > write your programs as a pragma line in one of the first few comment lines
> > > of the source file (e.g. '# source file encoding: latin-1').
[me]:
> > The upcoming 1.6 documentation should probably clarify whether
> > the interpreter pays attention to "pragma"s or not.
> > This is otherwise misleading.
>
> This "pragma" is nothing more than a hint for the source code
> reader to switch his viewing encoding. The interpreter doesn't
> treat the file differently. In fact, Python source code is
> supposed to be 7-bit ASCII!
Sigh. In our company we use German as our master language, so
we have string literals containing ISO-8859-1 umlauts all over the place.
Okay, as long as we don't mix them with Unicode objects, this doesn't
hurt anybody.
What I would love to see is a well-defined way to tell the
interpreter to use 'latin-1' as the default encoding instead of 'UTF-8'
when dealing with string literals from our modules.
The tokenizer in Python 1.6 already contains smart logic to get the
size of TABs right (pasting from tokenizer.c):
    /* Skip comment, while looking for tab-setting magic */
    if (c == '#') {
        static char *tabforms[] = {
            "tab-width:",       /* Emacs */
            ":tabstop=",        /* vim, full form */
            ":ts=",             /* vim, abbreviated form */
            "set tabsize=",     /* will vi never die? */
        /* more templates can be added here to support other editors */
        };
        ..
It wouldn't be too hard to add something there to recognize
other "pragma" comments, for example:
#content-transfer-encoding: iso-8859-1
But what to do with it? Maybe add a default encoding to every string
object? Is this bloat? Just an idea.
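To illustrate the recognition step only, here is a rough sketch in Python rather than in the C tokenizer. Everything in it is an assumption for the sake of the example: the function name find_source_encoding, the limit of five leading lines, the rule that the pragma must sit in the leading comment block, and the UTF-8 fallback. It says nothing about how the interpreter should then use the result.

```python
import re

# Hypothetical pragma form, modelled on the example comment above.
PRAGMA = re.compile(r"^#\s*content-transfer-encoding:\s*([-\w.]+)", re.IGNORECASE)


def find_source_encoding(lines, default="utf-8", max_lines=5):
    """Return the encoding named in a leading pragma comment, else default."""
    for line in lines[:max_lines]:
        stripped = line.strip()
        if not stripped.startswith("#"):
            break  # assumption: the pragma must be in the leading comments
        match = PRAGMA.match(stripped)
        if match:
            return match.group(1).lower()
    return default


source = [
    "#!/usr/bin/env python",
    "#content-transfer-encoding: iso-8859-1",
    "name = 'example'",
]
encoding = find_source_encoding(source)

# With the encoding known, the 8-bit bytes of a literal can be decoded.
umlauts = b"\xe4\xf6\xfc".decode(encoding)
print(encoding)
```

A file without such a comment simply falls back to the default, so existing modules would not be affected by a scheme like this.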
Regards, Peter