[Python-Dev] default encoding for 8-bit string literals (was Unicode and comparisons)

M.-A. Lemburg mal@lemburg.com
Wed, 05 Apr 2000 17:04:31 +0200


Guido van Rossum wrote:
> 
> > Sigh.  In our company we use 'german' as our master language so
> > we have string literals containing iso-8859-1 umlauts all over the place.
> > Okay as long as we don't mix them with Unicode objects, this doesn't
> > hurt anybody.
> >
> > What I would love to see, would be a well defined way to tell the
> > interpreter to use 'latin-1' as default encoding instead of 'UTF-8'
> > when dealing with string literals from our modules.
> 
> It would be better if this was supported for u"..." literals, so that
> it was taken care of at the source code level completely.  The running
> program shouldn't have to worry about what encoding its source code
> was!

u"..." currently interprets the characters it finds as Latin-1
(this is by design, since the first 256 Unicode ordinals map to
the Latin-1 characters).
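
To illustrate the ordinal mapping, here is a minimal sketch. It assumes the
bytes/str API of later Python versions, not the 1.6 interpreter under
discussion, and the variable names are only illustrative. It checks that
decoding any byte as Latin-1 gives the Unicode code point with the same number:

```python
# Every possible 8-bit value, 0 through 255.
raw = bytes(range(256))

# Decoding as Latin-1 maps each byte to the code point with the same ordinal.
text = raw.decode("latin-1")
same_ordinals = all(ord(ch) == b for ch, b in zip(text, raw))

# Encoding back to Latin-1 restores the original bytes, so nothing is lost.
round_trips = text.encode("latin-1") == raw
```

Both same_ordinals and round_trips should be true, which is why reading
the characters of a u"..." literal as Latin-1 needs no translation table.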
 
> For 8-bit literals, this would mean that if you had source code using
> Latin-1, the literals would be translated from Latin-1 to UTF-8 by the
> code generator.  This would mean that len('ç') would return 2.  I'm
> not sure this is a great idea -- but then I'm not sure that using
> Latin-1 in source code is a great idea either.
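
The length change mentioned above can be sketched as follows. This again
assumes the bytes/str API of later Python versions rather than a 1.6 code
generator, and the names are hypothetical:

```python
# 'ç' as it would appear in a Latin-1 encoded source file: one byte.
source_bytes = b"\xe7"

# Interpret that byte as Latin-1, then re-encode the character as UTF-8.
char = source_bytes.decode("latin-1")
utf8_bytes = char.encode("utf-8")

latin1_len = len(source_bytes)  # length of the literal as written
utf8_len = len(utf8_bytes)      # length after translation to UTF-8
```

Here latin1_len is 1 and utf8_len is 2, since U+00E7 takes the two bytes
0xC3 0xA7 in UTF-8.
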
>
> > The tokenizer in Python 1.6 already contains smart logic to get the
> > size of TABs right (pasting from tokenizer.c): ...
> 
> Before we go any further we should design pragmas.  The current
> approach is inefficient and only designed to accommodate
> editor-specific magical commands.
> 
> I say it's a Python 1.7 issue.

Good idea :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/