[I18n-sig] Re: [Python-Dev] Re: Unicode debate

Guido van Rossum guido@python.org
Fri, 28 Apr 2000 14:31:19 -0400


> [GvR, on string.encoding ]
> >Marc-Andre took this idea a bit further, but I think it's not
> >practical given the current implementation: there are too many places
> >where the C code would have to be changed in order to propagate the
> >string encoding information,

[JvR]
> I may miss something, but the encoding attr just travels with the string
> object, no? Like I said in my reply to MAL, I think it's undesirable to do
> *anything* with the encoding attr if not in combination with a unicode
> string.

But just propagating affects every string op -- s+s, s*n, s[i], s[:],
s.strip(), s.split(), s.lower(), ...
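
To make the propagation cost concrete, here is a minimal sketch in
present-day Python. The class TaggedBytes and its encoding attribute
are hypothetical, invented for this illustration; they are not the
implementation discussed in this thread. Only concatenation is taught
to carry the tag, and every other operation silently drops it:

```python
class TaggedBytes(bytes):
    """Hypothetical 8-bit string that carries an encoding tag."""

    def __new__(cls, data, encoding="latin-1"):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        # Concatenation is overridden, so the tag survives here.
        return TaggedBytes(bytes(self) + bytes(other), self.encoding)


s = TaggedBytes(b" caf\xe9 ", "latin-1")

concatenated = s + s    # goes through our override: tag kept
sliced = s[:2]          # built-in slicing returns plain bytes: tag lost
stripped = s.strip()    # same for strip()
repeated = s * 2        # same for repetition
lowered = s.lower()     # same for lower()
```

Each operation that should keep the tag needs its own override, which
is the point being made about the C code.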

> >and there are too many sources of strings
> >with unknown encodings to make it very useful.
> 
> That's why the default encoding must be settable as well, as Fredrik
> suggested.

I'm open to debate about this.  There's just something about a
changeable global default encoding that worries me -- like any global
property, it requires conventions and defensive programming to make
things work in larger programs.  For example, a module that deals with
Latin-1 strings can't just set the default encoding to Latin-1: it
might be imported by a program that needs it to be UTF-8.  This model
is currently used by the locale in C, where all locale properties are
global, and it doesn't work well.  For example, Python needs to jump
through a lot of hoops so that Python numeric literals use "." for the
decimal indicator even if the user's locale specifies "," -- we can't
change Python to swap the meaning of "." and "," in all contexts.
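
A small sketch of the conflict, in present-day Python. The functions
set_default_encoding, to_text and latin1_module_init are hypothetical
stand-ins for a process-wide setting and a library that changes it;
they are not a real Python API:

```python
# Hypothetical process-wide default; stands in for any global setting.
_default_encoding = "utf-8"


def set_default_encoding(name):
    global _default_encoding
    _default_encoding = name


def to_text(data):
    # Every caller in the process shares the same default.
    return data.decode(_default_encoding)


def latin1_module_init():
    # A library that assumes Latin-1 flips the global when imported.
    set_default_encoding("latin-1")


payload = "\u00e9".encode("utf-8")   # the two bytes b"\xc3\xa9"

before = to_text(payload)   # the main program expects UTF-8
latin1_module_init()        # an unrelated module gets imported
after = to_text(payload)    # same call, now decoded as Latin-1
```

The main program did nothing wrong, yet its result changed from one
character to two because some other module touched the global.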

So I think that a changeable default encoding is of limited value.
That's different from being able to set the *source file* encoding --
this only affects Unicode string literals.

> >Plus, it would slow down 8-bit string ops.
> 
> Not if you ignore it most of the time, and just pass it along when
> concatenating.

And slicing, and indexing, and...

> >I have a better idea: rather than carrying around 8-bit strings with
> >an encoding, use Unicode literals in your source code.
> 
> Explain that to newbies... My guess is that they will want simple 8 bit
> strings in their native encoding. Dunno.

If they are happy with their native 8-bit encoding, there's no need
for them to ever use Unicode objects in their program, so they should
be fine.  8-bit strings aren't ever interpreted or encoded except when
mixed with Unicode objects.
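
As a rough illustration using present-day Python 3 (an assumption of
this sketch, and stricter than the scheme described here: Python 3
refuses to mix the two types instead of converting), pure 8-bit
operations never consult an encoding, and one is needed only at the
point where bytes meet Unicode text:

```python
data = b"na\xefve"          # 8-bit data; nothing records its encoding

# Byte-wise operations need no codec. Only ASCII letters are touched;
# the byte 0xEF passes through uninterpreted.
shouted = data.upper()
joined = data + b"!"

# Mixing with a Unicode string is where an encoding becomes necessary.
try:
    data + "!"
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Naming the encoding explicitly resolves it.
text = data.decode("latin-1") + "!"
```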

> >If the source
> >encoding is known, these will be converted using the appropriate
> >codec.
> >
> >If you object to having to write u"..." all the time, we could say
> >that "..." is a Unicode literal if it contains any characters with the
> >top bit on (of course the source file encoding would be used just like
> >for u"...").
> 
> Only if "\377" would still yield an 8-bit string, for binary goop...

Correct.

--Guido van Rossum (home page: http://www.python.org/~guido/)