Why asci-only symbols?

Bengt Richter bokr at oz.net
Sun Oct 16 23:32:31 EDT 2005


On Sun, 16 Oct 2005 12:16:58 +0200, "Martin v. Löwis" <martin at v.loewis.de> wrote:

>Bengt Richter wrote:
>> Perhaps string equivalence in keys will be treated like numeric equivalence?
>> I.e., a key/name representation is established by the initial key/name binding, but
>> values can be retrieved by "equivalent" key/names with different representations
>> like unicode vs ascii or latin-1 etc.?
>
>That would require that you know the encoding of a byte string; this
>information is not available at run-time.
>
Well, what will be assumed about name after the lines

#-*- coding: latin1 -*-
name = 'Martin Löwis' 

?
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be the encoding the
coding cookie specified? I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the sources of the strings, but we are talking
about future stuff here ;-)
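For concreteness, here is what that literal boils down to, sketched in modern terms (the b'' notation is an anachronism for the str type of the day): the raw latin-1 bytes survive, but the encoding they came from does not.

```python
# The latin-1 source literal reduces to raw latin-1 bytes, with no
# record of the encoding they were written in.
name = 'Martin Löwis'.encode('latin-1')
assert name == b'Martin L\xf6wis'
# Only by supplying the encoding again can the text be recovered:
assert name.decode('latin-1') == 'Martin Löwis'
```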

>You could also try all possible encodings to see whether the strings
>are equal if you chose the right encoding for each one. This would
>be both expensive and unlike numeric equivalence: in numeric 
>equivalence, you don't give a sequence of bytes all possible
>interpretations to find some interpretation in which they are
>equivalent, either.
>
Agreed, that would be a mess.

>There is one special case, though: when comparing a byte string
>and a Unicode string, the system default encoding (i.e. ASCII)
>is assumed. This only really works if the default encoding
>really *is* ASCII. Otherwise, equal strings might not hash
>equal, in which case you wouldn't find them properly in a
>dictionary.
>
Perhaps the str (or future byte) type could have an encoding attribute
defaulting to None, meaning its instances are treated as current str instances are.
Setting the attribute to some particular encoding, like 'latin-1' (probably
normalized and optimized internally to a C pointer slot holding NULL or a pointer to the
appropriate codec or whatever), would make the str byte string explicitly an encoded
string, without changing the byte string data or converting to a Unicode encoding.
With encoding information explicitly present or absent, keys could have a normalized
hash and comparison, maybe just normalizing encoding-tagged string keys to the
platform UTF by default for dict purposes.
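A minimal sketch of that normalized hash/comparison idea, using a hypothetical EncodedBytes wrapper (not a real Python type; UTF-8 is assumed as the normalization encoding):

```python
class EncodedBytes:
    """Hypothetical byte string carrying an optional encoding tag."""
    def __init__(self, data, encoding=None):
        self.data = data          # raw bytes, never changed by tagging
        self.encoding = encoding  # None means "plain byte string"

    def _normalized(self):
        # Normalize to UTF-8 for hashing/comparison when the encoding is known.
        if self.encoding is None:
            return self.data
        return self.data.decode(self.encoding).encode('utf-8')

    def __hash__(self):
        return hash(self._normalized())

    def __eq__(self, other):
        if isinstance(other, EncodedBytes):
            return self._normalized() == other._normalized()
        return NotImplemented

# The same text tagged with two different encodings compares and hashes equal,
# so either representation finds the same dict entry:
latin = EncodedBytes('Martin Löwis'.encode('latin-1'), 'latin-1')
utf = EncodedBytes('Martin Löwis'.encode('utf-8'), 'utf-8')
assert latin == utf and hash(latin) == hash(utf)
d = {latin: 1}
assert d[utf] == 1
```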

If this were done, IWT the automatic result of

#-*- coding: latin1 -*-
name = 'Martin Löwis' 

could be that name.encoding == 'latin-1'

whereas without the encoding cookie, the default encoding assumption
for the program source would be used, and set explicitly to 'ascii'
or whatever it is.

Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.
Ditto for e.g. '%s == %c' % (65, 65)
And
    s = u'Martin Löwis'.encode('latin-1')
would get
    s.encoding == 'latin-1'
not
    s.encoding == None
so that the encoding information could make
    print s
mean
    print s.decode(s.encoding)
(which of course would re-encode to the output device's encoding for output, like the current
print s.decode('latin-1'), and not fail like the current default assumption for s's encoding,
which is s.encoding==None, i.e., assume the default, likely amounting to print s.decode('ascii'))

Hm, probably
    s.encode(None)
and
    s.decode(None)
could both mean: retrieve the str byte data unchanged, as a str string with encoding set
to None in the result.

Now when you read a file in binary mode without specifying any encoding assumption, you
would get a str string with .encoding==None, but you could effectively reinterpret-cast it
to any encoding you like by assigning to the encoding attribute. The attribute
could be a property that triggers decode/encode automatically to create data in the
new encoding. Setting the encoding to or from None would not change the data bytes, but
changing between differing explicit encodings would cause a decode/encode.
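A sketch of that property behavior, with a hypothetical TaggedBytes class (the name and semantics are assumptions for illustration, not an existing API):

```python
class TaggedBytes:
    """Hypothetical byte string whose encoding attribute transcodes on change."""
    def __init__(self, data, encoding=None):
        self._data = data
        self._encoding = encoding

    @property
    def data(self):
        return self._data

    @property
    def encoding(self):
        return self._encoding

    @encoding.setter
    def encoding(self, new):
        # None coming or going: a reinterpret-cast, bytes unchanged.
        # Explicit encoding to explicit encoding: actually transcode.
        if self._encoding is not None and new is not None:
            self._data = self._data.decode(self._encoding).encode(new)
        self._encoding = new

raw = TaggedBytes(b'Martin L\xf6wis')   # read "in binary", encoding None
raw.encoding = 'latin-1'                # reinterpret: bytes unchanged
assert raw.data == b'Martin L\xf6wis'
raw.encoding = 'utf-8'                  # explicit -> explicit: transcoded
assert raw.data == b'Martin L\xc3\xb6wis'
```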

This could also support s1+s2: the concatenated string would carry the same encoding
attribute if s1.encoding==s2.encoding; otherwise each operand would be promoted to the
platform's standard Unicode encoding and those would be concatenated (with the chosen
Unicode encoding recorded in the result's encoding attribute).
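That s1+s2 rule might look like this sketch, written as a bare helper over (bytes, encoding) pairs; UTF-8 stands in for the platform Unicode encoding, and only explicitly tagged strings are handled:

```python
def concat(a, a_enc, b, b_enc):
    """Concatenate two encoding-tagged byte strings per the proposed rule."""
    if a_enc == b_enc:
        # Same tag: plain byte concatenation, tag preserved.
        return a + b, a_enc
    # Different tags: promote both to the platform Unicode encoding
    # (UTF-8 assumed) and record that encoding in the result.
    promoted = (a.decode(a_enc).encode('utf-8') +
                b.decode(b_enc).encode('utf-8'))
    return promoted, 'utf-8'

same, same_enc = concat(b'abc', 'latin-1', b'def', 'latin-1')
assert (same, same_enc) == (b'abcdef', 'latin-1')

mixed, mixed_enc = concat(b'L\xf6wis', 'latin-1', b' L\xc3\xb6wis', 'utf-8')
assert mixed == 'Löwis Löwis'.encode('utf-8')
assert mixed_enc == 'utf-8'
```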

This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)

Regards,
Bengt Richter



More information about the Python-list mailing list