Why ascii-only symbols?

"Martin v. Löwis" martin at v.loewis.de
Mon Oct 17 19:34:09 EDT 2005


Bengt Richter wrote:
> Well, what will be assumed about name after the lines
> 
> #-*- coding: latin1 -*-
> name = 'Martin Löwis' 
> 
> ?

Are you asking what is assumed about the identifier 'name', or the value
bound to that identifier? Currently, the identifier must be encoded in
latin1 in this source code, and it must consist only of ASCII letters,
digits, and the underscore.

The value of name will be a string consisting of the bytes
4d 61 72 74 69 6e 20 4c f6 77 69 73
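
To check the byte sequence above, here is a minimal sketch. It assumes a
Python 3 interpreter, where a plain literal is a Unicode string and has to
be encoded explicitly; in the Python 2 discussed in this thread the literal
itself would already hold these bytes. The variable names are only
illustrative.

```python
# Sketch for Python 3: encode the text explicitly to obtain the Latin-1 bytes.
text = "Martin L\xf6wis"        # \xf6 is U+00F6, the letter o with diaeresis
data = text.encode("latin-1")   # Latin-1 uses exactly one byte per character

# Render each byte as two lowercase hex digits, separated by spaces.
hex_bytes = " ".join(format(b, "02x") for b in data)
print(hex_bytes)  # 4d 61 72 74 69 6e 20 4c f6 77 69 73
```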

> I know type(name) will be <type 'str'> and in itself contain no encoding information now,
> but why shouldn't the default assumption for literal-generated strings be what the coding
> cookie specified?

That certainly is the assumption: in the source code file on disk,
string literals must be in the encoding named by the coding
declaration. If they aren't (and cannot be interpreted that way), you
get a syntax error.
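
A rough way to see this from within Python is to hand compile() the raw
source bytes. This is only a sketch for a Python 3 interpreter (an
assumption; there the literal becomes a Unicode string rather than a byte
string), and the two source snippets are made up for illustration. The
exact exception type for undecodable source may vary between versions, so
the sketch catches both SyntaxError and decoding errors.

```python
# The same bytes, once with a matching and once with a wrong coding declaration.
latin1_source = b"# -*- coding: latin-1 -*-\nname = 'L\xf6wis'\n"
utf8_claim = b"# -*- coding: utf-8 -*-\nname = 'L\xf6wis'\n"

# Declared as latin-1: byte 0xf6 is a valid character, so the source compiles.
namespace = {}
exec(compile(latin1_source, "<latin1 demo>", "exec"), namespace)
decoded_ok = namespace["name"] == "L\xf6wis"

# Declared as utf-8: a lone 0xf6 byte is not valid UTF-8, so compilation fails.
try:
    compile(utf8_claim, "<utf8 demo>", "exec")
    rejected = False
except (SyntaxError, ValueError):  # UnicodeDecodeError is a ValueError
    rejected = True

print(decoded_ok, rejected)
```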

> I know the current implementation doesn't keep track of the different
> encodings that could reasonably be inferred from the source of the strings, 
> but we are talking about future stuff here ;-)

Ah, so you want the source encoding to be preserved, say as an attribute
of the string literal. This has been discussed many times, and was
always rejected.

Some people reject it because it is overkill: if you want a reliable,
stable representation of characters, you should use Unicode strings.

Others reject it because of semantic difficulties: how would such
strings behave under concatenation, if the encodings are different?
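
To illustrate the concatenation problem, here is a small sketch. It assumes
Python 3, where byte strings are written as bytes objects, and the names
are hypothetical. Two byte strings holding the same text in different
encodings are joined; the result has no single encoding that describes it.

```python
a = "L\xf6wis".encode("latin-1")   # b'L\xf6wis'
b = "L\xf6wis".encode("utf-8")     # b'L\xc3\xb6wis'
mixed = a + b                      # which encoding should this carry?

# Interpreting the result as UTF-8 fails on the Latin-1 half.
try:
    mixed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Interpreting it as Latin-1 never fails, but garbles the UTF-8 half.
as_latin1 = mixed.decode("latin-1")
print(utf8_ok, as_latin1 == "L\xf6wis" * 2)  # False False
```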

> #-*- coding: latin1 -*-
> name = 'Martin Löwis' 
> 
> could be that name.encoding == 'latin-1'

That is not at all intuitive. I would have expected name.encoding
to be 'latin1'.

> Functions that generate strings, such as chr(), could be assumed to create
> a string with the same encoding as the source code for the chr(...) invocation.

What is the source of the chr invocation? If I do chr(param), should I 
use the source where param was computed, or the source where the call
to chr occurs? If the latter, how should the interpreter preserve the
encoding of where the call came from?

What about the many other sources of byte strings (like strings read 
from a file, or received via a socket)?

> This is not a fully developed idea, and there has been discussion on the topic before
> (even between us ;-) but I thought another round might bring out your current thinking
> on it ;-)

My thinking is still the same. It cannot really work, and what little
it could do would not do any good. Just use Unicode strings.
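
As a sketch of what "use Unicode strings" means in practice (written for
Python 3, which is an assumption; in Python 2 the results would be unicode
objects, and the variable names are illustrative): decode bytes once, at
the point where their encoding is known, and work with text from then on.

```python
# The same name arriving in two different encodings.
from_latin1 = b"Martin L\xf6wis".decode("latin-1")
from_utf8 = b"Martin L\xc3\xb6wis".decode("utf-8")

# Once decoded, the original encoding no longer matters.
same_text = from_latin1 == from_utf8
combined = from_latin1 + " / " + from_utf8   # concatenation is well defined
print(same_text, len(from_latin1))  # True 12
```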

Regards,
Martin


