differences between 2.2 and python 2.3

Bengt Richter bokr at oz.net
Thu Dec 4 15:53:16 EST 2003


On 04 Dec 2003 20:22:52 +0100, martin at v.loewis.de (Martin v. Löwis) wrote:

>bokr at oz.net (Bengt Richter) writes:
>
>> Still, the actual characters used in the _source_ representation will have to
>> be whatever the -*- xxx -*- thing says, right? -- including the characters
>> in the source representation of a string that might wind up utf-8 internally?
>
>Yes, and no. Yes, characters in the source code have to follow the
>source representation. No, they will not wind up utf-8
>internally. Instead, (byte) string objects have the same byte
>representation that they originally had in the source code.
Then they must have encoding info attached?

>
>The source declaration only matters in the following respects:
>- the source may be erroneous, if the bytes form illegal encodings
>  in the declared source encoding.
>- a unicode object will be created based upon the source encoding,
>  by decoding the bytes in the unicode literal.
>- the meaning of certain bytes might not be what it would be in
>  ASCII. In particular, byte 92 does not always denote a 
>  backslash (\), in all encodings. As a result, if byte 92 appears
>  in a string literal, the end of the string literal might depend
>  on the encoding.
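
A concrete case (assuming a Shift-JIS codec is available; in the 2.3 era the
CJK codecs still shipped as a separate package): U+8868 encodes to two bytes,
the second of which is 0x5C, i.e. the ASCII backslash:

 >>> u'\u8868'.encode('shift_jis')
 '\x95\\'

So a tokenizer scanning for byte 92 without knowing the encoding would
misjudge where such a literal ends.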
>
>> >The byte string type is not going away. It is a useful type, e.g. when
>> >reading or writing to or from a byte stream.
>> >
>> Is this moving towards a single 8-bit str base type with various
>> encoding-specifying subtypes?
>
>I don't think so. If byte strings were tagged with encoding, you
>have to answer many difficult questions, like "what is the result
>of adding strings with different encodings?", or "what encoding tag
Isn't that similar to numeric promotion in 123 + 4.56? We already do that to some extent:
 >>> 'abc' + u'def'
 u'abcdef'

IOW, behind the concrete character representations there are abstract entities
(which the unicode charts systematically match up with abstract entities from
the integer domain), so in the abstract we are representing the concatenation
of abstract character entities of the same universal type (i.e., belonging
to the set of possible characters). The question then becomes what encoding is
adequate to represent the result without information loss.
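
In fact the current promotion only succeeds when the byte string happens to
decode under the default (ASCII) codec, i.e. when no information-loss question
arises:

 >>> '\xe9' + u'def'
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0:
 ordinal not in range(128)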

There could even be analogies to roundoff in e.g. dropping accent marks during
some conversion.
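
E.g., NFKD decomposition followed by an ASCII encode with errors='ignore' is
exactly such a lossy rounding (unicodedata.normalize is new in 2.3):

 >>> import unicodedata
 >>> unicodedata.normalize('NFKD', u'caf\xe9').encode('ascii', 'ignore')
 'cafe'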

But there is another question, and that is whether a concrete encoding of characters
really just represents characters, or whether the intent is actually to represent
a concrete encoding as such (including the info as to which encoding it is). In the
latter case one couldn't convert to a universal character type without loss of information.

IOW, ISTM for literals one would need a way to say:

1. This is a pure character sequence, use the source representation only to determine
   what the abstract character entities (ACEs) are, and represent them as necessary to preserve
   their unified identities.
2. This is a quote-delimited substring of the source text, use the source encoding cookie
   or other governing assumption to determine what the ACEs are, then as in 1.
3. This is an encoding-restricted string literal (though necessarily represented in the concrete
   character encoding of the module source, with escapes as necessary). Determine what the ACEs are,
   using the encoding information to transform as necessary, but store the encoding information along
   with the ACE representation, because the programming intent is to represent the encoding
   as well as the ACE sequence. (A sketch of this follows the list below.)

3a. Alternatively, store the original _source_ as an ACE sequence with the associated _source_ encoding
    AND the encoding called for by the literal. This is tricky to think about, because there are at
    least three encodings to consider: the source, what's called for by the literal, and possible
    internal representations.
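
To make 3 concrete, a sketch of what such a tagged string might look like
(purely hypothetical; EncodedString is not a real Python type, just an
illustration of carrying the "which encoding" info alongside the bytes):

    class EncodedString(str):
        """Byte string tagged with the encoding it purports to be in."""
        def __new__(cls, data, encoding):
            self = str.__new__(cls, data)
            self.encoding = encoding   # the encoding-as-such info of case 3
            return self
        def to_unicode(self):
            # recover the ACE sequence losslessly via the stored tag
            return self.decode(self.encoding)

    latin = EncodedString('caf\xe9', 'latin-1')
    assert latin.to_unicode() == u'caf\xe9'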

>has a string returned from a socket read", etc.
8-bit byte encoding by default, I would think, but if you expand on the idea of cooked
text input, I guess you could specify an encoding much as you specify 'r' vs 'rb' vs 'rU' etc.
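
Something along those lines already exists for files: codecs.open takes an
encoding argument and hands back unicode objects ('data.txt' here is just a
hypothetical file):

 >>> import codecs
 >>> f = codecs.open('data.txt', 'r', 'utf-8')
 >>> s = f.read()    # s is a unicode object, decoded from the utf-8 bytes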

BTW, for convenience, will 8-bit byte encoded strings be repr'd as latin-1 + escapes?
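
(Today the repr escapes every byte outside printable ASCII rather than
assuming latin-1:

 >>> 'caf\xe9'
 'caf\xe9'

i.e. byte 0xe9 shows as the escape, not as an e-acute.)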

>
>Instead, applications should apply encodings whereever needed, using
>Unicode strings for character data, and byte strings for binary data.
>
Still, they have to express that in the encoding(s) of the program sources,
so what will '...' mean? Must it not be normalized to a common internal representation?

BTW, does import see encoding cookies and do the right thing when there are differing ones?
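
My reading of PEP 263 is that each module's cookie governs only that module's
own source text, so differing cookies across imports should coexist fine, e.g.:

    # a.py
    # -*- coding: latin-1 -*-
    s = u'...'    # any non-ASCII bytes here are decoded as latin-1

    # b.py
    # -*- coding: utf-8 -*-
    import a      # a's cookie applies to a.py only, not to b.py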

Regards,
Bengt Richter



