differences between 22 and python 23

Bengt Richter bokr at oz.net
Sun Dec 7 13:34:32 EST 2003


On 07 Dec 2003 10:08:37 +0100, martin at v.loewis.de (Martin v. Löwis) wrote:

>bokr at oz.net (Bengt Richter) writes:
>
>> Ok, I'm happy with that. But let's see where the errors come from.
>> By definition it's from associating the wrong encoding assumption
>> with a pure byte sequence. 
>
>Wrong. Errors may also happen when performing unexpected conversions
>from one encoding to a different one.
ISTM that could only happen e.g. if you explicitly called codecs to
convert between incompatible encodings. That is normal, just like

 >>> float(10**309)
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 OverflowError: long int too large to convert to float

is a normal result. Otherwise "unexpected conversions" must happen
because of some string expression involving strings of different encodings,
in which case it is like

 >>> 10**308 * 10.0
 1.#INF

(Which BTW could be argued (but I won't ;-) should also be the result of float(10**309)).
So here is a case where information was lost, because there was no information-preserving
representation available. If we use an exact numeric form, e.g. an exact decimal type I was
experimenting with, we can do

 >>> from ut.exactdec import ED
 >>> ED(10**308) * ED(10.0, 'all')
 ED('1.0e309')

(The 'all' is a rounding parameter telling the constructor to capture all available accuracy
bits from a floating point arg and call the result exact. Integers and longs are naturally
exact, so they don't require that parameter.)
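(ut.exactdec was a private experiment, but the point carries over to the stdlib's exact
rational type, which likewise captures all available bits of a float argument; a rough
modern stand-in:)

```python
# Exact rational arithmetic avoids the float overflow shown above.
# Fraction(10.0) captures the float's exact value, much like ED(10.0, 'all');
# integer arguments are naturally exact.
from fractions import Fraction

x = Fraction(10**308) * Fraction(10.0)
assert x == 10**309          # exact, no OverflowError, no infinity
```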

Anyway, that's analogous to an expression involving, e.g., 

    s1 = u'abc'.encode('utf-8')
and
    s2 = u'def'.encode('latin-1')

In my scenario, you would have

    assert s1.coding == 'utf-8'
    assert s1.bytes() == a'abc' # bytes() gets the encoded byte sequence as pure str bytes
    assert s2.coding == 'latin-1'
    assert s2.bytes() == a'def'

so
    s3 = s1 + s2

would imply

    s3 = (s1.bytes().decode(s1.coding) + s2.bytes().decode(s2.coding)).encode(cenc(s1.coding, s2.coding))

where cenc is a function something like (sketch)

    def cenc(enc1, enc2):
        """return common encoding"""
        if enc1==enc2: return enc1 # this makes latin-1 + latin-1 => latin-1, etc.
        if enc1 is None: enc1 = 'ascii' # notorious assumption ;-)
        if enc2 is None: enc2 = 'ascii' # ditto
        if enc1[:3] == 'utf': return enc1 # preserve unicode encoding format of s1 preferentially
        return 'utf' # generic system utf encoding

which in the above example would get you

    assert s3.coding == 'utf-8'
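(That combined rule can be exercised today with real codecs if we carry the coding
alongside the bytes by hand; concat and the tuple representation below are just a
stand-in for the proposed s1 + s2, not an existing API:)

```python
# Demonstration of the proposed concatenation semantics using real codecs;
# .coding and .bytes() are hypothetical, so here each encoded string is
# modelled as a plain (bytes, coding) tuple.

def cenc(enc1, enc2):
    """Return a common encoding for two coded strings (sketch)."""
    if enc1 == enc2:
        return enc1              # latin-1 + latin-1 => latin-1, etc.
    if enc1 is None:
        enc1 = 'ascii'           # notorious assumption ;-)
    if enc2 is None:
        enc2 = 'ascii'
    if enc1[:3] == 'utf':
        return enc1              # preserve s1's unicode encoding preferentially
    return 'utf-8'               # generic fallback

def concat(s1, s2):
    """(bytes, coding) + (bytes, coding) -> (bytes, coding)."""
    b1, c1 = s1
    b2, c2 = s2
    c3 = cenc(c1, c2)
    return ((b1.decode(c1) + b2.decode(c2)).encode(c3), c3)

s1 = ('abc'.encode('utf-8'), 'utf-8')
s2 = ('def'.encode('latin-1'), 'latin-1')
s3 = concat(s1, s2)
assert s3 == (b'abcdef', 'utf-8')
```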

>
>> 1a. Available unambiguous encoding information not matching the
>>     default assumption was dropped. This is IMO the most likely.
>> 1b. The byte sequence came from an unspecified source and never got explicit encoding info associated.
>>     This is probably a bug or application design flaw, not a python problem.
>
>1b. is the most likely case. Any byte stream read operation (file,
>socket, zipfile) will return byte streams of unspecified encoding.
But this is not an error. An error would only arise if one tried to use
the bytes as characters without specifying a decoding.

>
>> IMO a large part of the answer will be not to drop available
>> encoding info.
>
>Right. And this is very difficult, making the entire approach
>unimplementable.
ISTM it doesn't have to be all or nothing.

>
>> I hope an outline of what I am thinking is becoming visible.
>
>Unfortunately, not. You seem to assume that nearly all strings have
>encoding information attached, but you don't explain where you expect
>this information to come from.
Strings appearing as literals in program sources will be assumed to have
the same encoding as is assumed or explicitly specified for the source text.
IMO that will cover a lot of strings not now covered, and will be an improvement
even if it doesn't cover everything.

>
>> >As I said: What would be the meaning of concatenating strings, if both
>> >strings have different encodings?
>> If the strings have encodings, the semantics are the semantics of character
>> sequences with possibly heterogenous representations. 
>
>??? What is a "possibly heterogenous representation", how do I
>implement it, and how do I use it?
See example s3 = s1 + s2 above.

>
>Are you suggesting that different bytes in a single string should use
>different encodings? If not, how does suggesting a heterougenous
>implementation answer the question of how concatenation of strings is
>implemented?
See as before.

>
>> The simplest thing would probably be to choose utf-16le like windows
>> wchar UIAM and normalize all strings that have encodings to that
>
>Again: How does that answer the question what concatenation of strings
>means?
See as before.

>
>Also, if you use utf-16le as the internal encoding of byte strings,
>what is the meaning of indexing? I.e. given a string s='Hallo',
>what is len(s), s[0], s[1]?
If s.coding is None, it's the same as now. Otherwise

    len(s) <-> len(s.decode(s.coding))

e.g.

 >>> s = 'L\xf6wis'
 >>> s8 = s.decode('latin-1').encode('utf-8')
 >>> s8
 'L\xc3\xb6wis'
 >>> len(s8)
 6
 >>> len(s)
 5
 >>> len(s8.decode('utf-8'))
 5

s[0] and s[1] create new encoded strings if they are indexing encoded strings,
and preserve the .coding info. So e.g., in general, when .coding is not None,

    s[i] <-> s.decode(s.coding)[i].encode(s.coding)

(This is semantics, let's not prematurely talk about optimization ;-)

 >>> s8    # current display of the bytes of utf-8 encoding in s8
 'L\xc3\xb6wis'
 >>> s8[0] # wrong when .coding is not None
 'L'
 >>> s8_0 = s8.decode('utf-8')[0].encode('utf-8')
 >>> s8_1 = s8.decode('utf-8')[1].encode('utf-8')
 >>> s8
 'L\xc3\xb6wis'
 >>> s8_0
 'L'
 >>> s8_1
 '\xc3\xb6'

and assert s8_0.coding == s8_1.coding == 'utf-8' would hold for results.
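(The same indexing rule can be exercised with today's real codecs; char_index below is
a hypothetical helper standing in for the proposed s[i]:)

```python
# Proposed rule: s[i] <-> s.decode(s.coding)[i].encode(s.coding).
# One character of a utf-8 encoded string may occupy one byte or several.

def char_index(raw, coding, i):
    return raw.decode(coding)[i].encode(coding)

s8 = 'L\xf6wis'.encode('utf-8')              # b'L\xc3\xb6wis'
assert char_index(s8, 'utf-8', 0) == b'L'
assert char_index(s8, 'utf-8', 1) == b'\xc3\xb6'   # one character, two bytes
```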

>
>> Instead, the latter could become explicit, e.g., by a string prefix. E.g.,
>> 
>>      a'...'
>>
>> meaning a byte string represented by ascii+escapes syntax like
>> current practice (whatever the program source encoding. I.e.,
>> latin-1 non-ascii characters would not be allowed in the literal
>> _source_ representation even if the program source were encoded in
>> latin-1. (of course escapes would be allowed)).
>
>Hmm. This still doesn't answer my question, but now you are extending
>the syntax already.
>
>> IWT .coding attributes/properties would permit combining character
>> strings with different encodings by promoting to an encoding that
>> includes all without information loss.
>
>No, it would not - atleast not unless you specify further details.  If
>I have a latin-1 string ('\xf6'), and a koi-8r string ('\xf6'), and
>concatenate them, what do get?
A sequence of bytes that is an encoding of the _character_ sequence, such that
the encoding is adequate to represent all the characters, e.g.,

 >>> latkoi = ('\xf6'.decode('latin-1') + '\xf6'.decode('koi8_r')).encode('utf')
 >>> latkoi
 '\xc3\xb6\xd0\x96'

with assert latkoi.coding == 'utf-8', since 'utf' seems to be an alias for 'utf-8'.
And since .coding is not None, the resulting string length is

 >>> len(latkoi.decode('utf-8'))
 2

>
>> Of course you cannot arbitrarily combine byte strings b (b.coding==None)
>> with character strings s (s.coding!=None).
>
>So what happens if you try to combine them?
>
>> >2. Convert the result string to UTF-8. This is incompatible with
>> >   earlier Python versions.
>>     Or utf-16xx. I wonder how many mixed-encoding situations there
>>     are in earlier code.  Single-encoding should not require change
>>     of encoding, so it should look like plain concatenation as far
>>     as the byte sequence part is concerned. It might be mostly
>>     transparent.
>
>This approach is incompatible with earlier Python versions even for a
>single encoding. If I have a KOI-8R s='\xf6' (which is the same as
>U+0416), and UTF-16 is the internal represenation, and I do s[0], what
>do I get, and what algorithm is used to compute that result?
If you have s='\xf6' representing a KOI-8R character, you have two pieces of info.
The bare bytes would have s.coding == None. (I suggested a literal format a'\xf6' for pure bytes,
so let's say
    s = a'\xf6'
but you want that interpreted as KOI-8R, so we have to decode it according to that,
and then re-encode to get a byte string with .coding set:

    s = s.bytes().decode('koi8_r').encode('koi8_r')

e.g.
 >>> '\xf6'.decode('koi8_r').encode('koi8_r')
 '\xf6'

(which seems like it could be optimized ;-)

But you mention an "internal representation" of UTF-16. I'm not sure what you mean
(though I assume unicode is internally handled for u'...' strings in utf-16le/wchar_t
format on most PCs), except that you could certainly have a string with that explicit
.coding format, e.g.,

 >>> s16 = '\xf6'.decode('koi8_r').encode('utf-16')
 >>> s16
 '\xff\xfe\x16\x04'

and then assert s16.coding == 'utf-16' would pass ok.
Note the BOM. Still the length (since s16.coding is not None) is

 >>> len(s16.decode('utf-16'))
 1

If s.coding is None, you could say length was len(s.bytes().decode('bytes')) and say
it was optimized away, I suppose.

In other words, when s.coding is not None, you can think of all the possibilities
as alternative representations of s.bytes().decode(s.coding) where .bytes() is a method
to get the raw str bytes of the particular encoding, and even if s.coding is None, you
could use the virtual 'bytes' character set default assumption, so that all strings
have a character interpretation if needed.
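(The virtual 'bytes' character set doesn't exist as a real codec, but latin-1 already
is a 1:1 byte-to-character mapping over U+0000..U+00FF, so it can stand in for it in a
sketch:)

```python
# Every possible byte value round-trips unchanged through latin-1, which is
# exactly the property the hypothetical 'bytes' codec would need.
raw = bytes(range(256))
assert raw.decode('latin-1').encode('latin-1') == raw
```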

>
>>     socket_or_file.read().coding => None
>> 
>> unless some encoding was specified in the opening operation.
>
>So *all* existing socket code would get byte strings, and so would all
>existing file I/O. You will break a lot of code.
Why?
>
>> Remember, if there is encoding, we are semantically dealing with
>> character sequences, so splicing has to be implemented in terms of
>> characters, however represented.
>
>You never mentioned that you expect indexing to operate on characters,
>not bytes. That would be incompatible with current Python, so I was
>assuming that you could not possibly suggest that approach.
It would normally be transparent, I think, and mostly optimize away,
except where you have programs with mixed encodings.

>
>If I summarize your approach:
>- conversion to an internal represenation based on UTF-16
Only as needed. Encodings would not arbitrarily be changed.

>- indexing based on characters, not bytes
if the bytes represent an encoded character sequence as indicated by s.coding, yes,
but the result is another (sub)string with the same s.coding, and might be encoded
as a single byte or several (as sometimes in utf-8). I.e., from above, semantically:

    s[i] <-> s.decode(s.coding)[i].encode(s.coding) # with result having same .coding value

>
>I arrive at the current Unicode type. So what you want is already
>implemented, except for the meaningless 'coding' attribute (it is
>meaningless, as it does not describe a property of the string object).
>
>> >No, in current Python, there is no doubt about the semantics: We
>> >assume *nothing* about the encoding. Instead, if s1 and s2 are <type
>>  ^^^^^^^^^^^^^^^^-- If that is so, why does str have an encode method?
>
>By mistake, IMO. Marc-Andre Lemburg suggested this as a generalization
>of Unicode encodings, allowing arbitrary objects to be encoded - he
>would have considered (3).encode('decimal') a good idea. With the
>current encode method on string objects, you can do things like
>s.encode('base64').
I can see the usefulness of that. I guess you have to envisage a
hidden transparent s.decode('bytes') before the .encode('base64')
so we have a logical round trip. s.decode('bytes') could be conceptualized
as producing unicode in a private range U+E000 .. U+E0ff and encoding that
range of unicode "characters" could do the right thing. Re-encoding as 'bytes'
would restore the original byte sequence, just like any other 1:1 encoding
transformation. You could even design a font, like little boxes with the
hex values in them ;-)
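(In today's Python the same round trip is spelled with the base64 module, which makes
the bytes-to-bytes nature of the transform explicit; a small check:)

```python
import base64

raw = b'L\xc3\xb6wis'            # arbitrary bytes; no character meaning required
enc = base64.b64encode(raw)      # b'TMO2d2lz'
assert base64.b64decode(enc) == raw   # logical round trip, as above
```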
>
>> That's supposed to go from character entities to bytes, I thought ;-)
>
>In a specific case of character codecs, yes. However, this has
>(unfortunately) been generalized to arbitrary two-way conversion
>between arbitrary things.
Well, maybe it can be rationalized via the virtual 'bytes' character encoding
(and which in some contexts might make a better default assumption than 'ascii').

>
>> Which is why I thought some_string.coding attributes to carry that
>> information explicitly would be a good idea.
>
>Yes, it sounds like a good idea. Unfortunately, it is not
>implementable in a meaningful way.
I'm still hoping for something meaningful. See above ;-)

I'm tempted to subclass str, to play with it, but not right now ;-)
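(Here is roughly what that toy subclass could look like in modern terms, where bytes
plays the role of the old str; the .coding attribute, the character-based len, and the
promotion rule are all hypothetical proposal semantics, only the codec machinery is real:)

```python
# A toy of the subclass idea: bytes plus a .coding attribute, with '+'
# implementing the cenc promotion rule sketched earlier.

class CodedStr(bytes):
    def __new__(cls, data, coding=None):
        self = super().__new__(cls, data)
        self.coding = coding
        return self

    def __add__(self, other):
        c1 = self.coding
        c2 = getattr(other, 'coding', None)
        if c1 is None and c2 is None:            # pure bytes + pure bytes
            return CodedStr(bytes(self) + bytes(other))
        c1, c2 = c1 or 'ascii', c2 or 'ascii'    # notorious assumption ;-)
        c3 = c1 if (c1 == c2 or c1.startswith('utf')) else 'utf-8'
        text = bytes(self).decode(c1) + bytes(other).decode(c2)
        return CodedStr(text.encode(c3), c3)

    def __len__(self):                           # character length, not byte length
        if self.coding is None:
            return super().__len__()
        return len(bytes(self).decode(self.coding))

s1 = CodedStr('abc'.encode('utf-8'), 'utf-8')
s2 = CodedStr('d\xf6f'.encode('latin-1'), 'latin-1')
s3 = s1 + s2
assert s3.coding == 'utf-8'
assert bytes(s3) == b'abcd\xc3\xb6f'   # 7 bytes...
assert len(s3) == 6                    # ...but 6 characters
```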

Regards,
Bengt Richter
