differences between python 2.2 and python 2.3
Martin v. Löwis
martin at v.loewis.de
Sun Dec 7 04:08:37 EST 2003
bokr at oz.net (Bengt Richter) writes:
> Ok, I'm happy with that. But let's see where the errors come from.
> By definition it's from associating the wrong encoding assumption
> with a pure byte sequence.
Wrong. Errors may also happen when performing unexpected conversions
from one encoding to a different one.
> 1a. Available unambiguous encoding information not matching the
> default assumption was dropped. This is IMO the most likely.
> 1b. The byte sequence came from an unspecified source and never got explicit encoding info associated.
> This is probably a bug or application design flaw, not a python problem.
1b. is the most likely case. Any byte stream read operation (file,
socket, zipfile) will return byte streams of unspecified encoding.
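In today's Python 3 terms (not the 2003 API under discussion), the point can be sketched like this: a binary read returns plain bytes with no encoding information attached, and nothing like a `coding` attribute exists.

```python
import io

# A binary stream (file, socket, zipfile member) yields raw bytes;
# nothing records which encoding, if any, produced them.
stream = io.BytesIO(b'\xf6\xf6')  # could be latin-1, koi8-r, ...
data = stream.read()

print(type(data).__name__)      # bytes
print(hasattr(data, 'coding'))  # False: no encoding attribute exists
```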
> IMO a large part of the answer will be not to drop available
> encoding info.
Right. And this is very difficult, making the entire approach
unimplementable.
> I hope an outline of what I am thinking is becoming visible.
Unfortunately, not. You seem to assume that nearly all strings have
encoding information attached, but you don't explain where you expect
this information to come from.
> >As I said: What would be the meaning of concatenating strings, if both
> >strings have different encodings?
> If the strings have encodings, the semantics are the semantics of character
> sequences with possibly heterogeneous representations.
??? What is a "possibly heterogeneous representation", how do I
implement it, and how do I use it?
Are you suggesting that different bytes in a single string should use
different encodings? If not, how does suggesting a heterogeneous
implementation answer the question of how concatenation of strings is
implemented?
> The simplest thing would probably be to choose utf-16le like windows
> wchar UIAM and normalize all strings that have encodings to that
Again: How does that answer the question what concatenation of strings
means?
Also, if you use utf-16le as the internal encoding of byte strings,
what is the meaning of indexing? I.e. given a string s='Hallo',
what is len(s), s[0], s[1]?
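A sketch of the ambiguity, using Python 3 byte strings for illustration: if 'Hallo' were held internally as UTF-16LE bytes, byte-level length and indexing no longer match character-level expectations.

```python
# If 'Hallo' were stored as UTF-16LE bytes, byte-level indexing and
# character-level indexing give different answers.
raw = 'Hallo'.encode('utf-16-le')

print(len(raw))   # 10 bytes, not 5 characters
print(raw[0:2])   # b'H\x00' -- one character spans two bytes
print(raw[1])     # 0 -- a lone NUL byte, meaningless on its own
```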
> Instead, the latter could become explicit, e.g., by a string prefix. E.g.,
>
> a'...'
>
> meaning a byte string represented by ascii+escapes syntax like
> current practice (whatever the program source encoding. I.e.,
> latin-1 non-ascii characters would not be allowed in the literal
> _source_ representation even if the program source were encoded in
> latin-1. (of course escapes would be allowed)).
Hmm. This still doesn't answer my question, but now you are extending
the syntax already.
> IWT .coding attributes/properties would permit combining character
> strings with different encodings by promoting to an encoding that
> includes all without information loss.
No, it would not - at least not unless you specify further details. If
I have a latin-1 string ('\xf6'), and a koi-8r string ('\xf6'), and
concatenate them, what do I get?
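For illustration, in Python 3 terms (which resolve the question by decoding to Unicode first): the single byte 0xF6 names different characters under the two encodings, so any "promotion" rule would have to decode both sides before combining.

```python
# The same byte 0xF6 names different characters under different encodings.
latin = b'\xf6'.decode('latin-1')  # 'ö' (U+00F6)
koi = b'\xf6'.decode('koi8-r')     # 'Ж' (U+0416)

combined = latin + koi             # promotion to Unicode: 'öЖ'
print(combined, [hex(ord(c)) for c in combined])
```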
> Of course you cannot arbitrarily combine byte strings b (b.coding==None)
> with character strings s (s.coding!=None).
So what happens if you try to combine them?
> >2. Convert the result string to UTF-8. This is incompatible with
> > earlier Python versions.
> Or utf-16xx. I wonder how many mixed-encoding situations there
> are in earlier code. Single-encoding should not require change
> of encoding, so it should look like plain concatenation as far
> as the byte sequence part is concerned. It might be mostly
> transparent.
This approach is incompatible with earlier Python versions even for a
single encoding. If I have a KOI-8R s='\xf6' (which is the same as
U+0416), and UTF-16 is the internal representation, and I do s[0], what
do I get, and what algorithm is used to compute that result?
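The incompatibility can be made concrete in Python 3 terms: under the old byte semantics s[0] is the byte 0xF6, while under character semantics (decoding as KOI8-R) it is the character Ж.

```python
s = b'\xf6'  # Python 2's str was a byte string like this

# Byte semantics (old behaviour): the first element is the byte 0xF6.
print(s[0])                  # 246

# Character semantics after decoding as KOI8-R: the first element is 'Ж'.
u = s.decode('koi8-r')
print(u[0], hex(ord(u[0])))  # Ж 0x416
```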
> socket_or_file.read().coding => None
>
> unless some encoding was specified in the opening operation.
So *all* existing socket code would get byte strings, and so would all
existing file I/O. You will break a lot of code.
> Remember, if there is encoding, we are semantically dealing with
> character sequences, so splicing has to be implemented in terms of
> characters, however represented.
You never mentioned that you expect indexing to operate on characters,
not bytes. That would be incompatible with current Python, so I was
assuming that you could not possibly suggest that approach.
If I summarize your approach:
- conversion to an internal representation based on UTF-16
- indexing based on characters, not bytes
I arrive at the current Unicode type. So what you want is already
implemented, except for the meaningless 'coding' attribute (it is
meaningless, as it does not describe a property of the string object).
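Indeed, modern Python's str (the successor of the Unicode type) behaves exactly this way; a minimal illustration, decoding each byte string with its known encoding and then combining:

```python
# Decode each byte string with its known encoding, then combine:
a = b'\xf6'.decode('latin-1')  # 'ö'
b = b'\xf6'.decode('koi8-r')   # 'Ж'
text = a + b

print(len(text))  # 2 -- length and indexing count characters
print(text[1])    # 'Ж'
```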
> >No, in current Python, there is no doubt about the semantics: We
> >assume *nothing* about the encoding. Instead, if s1 and s2 are <type
> ^^^^^^^^^^^^^^^^-- If that is so, why does str have an encode method?
By mistake, IMO. Marc-Andre Lemburg suggested this as a generalization
of Unicode encodings, allowing arbitrary objects to be encoded - he
would have considered (3).encode('decimal') a good idea. With the
current encode method on string objects, you can do things like
s.encode('base64').
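In Python 2, byte strings really did accept s.encode('base64'); in Python 3 that generalized use survives only through the bytes-to-bytes codecs reachable via codecs.encode and codecs.decode. A small illustration:

```python
import codecs

# Python 2 allowed s.encode('base64') on byte strings. In Python 3 the
# bytes-to-bytes codecs are reached through codecs.encode/decode instead.
data = b'hello'
enc = codecs.encode(data, 'base64')
print(enc)  # b'aGVsbG8=\n'
assert codecs.decode(enc, 'base64') == data
```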
> That's supposed to go from character entities to bytes, I thought ;-)
In a specific case of character codecs, yes. However, this has
(unfortunately) been generalized to arbitrary two-way conversion
between arbitrary things.
> Which is why I thought some_string.coding attributes to carry that
> information explicitly would be a good idea.
Yes, it sounds like a good idea. Unfortunately, it is not
implementable in a meaningful way.
Regards,
Martin
More information about the Python-list
mailing list