differences between python 2.2 and python 2.3
Martin v. Löwis
martin at v.loewis.de
Sun Dec 7 04:08:37 EST 2003
bokr at oz.net (Bengt Richter) writes:
> Ok, I'm happy with that. But let's see where the errors come from.
> By definition it's from associating the wrong encoding assumption
> with a pure byte sequence.
Wrong. Errors may also happen when performing unexpected conversions
from one encoding to a different one.
> 1a. Available unambiguous encoding information not matching the
> default assumption was dropped. This is IMO the most likely.
> 1b. The byte sequence came from an unspecified source and never got explicit encoding info associated.
> This is probably a bug or application design flaw, not a python problem.
1b. is the most likely case. Any byte stream read operation (file,
socket, zipfile) will return byte streams of unspecified encoding.
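In today's Python 3 terms (not the 2003 API under discussion), the point can be sketched like this: a binary read returns plain bytes with no encoding information attached, and nothing like a `coding` attribute exists.

```python
import io

# A binary stream (file, socket, zipfile member) yields raw bytes;
# nothing records which encoding, if any, produced them.
stream = io.BytesIO(b'\xf6\xf6')  # could be latin-1, koi8-r, ...
data = stream.read()

print(type(data).__name__)      # bytes
print(hasattr(data, 'coding'))  # False: no encoding attribute exists
```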
> IMO a large part of the answer will be not to drop available
> encoding info.
Right. And this is very difficult, making the entire approach
unimplementable.
> I hope an outline of what I am thinking is becoming visible.
Unfortunately, not. You seem to assume that nearly all strings have
encoding information attached, but you don't explain where you expect
this information to come from.
> >As I said: What would be the meaning of concatenating strings, if both
> >strings have different encodings?
> If the strings have encodings, the semantics are the semantics of character
> sequences with possibly heterogeneous representations.
??? What is a "possibly heterogeneous representation", how do I
implement it, and how do I use it?
Are you suggesting that different bytes in a single string should use
different encodings? If not, how does suggesting a heterogeneous
implementation answer the question of how concatenation of strings is
implemented?
> The simplest thing would probably be to choose utf-16le like windows
> wchar UIAM and normalize all strings that have encodings to that
Again: How does that answer the question what concatenation of strings
means?
Also, if you use utf-16le as the internal encoding of byte strings,
what is the meaning of indexing? I.e. given a string s='Hallo',
what is len(s), s[0], s[1]?
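A sketch of the ambiguity, using Python 3 byte strings for illustration: if 'Hallo' were held internally as UTF-16LE bytes, byte-level length and indexing no longer match character-level expectations.

```python
# If 'Hallo' were stored as UTF-16LE bytes, byte-level indexing and
# character-level indexing give different answers.
raw = 'Hallo'.encode('utf-16-le')

print(len(raw))   # 10 bytes, not 5 characters
print(raw[0:2])   # b'H\x00' -- one character spans two bytes
print(raw[1])     # 0 -- a lone NUL byte, meaningless on its own
```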
> Instead, the latter could become explicit, e.g., by a string prefix. E.g.,
>
> a'...'
>
> meaning a byte string represented by ascii+escapes syntax like
> current practice (whatever the program source encoding. I.e.,
> latin-1 non-ascii characters would not be allowed in the literal
> _source_ representation even if the program source were encoded in
> latin-1. (of course escapes would be allowed)).
Hmm. This still doesn't answer my question, but now you are extending
the syntax already.
> IWT .coding attributes/properties would permit combining character
> strings with different encodings by promoting to an encoding that
> includes all without information loss.
No, it would not - at least not unless you specify further details. If
I have a latin-1 string ('\xf6'), and a koi-8r string ('\xf6'), and
concatenate them, what do I get?
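For illustration, in Python 3 terms (which resolve the question by decoding to Unicode first): the single byte 0xF6 names different characters under the two encodings, so any "promotion" rule would have to decode both sides before combining.

```python
# The same byte 0xF6 names different characters under different encodings.
latin = b'\xf6'.decode('latin-1')  # 'ö' (U+00F6)
koi = b'\xf6'.decode('koi8-r')     # 'Ж' (U+0416)

combined = latin + koi             # promotion to Unicode: 'öЖ'
print(combined, [hex(ord(c)) for c in combined])
```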
> Of course you cannot arbitrarily combine byte strings b (b.coding==None)
> with character strings s (s.coding!=None).
So what happens if you try to combine them?
> >2. Convert the result string to UTF-8. This is incompatible with
> > earlier Python versions.
> Or utf-16xx. I wonder how many mixed-encoding situations there
> are in earlier code. Single-encoding should not require change
> of encoding, so it should look like plain concatenation as far
> as the byte sequence part is concerned. It might be mostly
> transparent.
This approach is incompatible with earlier Python versions even for a
single encoding. If I have a KOI-8R s='\xf6' (which is the same as
U+0416), and UTF-16 is the internal representation, and I do s[0], what
do I get, and what algorithm is used to compute that result?
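The incompatibility can be made concrete in Python 3 terms: under the old byte semantics s[0] is the byte 0xF6, while under character semantics (decoding as KOI8-R) it is the character Ж.

```python
s = b'\xf6'  # Python 2's str was a byte string like this

# Byte semantics (old behaviour): the first element is the byte 0xF6.
print(s[0])                  # 246

# Character semantics after decoding as KOI8-R: the first element is 'Ж'.
u = s.decode('koi8-r')
print(u[0], hex(ord(u[0])))  # Ж 0x416
```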
> socket_or_file.read().coding => None
>
> unless some encoding was specified in the opening operation.
So *all* existing socket code would get byte strings, and so would all
existing file I/O. You will break a lot of code.
> Remember, if there is encoding, we are semantically dealing with
> character sequences, so splicing has to be implemented in terms of
> characters, however represented.
You never mentioned that you expect indexing to operate on characters,
not bytes. That would be incompatible with current Python, so I was
assuming that you could not possibly suggest that approach.
If I summarize your approach:
- conversion to an internal representation based on UTF-16
- indexing based on characters, not bytes
I arrive at the current Unicode type. So what you want is already
implemented, except for the meaningless 'coding' attribute (it is
meaningless, as it does not describe a property of the string object).
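Indeed, modern Python's str (the successor of the Unicode type) behaves exactly this way; a minimal illustration, decoding each byte string with its known encoding and then combining:

```python
# Decode each byte string with its known encoding, then combine:
a = b'\xf6'.decode('latin-1')  # 'ö'
b = b'\xf6'.decode('koi8-r')   # 'Ж'
text = a + b

print(len(text))  # 2 -- length and indexing count characters
print(text[1])    # 'Ж'
```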
> >No, in current Python, there is no doubt about the semantics: We
> >assume *nothing* about the encoding. Instead, if s1 and s2 are <type
> ^^^^^^^^^^^^^^^^-- If that is so, why does str have an encode method?
By mistake, IMO. Marc-Andre Lemburg suggested this as a generalization
of Unicode encodings, allowing arbitrary objects to be encoded - he
would have considered (3).encode('decimal') a good idea. With the
current encode method on string objects, you can do things like
s.encode('base64').
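In Python 2, byte strings really did accept s.encode('base64'); in Python 3 that generalized use survives only through the bytes-to-bytes codecs reachable via codecs.encode and codecs.decode. A small illustration:

```python
import codecs

# Python 2 allowed s.encode('base64') on byte strings. In Python 3 the
# bytes-to-bytes codecs are reached through codecs.encode/decode instead.
data = b'hello'
enc = codecs.encode(data, 'base64')
print(enc)  # b'aGVsbG8=\n'
assert codecs.decode(enc, 'base64') == data
```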
> That's supposed to go from character entities to bytes, I thought ;-)
In a specific case of character codecs, yes. However, this has
(unfortunately) been generalized to arbitrary two-way conversion
between arbitrary things.
> Which is why I thought some_string.coding attributes to carry that
> information explicitly would be a good idea.
Yes, it sounds like a good idea. Unfortunately, it is not
implementable in a meaningful way.
Regards,
Martin
More information about the Python-list
mailing list