differences between 22 and python 23

Mon Dec 8 12:18:49 EST 2003

bokr at oz.net (Bengt Richter) writes:

> >bokr at oz.net (Bengt Richter) writes:
> >
> >> Ok, I'm happy with that. But let's see where the errors come from.
> >> By definition it's from associating the wrong encoding assumption
> >> with a pure byte sequence. 
> >
> >Wrong. Errors may also happen when performing unexpected conversions
> >from one encoding to a different one.
> ISTM that could only happen e.g. if you explicitly called codecs to
> convert between incompatible encodings. 

No. It could also happen when you concatenate strings with
incompatible encodings.
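
A minimal sketch of that case, with sample strings chosen only for
illustration. No codec is called on the concatenation itself, yet the
result cannot be read correctly under either encoding:

```python
# Two byte strings holding text in different 8-bit encodings.
s1 = u"\u00e9t\u00e9".encode("latin-1")             # French "ete" with accents
s2 = u"\u043b\u0435\u0442\u043e".encode("koi8-r")   # Russian "leto"

# Plain byte concatenation; nothing here converts between encodings.
s3 = s1 + s2

original = u"\u00e9t\u00e9\u043b\u0435\u0442\u043e"

# Both decodes succeed (every byte is valid in both code pages),
# but neither one recovers the original text.
as_latin1 = s3.decode("latin-1")
as_koi8r = s3.decode("koi8-r")
```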

>    s3 = (s1.bytes().decode(s1.coding) + s2.bytes().decode(s2.coding)).encode(cenc(s1.coding, s2.coding))

So what happens if either s1.coding or s2.coding is None?

>     def cenc(enc1, enc2):
>         """return common encoding"""
>         if enc1==enc2: return enc1 # this makes latin-1 + latin-1 => latin-1, etc.
>         if enc1 is None: enc1 = 'ascii' # notorious assumption ;-)
>         if enc2 is None: enc2 = 'ascii' # ditto
>         if enc1[:3] == 'utf': return enc1 # preserve unicode encoding format of s1 preferentially
>         return 'utf' # generic system utf encoding

It would be better to call that utf-8, as utf is an unfortunate
alias...

So concatenating latin-1 and koi8-r strings would give a utf-8
string, as would concatenating an ascii string and a latin-1 string.
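
These outcomes can be checked with a runnable rendering of the quoted
cenc. This is a sketch under assumptions: concat is a hypothetical
helper working on (bytes, coding) pairs, 'utf-8' is spelled out in
place of the 'utf' alias, and returning plain bytes when both codings
are None is my guess, since the proposal does not say.

```python
def cenc(enc1, enc2):
    """Return a common encoding, following the quoted proposal."""
    if enc1 == enc2:
        return enc1              # latin-1 + latin-1 => latin-1, etc.
    if enc1 is None:
        enc1 = "ascii"
    if enc2 is None:
        enc2 = "ascii"
    if enc1[:3] == "utf":
        return enc1              # keep the unicode format of the left operand
    return "utf-8"               # spelled out instead of the alias 'utf'


def concat(b1, enc1, b2, enc2):
    """Hypothetical helper: concatenate two (bytes, coding) pairs."""
    common = cenc(enc1, enc2)
    if common is None:
        # Assumption: two untagged strings stay untagged raw bytes.
        return b1 + b2, None
    text = b1.decode(enc1 or "ascii") + b2.decode(enc2 or "ascii")
    return text.encode(common), common


latin = u"\u00e9".encode("latin-1")
koi = u"\u043b".encode("koi8-r")
mixed, mixed_coding = concat(latin, "latin-1", koi, "koi8-r")
plain, plain_coding = concat(b"abc", "ascii", latin, "latin-1")
```

Here mixed_coding and plain_coding both come out as 'utf-8', so even
an ascii string combined with a latin-1 string changes representation.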

> But this is not an error. An error would only arise if one tried to use
> the bytes as characters without specifying a decoding.

So

print open("/etc/passwd").read()

would raise an exception???

> >Unfortunately, not. You seem to assume that nearly all strings have
> >encoding information attached, but you don't explain where you expect
> >this information to come from.

> Strings appearing as literals in program sources will be assumed to
> have the same encoding as is assumed or explicitly specified for the
> source text.  IMO that will cover a lot of strings not now covered,
> and will be an improvement even if it doesn't cover everything.

I doubt that. Operations will decay to "no encoding" very quickly, or
give exceptions - depending on your (yet unclear) specification.

> >??? What is a "possibly heterogeneous representation", how do I
> >implement it, and how do I use it?
> See example s3 = s1 + s2 above.

In what sense is the resulting representation heterogeneous? ISTM that
the result uses cenc(s1.coding, s2.coding) as its representation.

> s[0] and s[1] create new encoded strings if they are indexing
> encoded strings, and preserve the .coding info. So e.g., in general,
> when .coding is not None,
> 
>     s[i] <-> s.decode(s.coding)[i].encode(s.coding)

So if s.coding doesn't round-trip, s[i].bytes() would not be a
substring of s.bytes(), right?
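
A related case that is easy to reproduce is a codec whose encoder adds
a prefix. The sketch below applies the quoted rule by hand, using
utf-16 (which writes a byte order mark on every encode) as the assumed
value of s.coding:

```python
s_text = u"ab"
coding = "utf-16"                 # encoder prepends a byte order mark
s_bytes = s_text.encode(coding)   # BOM + "a" + "b"

# s[1] under the quoted rule: decode, index, re-encode.
item_bytes = s_bytes.decode(coding)[1].encode(coding)   # BOM + "b"

# The re-encoded item carries its own BOM, so its bytes do not
# occur anywhere inside the bytes of the original string.
is_substring = item_bytes in s_bytes
```

Here is_substring is false, although the character itself survives the
round trip.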

> In other words, when s.coding is not None, you can think of all the
> possibilities as alternative representations of
> s.bytes().decode(s.coding) where .bytes() is a method to get the raw
> str bytes of the particular encoding, and even if s.coding is None,
> you could use the virtual 'bytes' character set default assumption,
> so that all strings have a character interpretation if needed.

So what is the difference between this type and the unicode type?  It
appears that indexing works the same in your string type and in the
Unicode type; the only change is that where you say .bytes(), one
says .encode(encname) on a Unicode object.

> >So *all* existing socket code would get byte strings, and so would all
> >existing file I/O. You will break a lot of code.
> Why?

Because people try to combine such strings with strings with encoding
information.

You haven't specified yet what happens when you try to do this, but it
appears that you are proposing that one gets an exception.

Regards,
Martin