differences between 22 and python 23

Sat Dec 6 12:20:57 EST 2003

bokr at oz.net (Bengt Richter) writes:

> >> Why, when unicode includes all?
> >
> >Because at the end, you would produce a byte string. Then the question
> >is what type the byte string should have.
> Unicode, of course, unless that coercion was not necessary, as in ascii+ascii
> or latin-1 + latin-1, etc., where the result could retain the more specific
> encoding attribute.

I meant to write "what *encoding* the byte string should have". Unicode
is not an encoding.

> Why not assume latin-1, if it's just a convenience assumption for certain
> contexts? I suspect it would be right more often than not, given that for
> other cases explicit unicode or decode/encode calls would probably be used.

This was by BDFL pronouncement, and I agree with that decision. I
personally would have favoured UTF-8 as system encoding in Python, as
it would support all languages, and would allow for as few mistakes
as ASCII (e.g. you can't mistake a Latin-1 or KOI8-R string for UTF-8).
I would consider choosing Latin-1 euro-centric, and it would
silently do the wrong thing if the actual encoding was something else.

Errors should never pass silently.
Unless explicitly silenced.
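
To illustrate the difference, here is a minimal sketch. The sample
strings and the helper decodes_as() are made up for this example, and it
assumes a Python in which byte strings have a .decode() method. For
typical text, UTF-8 rejects bytes that were written in another encoding,
whereas Latin-1 accepts any byte sequence and silently yields the wrong
characters.

```python
# Sample data, chosen only for illustration.
latin1_bytes = u"Martin L\u00f6wis".encode("latin-1")
koi8_text = u"\u041c\u0430\u0440\u0442\u0438\u043d"  # "Martin" in Cyrillic
koi8_bytes = koi8_text.encode("koi8-r")


def decodes_as(data, encoding):
    """Hypothetical helper: True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# UTF-8 notices that these bytes are not UTF-8 ...
utf8_accepts_latin1 = decodes_as(latin1_bytes, "utf-8")
utf8_accepts_koi8 = decodes_as(koi8_bytes, "utf-8")

# ... while Latin-1 accepts the KOI8-R bytes and gives back garbage.
latin1_accepts_koi8 = decodes_as(koi8_bytes, "latin-1")
misread = koi8_bytes.decode("latin-1")
```

Here misread is a string of accented Latin letters, not the Cyrillic
text that was encoded, and no error is reported along the way.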

> >name = u'Martin Löwis'
> >print name
> Right, but that is a workaround w.r.t the possibility I am trying to
> discuss.

The problem is that the possibility is not a possibility. What you
propose just cannot be implemented in a meaningful way. If you don't
believe me, please try implementing it yourself, and I'll show you the
problems of your implementation.

Using Unicode objects to represent characters is not a work-around, it
is the solution.

> Care to elaborate? I don't know what difficult questions nor
> non-intuitive behavior you have in mind, but I am probably not the
> only one who is curious ;-)

As I said: What would be the meaning of concatenating strings, if both
strings have different encodings?

I see three possible answers to this question, all non-intuitive:
1. Choose one of the encodings, and convert the other string to
   that encoding. This has these problems:
   a) neither encoding might be capable of representing all characters
      of the result string. There are several ways to deal with this
      case; finding them is left as an exercise to the reader.
   b) it would be incompatible with prior versions, as it would
      not be a plain byte concatenation.
2. Convert the result string to UTF-8. This is incompatible with
   earlier Python versions.
3. Consider the result as having "no encoding". This would render
   the entire feature useless, as string data would degrade to
   "no encoding" very quickly. This, in turn, would lead to "strange"
   errors, as sometimes, printing a string works fine, but seemingly
   randomly, it fails.
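
The first two answers can be sketched as follows. The strings and the
helper can_encode() are invented for this example; nothing here is an
existing or proposed API. It shows that neither source encoding can hold
the combined text, and that converting to UTF-8 gives a result different
from today's plain byte concatenation.

```python
german = u"L\u00f6wis"
russian = u"\u041c\u0430\u0440\u0442\u0438\u043d"

s1 = german.encode("latin-1")   # byte string "tagged" Latin-1
s2 = russian.encode("koi8-r")   # byte string "tagged" KOI8-R

# What concatenation does today: the bytes are simply joined.
joined = s1 + s2


def can_encode(text, encoding):
    """Hypothetical helper: True if text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


combined = german + russian

# Answer 1: pick one of the two encodings. Neither can hold the result.
fits_latin1 = can_encode(combined, "latin-1")
fits_koi8 = can_encode(combined, "koi8-r")

# Answer 2: convert to UTF-8. This works, but the bytes differ from
# the plain concatenation that existing code relies on.
fits_utf8 = can_encode(combined, "utf-8")
utf8_result = combined.encode("utf-8")
```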

Also, what would be the encoding of strings returned from file.read(),
socket.recv(), etc.?

Also, what would be the encoding of strings created as a result of
slice operations? What if the slice hits the middle of a multi-byte
character?
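
A minimal sketch of that last case, using UTF-8 as the multi-byte
encoding (the sample string is made up for illustration): taking the
first two bytes cuts a two-byte character in half, and the result is no
longer valid in the encoding the original string had.

```python
utf8_bytes = u"L\u00f6wis".encode("utf-8")  # o-umlaut takes two bytes

# Cut after the second byte, i.e. in the middle of the o-umlaut.
head = utf8_bytes[:2]

try:
    head.decode("utf-8")
    head_is_valid = True
except UnicodeDecodeError:
    # The slice ends with an incomplete multi-byte sequence.
    head_is_valid = False
```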

> No, I know that ;-) But I don't know how you are going to migrate towards
> a more pervasive use of unicode in all the '...' contexts. Whether at
> some point unicode will be built into cpython as the C representation
> of all internal strings

Unicode is not a representation of byte strings, so this cannot
happen.

> or it will use unicode through unicode objects
> and their interfaces, which I imagine would be the way it started.

Yes, all library functions that expect strings should support Unicode
objects. Ideally, all library functions that return strings should
return Unicode objects, but this raises backwards compatibility
issues. For the APIs where this matters much, transition mechanisms
are in progress.

> Memory-limited implementations might want to make different choices IWG,
> so the cleaner the python-unicode relationship the freer those choices
> are likely to be IWT.

I'm not too concerned with memory-limited implementations. It would be
feasible to re-implement the Unicode type to use UTF-8 as its internal
representation, but that would be tedious to do on the C level, and it
would lead to really bad performance, given that slicing and indexing
become inefficient.
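
To see why indexing becomes inefficient, here is a rough sketch of
looking up the n-th character in UTF-8 data. The function utf8_char_at()
is hypothetical and written only for this illustration; it is not how
the Unicode type is implemented. Because characters have variable width,
the lookup has to scan from the start instead of jumping to an offset.

```python
def utf8_char_at(data, index):
    """Return character number `index` of UTF-8 encoded `data`.

    Hypothetical helper: it walks the bytes from the beginning, so the
    cost grows with the index rather than being constant.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    start = None
    count = -1
    for pos, byte in enumerate(bytearray(data)):
        # Continuation bytes look like 10xxxxxx; anything else starts
        # a new character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    return data[start:].decode("utf-8")


encoded = u"L\u00f6wis".encode("utf-8")
second = utf8_char_at(encoded, 1)
last = utf8_char_at(encoded, 4)
```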

> >>     import m1,m2
> >>     print 'm1: %r, m2: %r' % (m1.s1, m2.s2)
> >> 
> >> might have ill-defined meaning
> >
> >That is just one of the problems you run into when associating
>                                                    ^--not ;-)
> >encodings with strings. Fortunately, there is no encoding associated
> >with a byte string.
> So assume ascii, after having stripped away better knowledge?

No, in current Python, there is no doubt about the semantics: We
assume *nothing* about the encoding. Instead, if s1 and s2 are <type
'str'>, we treat them as byte strings. This means that bytes 0..31 and
128..255 are escaped, with special escapes applying to 10, 13, ...,
and bytes 34 and 39.
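
A small sketch of what that means for %r (the sample bytes are made up
for illustration): the repr of a byte string shows non-ASCII and control
bytes as escapes, without assuming any encoding. The exact quoting of
the result depends on the Python version, so only the escapes are
checked here.

```python
# One non-ASCII byte (0xf6) and one control byte (10, newline).
data = u"L\u00f6wis\n".encode("latin-1")

shown = "%r" % (data,)

has_hex_escape = "\\xf6" in shown       # byte 0xf6 shown as \xf6
has_newline_escape = "\\n" in shown     # byte 10 shown as \n
has_raw_newline = "\n" in shown         # no literal newline remains
```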

> It's fine to have a byte type with no encoding associated. But
> unfortunately ISTM str instances seem to be playing a dual role as
> ascii-encoded strings and byte strings. More below.

No. They actually play a dual role as byte strings and somehow-encoded
strings, depending on the application. In many applications, that
encoding is the locale's encoding, but in internet applications, you
often have to handle multiple encodings in a single run of the
program.

> How will the following look when s == '...' becomes effectively s =
> u'...' per above?

I don't know. Because this question is difficult to answer, that
change cannot be made in the near future. It might be reasonable to
have str() return Unicode objects - with another builtin to generate
byte strings.

> BTW, is '...' =(effectively)= u'...' slated for a particular future
> python version?

No. Try running your favourite application with -U, and see what
happens. For Python 2.3, I managed to get python -U to at least enter
interactive mode - in 2.2, importing site.py will fail, trying
to put Unicode objects on sys.path.

Regards,
Martin