[Python-Dev] just say no...

M.-A. Lemburg mal@lemburg.com
Sun, 14 Nov 1999 23:11:54 +0100


Guido van Rossum wrote:
> 
> I think I have a reasonable grasp of the issues here, even though I
> still haven't read about 100 msgs in this thread.  Note that t# and
> the charbuffer addition to the buffer API were added by Greg Stein
> with my support; I'll attempt to reconstruct our thinking at the
> time...
>
> [MAL]
> > Let me summarize a bit on the general ideas behind "s", "s#"
> > and the extra buffer:
> 
> I think you left out t#.

On purpose -- according to my thinking. I see "t#" as an interface
to bf_getcharbuf, which I understand to be an 8-bit character
buffer... UTF-8 is a multi-byte encoding. It is still character
data, but a single character is not necessarily 8 bits long (up to
24 bits are used).

Anyway, I'm not really interested in having an argument about
this. If you say "t#" fits the purpose, then that's fine with
me. Still, we should clearly define that "t#" returns text data
and "s#" binary data; encoding, bit length, etc. should explicitly
be left undefined.
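
To make this concrete, here's a minimal sketch of how an extension
module might use the two markers under the proposed split (the
function names are made up for illustration):

    #include "Python.h"

    /* Hypothetical functions illustrating the split: "t#" fetches
       text data, "s#" fetches raw binary data. */

    static PyObject *
    example_text_len(PyObject *self, PyObject *args)
    {
        char *text;
        int len;

        /* "t#" -> text data (via bf_getcharbuffer); encoding,
           bit length etc. deliberately left undefined */
        if (!PyArg_ParseTuple(args, "t#", &text, &len))
            return NULL;
        return PyInt_FromLong((long)len);
    }

    static PyObject *
    example_binary_len(PyObject *self, PyObject *args)
    {
        char *data;
        int len;

        /* "s#" -> raw binary data (via bf_getreadbuffer); may
           contain embedded null bytes */
        if (!PyArg_ParseTuple(args, "s#", &data, &len))
            return NULL;
        return PyInt_FromLong((long)len);
    }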

> > First, we have a general design question here: should old code
> > become Unicode compatible or not. As I recall the original idea
> > about Unicode integration was to follow Perl's idea to have
> > scripts become Unicode aware by simply adding a 'use utf8;'.
> 
> I've never heard of this idea before -- or am I taking it too literally?
> It smells of a mode to me :-)  I'd rather live in a world where
> Unicode just works as long as you use u'...' literals or whatever
> convention we decide.
> 
> > If this is still the case, then we'll have to come up with a
> > reasonable approach for integrating the classical string-based
> > APIs with the new type.
> >
> > Since UTF-8 is a standard (some, e.g. the Latin-1 folks, would
> > probably prefer UTF-7,5) which has some very nice features (see
> > http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> > this encoding seems the best fit for the purpose.
> 
> Yes, especially if we fix the default encoding as UTF-8.  (I'm
> expecting feedback from HP on this next week, hopefully when I see the
> details, it'll be clear that we don't need a per-thread default encoding
> to solve their problems; that's quite a likely outcome.  If not, we
> have a real-world argument for allowing a variable default encoding,
> without carnage.)

Fair enough :-)
 
> > However, one should not forget that UTF-8 is in fact a
> > variable-length encoding of Unicode characters, i.e. up to
> > 3 bytes form a *single* character. This is obviously not compatible
> > with definitions that explicitly state the data to be using an
> > 8-bit single-character encoding, e.g. indexing in UTF-8 doesn't
> > work like it does in Latin-1 text.
> 
> Sure, but where in current Python are there such requirements?

It was my understanding that "t#" refers to single-byte character
data. That's what the above arguments were aiming at...
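
A tiny standalone example of the problem (plain C, no Python
involved): byte offsets into a UTF-8 buffer do not correspond to
character positions the way they do in Latin-1 text:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "Häuser": 6 characters, but 7 bytes in UTF-8, since
           'ä' (U+00E4) encodes as the two bytes 0xC3 0xA4. */
        const char *utf8 = "H\xc3\xa4user";

        printf("byte length: %d\n", (int)strlen(utf8));  /* 7 */
        printf("second byte: 0x%02X\n", (unsigned char)utf8[1]);
        /* prints 0xC3: half of 'ä', not the second character */
        return 0;
    }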
 
> > So if we are to do the integration, we'll have to choose
> > argument parser markers that allow for multi byte characters.
> > "t#" does not fall into this category, "s#" certainly does,
> > "s" is argueable.
> 
> I disagree.  I grepped through the source for s# and t#.  Here's a bit
> of background.  Before t# was introduced, s# was being used for two
> distinct purposes: (1) to get an 8-bit text string plus its length, in
> situations where the length was needed; (2) to get binary data (e.g.
> GIF data read from a file in "rb" mode).  Greg pointed out that if we
> ever introduced some form of Unicode support, these two had to be
> disambiguated.  We found that the majority of uses was for (2)!
> Therefore we decided to change the definition of s# to mean only (2),
> and introduced t# to mean (1).  Also, we introduced getcharbuffer
> corresponding to t#, while getreadbuffer was meant for s#.

I know it's too late now, but I can't really follow the arguments
here: in what way are (1) and (2) different from the implementation's
point of view? If "t#" is to return UTF-8, then <length of the
buffer> will not equal <text length>, so both parser markers return
essentially the same information. The only difference is on the
semantic side: (1) means "give me text data", while (2) does not
specify the data type.

Perhaps I'm missing something...
 
> Note that the definition of the 's' format was left alone -- as
> before, it means you need an 8-bit text string not containing null
> bytes.

This definition should then be changed to "text string without
null bytes", dropping the 8-bit reference.
 
> Our expectation was that a Unicode string passed to an s# situation
> would give a pointer to the internal format plus a byte count (not a
> character count!) while t# would get a pointer to some kind of 8-bit
> translation/encoding plus a byte count, with the explicit requirement
> that the 8-bit translation would have the same lifetime as the
> original unicode object.  We decided to leave it up to the next
> generation (i.e., Marc-Andre :-) to decide what kind of translation to
> use and what to do when there is no reasonable translation.

Hmm, I would strongly object to making "s#" return the internal
format. file.write() would then default to writing UTF-16 data
instead of UTF-8 data. This could result in strange errors
due to the UTF-16 format being endian-dependent.

It would also break the symmetry between file.write(u) and
unicode(file.read()), since the default encoding is not used as
internal format for other reasons (see proposal).
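
To illustrate the endianness point, here are the two byte orders
for a single character (a sketch; writing the internal format
straight to a file would leak the host's byte order into the data):

    #include <stdio.h>

    int main(void)
    {
        /* U+00E4 ('ä') in UTF-16, both byte orders. A file written
           on a little-endian machine would not read back correctly
           on a big-endian one without byte-order handling. */
        unsigned char utf16_le[2] = { 0xE4, 0x00 };  /* little-endian */
        unsigned char utf16_be[2] = { 0x00, 0xE4 };  /* big-endian */

        printf("LE: %02X %02X\n", utf16_le[0], utf16_le[1]);
        printf("BE: %02X %02X\n", utf16_be[0], utf16_be[1]);
        return 0;
    }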

> Any of the following choices is acceptable (from the point of view of
> not breaking the intended t# semantics; we can now start deciding
> which we like best):

I think we have already agreed on using UTF-8 for the default
encoding. It has quite a few advantages. See

	http://czyborra.com/utf/

for a good overview of the pros and cons.

> - utf-8
> - latin-1
> - ascii
> - shift-jis
> - lower byte of unicode ordinal
> - some user- or os-specified multibyte encoding
> 
> As far as t# is concerned, for encodings that don't encode all of
> Unicode, untranslatable characters could be dealt with in any number
> of ways (raise an exception, ignore, replace with '?', make best
> effort, etc.).

The usual Python way would be: raise an exception. This is what
the proposal defines for Codecs in case an encoding/decoding
mapping is not possible, BTW. (UTF-8 will always succeed on
output.)
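
At the C level this would look roughly as follows (a sketch,
assuming encoding APIs along the lines of the proposal, with a
NULL errors argument meaning "strict"):

    #include "Python.h"

    /* Sketch: encoding a character that has no mapping in the
       target encoding raises an exception instead of silently
       mangling the data. */
    static PyObject *
    example_strict_encode(PyObject *self, PyObject *args)
    {
        PyObject *u, *s;

        /* U+20AC (EURO SIGN), three bytes in UTF-8 */
        u = PyUnicode_DecodeUTF8("\xe2\x82\xac", 3, NULL);
        if (u == NULL)
            return NULL;

        /* no ASCII mapping exists -> returns NULL with a
           UnicodeError set */
        s = PyUnicode_AsEncodedString(u, "ascii", NULL);
        Py_DECREF(u);
        return s;
    }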
 
> Given the current context, it should probably be the same as the
> default encoding -- i.e., utf-8.  If we end up making the default
> user-settable, we'll have to decide what to do with untranslatable
> characters -- but that will probably be decided by the user too (it
> would be a property of a specific translation specification).
> 
> In any case, I feel that t# could receive a multi-byte encoding,
> s# should receive raw binary data, and they should correspond to
> getcharbuffer and getreadbuffer, respectively.

Why would you want "s#" to return the raw binary data for
Unicode objects?

Note that it is not mentioned anywhere that "s#" and "t#"
necessarily have to return different things (binary being a
superset of text). I'd opt for "s#" and "t#" both returning
UTF-8 data. This could be implemented by delegating the buffer
slots to the <defencstr> object (see below).
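
Roughly like this (a sketch only: the helper that creates
<defencstr> is hypothetical, and the slot signature follows the
current buffer interface):

    #include "Python.h"

    /* Hypothetical helper: returns the cached <defencstr> string
       object for a Unicode object, creating it on first use (see
       the sketch further below). */
    extern PyObject *unicode_defencstr(PyObject *unicode);

    /* bf_getcharbuffer slot: hand out the UTF-8 bytes of
       <defencstr>, which lives exactly as long as the Unicode
       object itself. */
    static int
    unicode_getcharbuf(PyObject *self, int segment, char **ptr)
    {
        PyObject *defencstr;

        if (segment != 0) {
            PyErr_SetString(PyExc_SystemError,
                            "accessing non-existent buffer segment");
            return -1;
        }
        defencstr = unicode_defencstr(self);
        if (defencstr == NULL)
            return -1;
        *ptr = PyString_AS_STRING(defencstr);
        return PyString_GET_SIZE(defencstr);
    }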

> > Now Greg would chime in with the buffer interface and
> > argue that it should make the underlying internal
> > format accessible. This is a bad idea, IMHO, since you
> > shouldn't really have to know what the internal data format
> > is.
> 
> This is for C code.  Quite likely it *does* know what the internal
> data format is!

C code can use the PyUnicode_* APIs to access the data. I
don't think that argument parsing is powerful enough to
provide the C code with enough information about the data
contents, e.g. it can only report the length of the encoded
buffer, not the character length of the string.
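
For instance (a sketch, assuming the "U" parser marker for Unicode
objects and the PyUnicode_* names from the proposal), the API can
answer character-level questions that no buffer marker can:

    #include "Python.h"

    /* Sketch: argument parsing only yields an encoded buffer plus
       its byte length; the PyUnicode_* API can report the number
       of characters directly. */
    static PyObject *
    example_char_count(PyObject *self, PyObject *args)
    {
        PyObject *u;

        if (!PyArg_ParseTuple(args, "U", &u))
            return NULL;
        /* character count, independent of any encoding's length */
        return PyInt_FromLong((long)PyUnicode_GET_SIZE(u));
    }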
 
> > Defining "s#" to return UTF-8 data does not only
> > make "s" and "s#" return the same data format (which should
> > always be the case, IMO),
> 
> That was before t# was introduced.  No more, alas.  If you replace s#
> with t#, I agree with you completely.

Done :-)
 
> > but also hides the internal
> > format from the user and gives him a reliable cross-platform
> > data representation of Unicode data (note that UTF-8 doesn't
> > have the byte order problems of UTF-16).
> >
> > If you are still with me, let's look at what "s" and "s#"
> 
> (and t#, which is more relevant here)
> 
> > do: they return pointers into data areas which have to
> > be kept alive until the corresponding object dies.
> >
> > The only way to support this feature is by allocating
> > a buffer for just this purpose (on the fly and only if
> > needed to prevent excessive memory load). The other
> > options of adding new magic parser markers or switching
> > to a more generic one all have one downside: you need to
> > change existing code, which is in conflict with the idea
> > we started out with.
> 
> Agreed.  I think this was our thinking when Greg & I introduced t#.
> My own preference would be to allocate a whole string object, not
> just a buffer; this could then also be used for the .encode() method
> using the default encoding.

Good point. I'll change <defencbuf> to <defencstr>, a Python
string object created on request.
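
Something along these lines (a sketch; the defencstr field on the
Unicode object is hypothetical, created lazily and released only
when the Unicode object dies, so that buffer pointers handed out
above stay valid):

    #include "Python.h"

    /* Sketch of the lazily created <defencstr> object: a genuine
       Python string holding the default-encoded (UTF-8) form,
       cached on the Unicode object.  The 'defencstr' field is
       hypothetical; the deallocator would have to DECREF it. */
    PyObject *
    unicode_defencstr(PyObject *unicode)
    {
        PyUnicodeObject *u = (PyUnicodeObject *)unicode;

        if (u->defencstr == NULL)
            u->defencstr = PyUnicode_AsUTF8String(unicode);
        return u->defencstr;   /* borrowed reference; NULL on error */
    }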
 
> > So, again, the question is: do we want this magical
> > integration or not? Note that this is a design question,
> > not one of memory consumption...
> 
> Yes, I want it.
> 
> Note that this doesn't guarantee that all old extensions will work
> flawlessly when passed Unicode objects; but I think that it covers
> most cases where you could have a reasonable expectation that it
> works.
> 
> (Hm, unfortunately many reasonable expectations seem to involve
> the current user's preferred encoding. :-( )

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    47 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/