[Python-Dev] just say no...

Guido van Rossum guido@CNRI.Reston.VA.US
Mon, 15 Nov 1999 10:50:24 -0500


> On purpose -- according to my thinking. I see "t#" as an interface
> to bf_getcharbuf which I understand as 8-bit character buffer...
> UTF-8 is a multi byte encoding. It still is character data, but
> not necessarily 8 bits in length (up to 24 bits are used).
> 
> Anyway, I'm not really interested in having an argument about
> this. If you say, "t#" fits the purpose, then that's fine with
> me. Still, we should clearly define that "t#" returns
> text data and "s#" binary data. Encoding, bit length, etc. should
> explicitly be left undefined.

Thanks for not picking an argument.  Multibyte encodings typically
have ASCII as a subset (in such a way that an ASCII string is
represented as itself in bytes).  This is the characteristic that's
needed in my view.
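
That ASCII-as-itself property can be checked directly; here is a minimal sketch in modern Python terms (codec names are today's spellings, not part of the original proposal):

```python
# ASCII text is represented by the same bytes under common
# multibyte encodings -- the property the paragraph above relies on.
ascii_text = "plain ASCII"
for encoding in ("utf-8", "shift-jis"):
    # Each encoding maps pure-ASCII input to the identical byte string.
    assert ascii_text.encode(encoding) == ascii_text.encode("ascii")
```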

> > > First, we have a general design question here: should old code
> > > become Unicode compatible or not. As I recall the original idea
> > > about Unicode integration was to follow Perl's idea to have
> > > scripts become Unicode aware by simply adding a 'use utf8;'.
> > 
> > I've never heard of this idea before -- or am I taking it too literally?
> > It smells of a mode to me :-)  I'd rather live in a world where
> > Unicode just works as long as you use u'...' literals or whatever
> > convention we decide.
> > 
> > > If this is still the case, then we'll have to come with a
> > > reasonable approach for integrating classical string based
> > > APIs with the new type.
> > >
> > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> > > the Latin-1 folks) which has some very nice features (see
> > > http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> > > this encoding seems best fit for the purpose.
> > 
> > Yes, especially if we fix the default encoding as UTF-8.  (I'm
> > expecting feedback from HP on this next week, hopefully when I see the
> > details, it'll be clear that we don't need a per-thread default encoding
> > to solve their problems; that's quite a likely outcome.  If not, we
> > have a real-world argument for allowing a variable default encoding,
> > without carnage.)
> 
> Fair enough :-)
>  
> > > However, one should not forget that UTF-8 is in fact a
> > > variable-length encoding of Unicode characters; that is, up to
> > > 3 bytes form a *single* character. This is obviously not compatible
> > > with definitions that explicitly state data to be using an
> > > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> > > work like it does in Latin-1 text.
> > 
> > Sure, but where in current Python are there such requirements?
> 
> It was my understanding that "t#" refers to single byte character
> data. That's where the above arguments were aiming at...

t# refers to byte-encoded data.  Multibyte encodings are explicitly
designed to be passed cleanly through processing steps that handle
single-byte character data, as long as they are 8-bit clean and don't
do too much processing.
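
A small illustration, in modern Python terms, of the byte-count/character-count mismatch and of why an 8-bit-clean pass-through still works (a sketch, not part of the original discussion):

```python
s = "héllo"                # one non-ASCII character
data = s.encode("utf-8")   # the multibyte byte-stream view

assert len(s) == 5         # 5 characters...
assert len(data) == 6      # ...but 6 bytes: 'é' occupies two bytes in UTF-8

# As long as intermediate code merely carries the bytes (is "8-bit
# clean"), the text survives unchanged and decodes back to the original.
assert data.decode("utf-8") == s
```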

> > > So if we are to do the integration, we'll have to choose
> > > argument parser markers that allow for multi byte characters.
> > > "t#" does not fall into this category, "s#" certainly does,
> > > "s" is argueable.
> > 
> > I disagree.  I grepped through the source for s# and t#.  Here's a bit
> > of background.  Before t# was introduced, s# was being used for two
> > distinct purposes: (1) to get an 8-bit text string plus its length, in
> > situations where the length was needed; (2) to get binary data (e.g.
> > GIF data read from a file in "rb" mode).  Greg pointed out that if we
> > ever introduced some form of Unicode support, these two had to be
> > disambiguated.  We found that the majority of uses was for (2)!
> > Therefore we decided to change the definition of s# to mean only (2),
> > and introduced t# to mean (1).  Also, we introduced getcharbuffer
> > corresponding to t#, while getreadbuffer was meant for s#.
> 
> I know it's too late now, but I can't really follow the arguments
> here: in what ways are (1) and (2) different from the implementation's
> point of view? If "t#" is to return UTF-8 then <length of the 
> buffer> will not equal <text length>, so both parser markers return
> essentially the same information. The only difference would be
> on the semantic side: (1) means: give me text data, while (2) does
> not specify the data type.
> 
> Perhaps I'm missing something...

The idea is that (1)/s# disallows any translation of the data, while
(2)/t# requires translation of the data to an ASCII superset (possibly
multibyte, such as UTF-8 or shift-JIS).  (2)/t# assumes that the data
contains text and that if the text consists of only ASCII characters
they are represented as themselves.  (1)/s# makes no such assumption.

In terms of implementation, Unicode objects should translate
themselves to the default encoding for t# (if possible), but they
should make the native representation available for s#.
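
The proposed split could be sketched with a toy object; everything here (class and method names included) is illustrative only, standing in for the C-level getcharbuffer/getreadbuffer slots, and the "native format" is arbitrarily taken to be UTF-16-LE:

```python
class ToyUnicode:
    """Toy model of the proposed t# / s# distinction for Unicode objects."""

    def __init__(self, text, default_encoding="utf-8"):
        self.text = text
        self.default_encoding = default_encoding

    def as_charbuffer(self):
        # What t#/getcharbuffer would see: text translated to the
        # default encoding.
        return self.text.encode(self.default_encoding)

    def as_readbuffer(self):
        # What s#/getreadbuffer would see: the raw internal
        # representation (UTF-16-LE chosen here purely for illustration).
        return self.text.encode("utf-16-le")

u = ToyUnicode("é")
assert u.as_charbuffer() == b"\xc3\xa9"   # UTF-8 bytes
assert u.as_readbuffer() == b"\xe9\x00"   # little-endian 16-bit unit
```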

For example, take an encryption engine.  While it is defined in terms
of byte streams, there's no requirement that the bytes represent
characters -- they could be the bytes of a GIF file, an MP3 file, or a
gzipped tar file.  If we pass Unicode to an encryption engine, we want
Unicode to come out at the other end, not UTF-8.  (If we had wanted to
encrypt UTF-8, we should have fed it UTF-8.)

> > Note that the definition of the 's' format was left alone -- as
> > before, it means you need an 8-bit text string not containing null
> > bytes.
> 
> This definition should then be changed to "text string without
> null bytes" dropping the 8-bit reference.

Aha, I think there's a confusion about what "8-bit" means.  For me, a
multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?
(As far as I know, C uses char* to represent multibyte characters.)
Maybe we should disambiguate it more explicitly?

> > Our expectation was that a Unicode string passed to an s# situation
> > would give a pointer to the internal format plus a byte count (not a
> > character count!) while t# would get a pointer to some kind of 8-bit
> > translation/encoding plus a byte count, with the explicit requirement
> > that the 8-bit translation would have the same lifetime as the
> > original unicode object.  We decided to leave it up to the next
> > generation (i.e., Marc-Andre :-) to decide what kind of translation to
> > use and what to do when there is no reasonable translation.
> 
> Hmm, I would strongly object to making "s#" return the internal
> format. file.write() would then default to writing UTF-16 data
> instead of UTF-8 data. This could result in strange errors
> due to the UTF-16 format being endian dependent.

But this was the whole design.  file.write() needs to be changed to
use s# when the file is open in binary mode and t# when the file is
open in text mode.

> It would also break the symmetry between file.write(u) and
> unicode(file.read()), since the default encoding is not used as
> internal format for other reasons (see proposal).

If the file is encoded using UTF-16 or UCS-2, you should open it in
binary mode and use unicode(file.read(), 'utf-16').  (Or perhaps the
app should read the first 2 bytes, check for a BOM, and then choose
between 'utf-16-be' and 'utf-16-le'.)
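
The BOM sniffing suggested above might look like this (the function name is made up for illustration; note that modern Python's plain 'utf-16' codec already consumes a leading BOM by itself):

```python
def sniff_utf16(data: bytes) -> str:
    """Pick a UTF-16 byte order from a leading byte-order mark."""
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    raise ValueError("no BOM; byte order must be known out of band")

payload = b"\xff\xfe" + "text".encode("utf-16-le")
assert sniff_utf16(payload) == "utf-16-le"
```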

> > Any of the following choices is acceptable (from the point of view of
> > not breaking the intended t# semantics; we can now start deciding
> > which we like best):
> 
> I think we have already agreed on using UTF-8 for the default
> encoding. It has quite a few advantages. See
> 
> 	http://czyborra.com/utf/
> 
> for a good overview of the pros and cons.

Of course.  I was just presenting the list as an argument that if
we changed our mind about the default encoding, t# should follow the
default encoding (and not pick an encoding by other means).

> > - utf-8
> > - latin-1
> > - ascii
> > - shift-jis
> > - lower byte of unicode ordinal
> > - some user- or os-specified multibyte encoding
> > 
> > As far as t# is concerned, for encodings that don't encode all of
> > Unicode, untranslatable characters could be dealt with in any number
> > of ways (raise an exception, ignore, replace with '?', make best
> > effort, etc.).
> 
> The usual Python way would be: raise an exception. This is what
> the proposal defines for Codecs in case an encoding/decoding
> mapping is not possible, BTW. (UTF-8 will always succeed on
> output.)

Did you read Andy Robinson's case study?  He suggested that for
certain encodings there may be other things you can do that are more
user-friendly than raising an exception, depending on the application.
I am proposing to leave this a detail of each specific translation.
There may even be translations that do the same thing except they have
a different behavior for untranslatable cases -- e.g. a strict version
that raises an exception and a non-strict version that replaces bad
characters with '?'.  I think this is one of the powers of having an
extensible set of encodings.
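
Exactly this strict-versus-lenient pair survives in today's codec machinery as error handlers; a modern sketch of the behaviors being argued for:

```python
text = "café"

# The strict variant: untranslatable characters raise an exception.
try:
    text.encode("ascii")
    raise AssertionError("should have raised")
except UnicodeEncodeError:
    pass

# The non-strict variant: bad characters are replaced with '?'.
assert text.encode("ascii", "replace") == b"caf?"
```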

> > Given the current context, it should probably be the same as the
> > default encoding -- i.e., utf-8.  If we end up making the default
> > user-settable, we'll have to decide what to do with untranslatable
> > characters -- but that will probably be decided by the user too (it
> > would be a property of a specific translation specification).
> > 
> > In any case, I feel that t# could receive a multi-byte encoding,
> > s# should receive raw binary data, and they should correspond to
> > getcharbuffer and getreadbuffer, respectively.
> 
> Why would you want to have "s#" return the raw binary data for
> Unicode objects?

Because file.write() for a binary file, and other similar things
(e.g. the encryption engine example I mentioned above) must have
*some* way to get at the raw bits.

> Note that it is not mentioned anywhere that
> "s#" and "t#" do have to necessarily return different things
> (binary being a superset of text). I'd opt for "s#" and "t#" both
> returning UTF-8 data. This can be implemented by delegating the
> buffer slots to the <defencstr> object (see below).

This would defeat the whole purpose of introducing t#.  We might as
well drop t# then altogether if we adopt this.

> > > Now Greg would chime in with the buffer interface and
> > > argue that it should make the underlying internal
> > > format accessible. This is a bad idea, IMHO, since you
> > > shouldn't really have to know what the internal data format
> > > is.
> > 
> > This is for C code.  Quite likely it *does* know what the internal
> > data format is!
> 
> C code can use the PyUnicode_* APIs to access the data. I
> don't think that argument parsing is powerful enough to
> provide the C code with enough information about the data
> contents, e.g. it can only state the encoding length, not the
> string length.

Typically, all the C code does is pass multibyte encoded strings on to
other library routines that know what to do to them, or simply give
them back unchanged at a later time.  It is essential to know the
number of bytes, for memory allocation purposes.  The number of
characters is totally immaterial (and multibyte-handling code knows
how to calculate the number of characters anyway).
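
The point that byte counts, not character counts, drive allocation can be sketched like so (the helper is a made-up stand-in for 8-bit-clean C code):

```python
blob = "日本語".encode("utf-8")   # 3 characters, 9 bytes

def store_and_fetch(b: bytes) -> bytes:
    """Toy 8-bit-clean carrier: stores and returns bytes untouched."""
    buf = bytearray()
    buf.extend(b)          # sizing here needs the byte count only
    return bytes(buf)

assert len(blob) == 9
assert store_and_fetch(blob) == blob   # data passes through unchanged
```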

> > > Defining "s#" to return UTF-8 data does not only
> > > make "s" and "s#" return the same data format (which should
> > > always be the case, IMO),
> > 
> > That was before t# was introduced.  No more, alas.  If you replace s#
> > with t#, I agree with you completely.
> 
> Done :-)
>  
> > > but also hides the internal
> > > format from the user and gives him a reliable cross-platform
> > > data representation of Unicode data (note that UTF-8 doesn't
> > > have the byte order problems of UTF-16).
> > >
> > > If you are still with, let's look at what "s" and "s#"
> > 
> > (and t#, which is more relevant here)
> > 
> > > do: they return pointers into data areas which have to
> > > be kept alive until the corresponding object dies.
> > >
> > > The only way to support this feature is by allocating
> > > a buffer for just this purpose (on the fly and only if
> > > needed to prevent excessive memory load). The other
> > > options of adding new magic parser markers or switching
> > > to more generic one all have one downside: you need to
> > > change existing code which is in conflict with the idea
> > > we started out with.
> > 
> > Agreed.  I think this was our thinking when Greg & I introduced t#.
> > My own preference would be to allocate a whole string object, not
> > just a buffer; this could then also be used for the .encode() method
> > using the default encoding.
> 
> Good point. I'll change <defencbuf> to <defencstr>, a Python
> string object created on request.
>  
> > > So, again, the question is: do we want this magical
> > > integration or not? Note that this is a design question,
> > > not one of memory consumption...
> > 
> > Yes, I want it.
> > 
> > Note that this doesn't guarantee that all old extensions will work
> > flawlessly when passed Unicode objects; but I think that it covers
> > most cases where you could have a reasonable expectation that it
> > works.
> > 
> > (Hm, unfortunately many reasonable expectations seem to involve
> > the current user's preferred encoding. :-( )
> 
> -- 
> Marc-Andre Lemburg

--Guido van Rossum (home page: http://www.python.org/~guido/)