[Python-Dev] just say no...

Guido van Rossum guido@CNRI.Reston.VA.US
Sat, 13 Nov 1999 08:06:26 -0500


I think I have a reasonable grasp of the issues here, even though I
still haven't read about 100 msgs in this thread.  Note that t# and
the charbuffer addition to the buffer API were added by Greg Stein
with my support; I'll attempt to reconstruct our thinking at the
time...

[MAL]
> Let me summarize a bit on the general ideas behind "s", "s#"
> and the extra buffer:

I think you left out t#.

> First, we have a general design question here: should old code
> become Unicode compatible or not. As I recall the original idea
> about Unicode integration was to follow Perl's idea to have
> scripts become Unicode aware by simply adding a 'use utf8;'.

I've never heard of this idea before -- or am I taking it too literally?
It smells of a mode to me :-)  I'd rather live in a world where
Unicode just works as long as you use u'...' literals or whatever
convention we decide.

> If this is still the case, then we'll have to come up with a
> reasonable approach for integrating classical string based
> APIs with the new type.
> 
> Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> the Latin-1 folks) which has some very nice features (see
> http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> this encoding seems best fit for the purpose.

Yes, especially if we fix the default encoding as UTF-8.  (I'm
expecting feedback from HP on this next week; hopefully, when I see the
details, it'll be clear that we don't need a per-thread default encoding
to solve their problems; that's quite a likely outcome.  If not, we
have a real-world argument for allowing a variable default encoding,
without carnage.)

> However, one should not forget that UTF-8 is in fact a
> variable length encoding of Unicode characters, that is up to
> 3 bytes form a *single* character. This is obviously not compatible
> with definitions that explicitly state data to be using a
> 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> work like it does in Latin-1 text.

Sure, but where in current Python are there such requirements?
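
To make the variable-length point concrete, here is a small sketch in
present-day Python. It assumes that str is Unicode text and bytes is
8-bit data, which postdates this thread. Indexing the UTF-8 bytes picks
out bytes, not characters, whereas in Latin-1 the two coincide:

```python
text = "na\u00efve"              # 5 characters; U+00EF is i with diaeresis

latin1 = text.encode("latin-1")  # one byte per character
utf8 = text.encode("utf-8")      # U+00EF takes two bytes in UTF-8

print(len(text), len(latin1), len(utf8))

# In Latin-1, byte index 2 is character index 2.
same_in_latin1 = latin1[2:3].decode("latin-1") == text[2]

# In UTF-8, byte 2 is only the lead byte of a two-byte sequence,
# so it cannot be decoded on its own.
try:
    utf8[2:3].decode("utf-8")
    decoded_alone = True
except UnicodeDecodeError:
    decoded_alone = False
```

Code that only passes the bytes along is unaffected; only code that
indexes or slices by byte position sees the difference.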

> So if we are to do the integration, we'll have to choose
> argument parser markers that allow for multi byte characters.
> "t#" does not fall into this category, "s#" certainly does,
> "s" is arguable.

I disagree.  I grepped through the source for s# and t#.  Here's a bit
of background.  Before t# was introduced, s# was being used for two
distinct purposes: (1) to get an 8-bit text string plus its length, in
situations where the length was needed; (2) to get binary data (e.g.
GIF data read from a file in "rb" mode).  Greg pointed out that if we
ever introduced some form of Unicode support, these two had to be
disambiguated.  We found that the majority of uses was for (2)!
Therefore we decided to change the definition of s# to mean only (2),
and introduced t# to mean (1).  Also, we introduced getcharbuffer
corresponding to t#, while getreadbuffer was meant for s#.

Note that the definition of the 's' format was left alone -- as
before, it means you need an 8-bit text string not containing null
bytes.

Our expectation was that a Unicode string passed to an s# situation
would give a pointer to the internal format plus a byte count (not a
character count!) while t# would get a pointer to some kind of 8-bit
translation/encoding plus a byte count, with the explicit requirement
that the 8-bit translation would have the same lifetime as the
original Unicode object.  We decided to leave it up to the next
generation (i.e., Marc-Andre :-) to decide what kind of translation to
use and what to do when there is no reasonable translation.
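
A rough illustration of that expectation, in present-day Python rather
than C. The internal format is assumed here to be UTF-16 in
little-endian order, and the 8-bit translation is assumed to be UTF-8;
neither choice was settled at this point. The two helper names are
hypothetical stand-ins for what the format codes would hand to C code:

```python
def s_hash_view(u):
    """Hypothetical stand-in for s#: raw internal bytes plus a byte count.

    Assumes a UTF-16 little-endian internal representation.
    """
    raw = u.encode("utf-16-le")
    return raw, len(raw)


def t_hash_view(u, encoding="utf-8"):
    """Hypothetical stand-in for t#: an 8-bit encoding plus a byte count."""
    encoded = u.encode(encoding)
    return encoded, len(encoded)


u = "caf\u00e9"                   # 4 characters
raw, raw_len = s_hash_view(u)
enc, enc_len = t_hash_view(u)
print(raw_len, enc_len, len(u))   # both byte counts differ from the character count
```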

Any of the following choices is acceptable (from the point of view of
not breaking the intended t# semantics; we can now start deciding
which we like best):

- utf-8
- latin-1
- ascii
- shift-jis
- lower byte of unicode ordinal
- some user- or os-specified multibyte encoding

As far as t# is concerned, for encodings that don't encode all of
Unicode, untranslatable characters could be dealt with in any number
of ways (raise an exception, ignore, replace with '?', make best
effort, etc.).
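
As a sketch of the first three options, present-day Python exposes
them through the errors argument of str.encode(). This is shown only to
illustrate the choices, not as what t# would do; a "best effort"
transliteration is not shown:

```python
text = "caf\u00e9 \u20ac5"   # contains U+00E9 and the euro sign U+20AC

# Option 1: raise an exception on the first untranslatable character.
try:
    text.encode("ascii", "strict")
    raised = False
except UnicodeEncodeError:
    raised = True

# Option 2: silently drop what cannot be encoded.
ignored = text.encode("ascii", "ignore")

# Option 3: substitute '?' for each untranslatable character.
replaced = text.encode("ascii", "replace")
```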

Given the current context, it should probably be the same as the
default encoding -- i.e., utf-8.  If we end up making the default
user-settable, we'll have to decide what to do with untranslatable
characters -- but that will probably be decided by the user too (it
would be a property of a specific translation specification).

In any case, I feel that t# could receive a multi-byte encoding,
s# should receive raw binary data, and they should correspond to
getcharbuffer and getreadbuffer, respectively.

(Aside: the symmetry between 's' and 's#' is now lost; 's' matches
't#', there's no match for 's#'.)

> Also note that we have to watch out for embedded NULL bytes.
> UTF-16 has NULL bytes for every character from the Latin-1
> domain. If "s" were to give back a pointer to the internal
> buffer which is encoded in UTF-16, you would lose data.
> UTF-8 doesn't have this problem, since only NULL bytes
> map to (single) NULL bytes.

This is a red herring given my explanation above.
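
For reference, the embedded-null behaviour MAL describes is easy to see
in present-day Python. The sketch assumes UTF-16 little-endian without
a byte order mark as the internal form:

```python
text = "abc"
utf16 = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

# Code that treats the buffer as a null-terminated C string would stop
# at the first zero byte and see only this much of the UTF-16 data.
seen_by_c_string = utf16.split(b"\x00", 1)[0]

# In UTF-8 a zero byte appears only where the text itself has U+0000.
with_nul = "a\x00b".encode("utf-8")
```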

> Now Greg would chime in with the buffer interface and
> argue that it should make the underlying internal
> format accessible. This is a bad idea, IMHO, since you
> shouldn't really have to know what the internal data format
> is.

This is for C code.  Quite likely it *does* know what the internal
data format is!

> Defining "s#" to return UTF-8 data does not only
> make "s" and "s#" return the same data format (which should
> always be the case, IMO),

That was before t# was introduced.  No more, alas.  If you replace s#
with t#, I agree with you completely.

> but also hides the internal
> format from the user and gives him a reliable cross-platform
> data representation of Unicode data (note that UTF-8 doesn't
> have the byte order problems of UTF-16).
> 
> If you are still with me, let's look at what "s" and "s#"

(and t#, which is more relevant here)

> do: they return pointers into data areas which have to
> be kept alive until the corresponding object dies.
> 
> The only way to support this feature is by allocating
> a buffer for just this purpose (on the fly and only if
> needed to prevent excessive memory load). The other
> options of adding new magic parser markers or switching
> to more generic one all have one downside: you need to
> change existing code which is in conflict with the idea
> we started out with.

Agreed.  I think this was our thinking when Greg & I introduced t#.
My own preference would be to allocate a whole string object, not
just a buffer; this could then also be used for the .encode() method
using the default encoding.
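
A minimal sketch of that idea in present-day Python, assuming UTF-8 as
the default encoding. The class and attribute names are hypothetical;
the real thing would live inside the Unicode object in C:

```python
class CachedEncoding:
    """Hypothetical wrapper: a Unicode string plus a lazily built 8-bit copy."""

    def __init__(self, text, default_encoding="utf-8"):
        self.text = text
        self.default_encoding = default_encoding
        self._encoded = None   # allocated on first use only

    def encode(self):
        # Build the encoded string object once and keep it for as long as
        # this object lives, so pointers into it would stay valid.
        if self._encoded is None:
            self._encoded = self.text.encode(self.default_encoding)
        return self._encoded


u = CachedEncoding("gr\u00fc\u00df")
first = u.encode()
second = u.encode()   # same object as before, no second allocation
```

Because the cached object is owned by the Unicode object, it has the
same lifetime, which is the t# requirement described earlier.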

> So, again, the question is: do we want this magical
> integration or not? Note that this is a design question,
> not one of memory consumption...

Yes, I want it.

Note that this doesn't guarantee that all old extensions will work
flawlessly when passed Unicode objects; but I think that it covers
most cases where you could have a reasonable expectation that it
works.

(Hm, unfortunately many reasonable expectations seem to involve
the current user's preferred encoding. :-( )

> --
> 
> Ok, the above covered Unicode -> String conversion. Mark
> mentioned that he wanted the other way around to also
> work in the same fashion, ie. automatic String -> Unicode
> conversion. 
> 
> This could also be done in the same way by
> interpreting the string as UTF-8 encoded Unicode... but we
> have the same problem: where to put the data without
> generating new intermediate objects. Since only newly
> written code will use this feature there is a way to do
> this though:
> 
> PyArg_ParseTuple(args,"s#",&utf8,&len);

No!  That is supposed to give the native representation of the string
object.

I agree that Mark's problem requires a solution too, but it doesn't
have to use existing formatting characters, since there's no backwards
compatibility issue.

> If your C API understands UTF-8 there's nothing more to do,
> if not, take Greg's option 3 approach:
> 
> PyArg_ParseTuple(args,"O",&obj);
> unicode = PyUnicode_FromObject(obj);
> ...
> Py_DECREF(unicode);
> 
> Here PyUnicode_FromObject() will return a new
> reference if obj is a Unicode object or create a new
> Unicode object by interpreting str(obj) as UTF-8 encoded string.

This might work.
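
A rough Python analogue of that coercion, as a sketch only. The function
name is hypothetical, present-day str and bytes stand in for the Unicode
and 8-bit string types, and it accepts only bytes-like input rather
than calling str(obj) on arbitrary objects:

```python
def unicode_from_object(obj, encoding="utf-8"):
    """Hypothetical analogue of the coercion described above."""
    if isinstance(obj, str):
        return obj                          # already Unicode: hand it back
    if isinstance(obj, (bytes, bytearray)):
        return bytes(obj).decode(encoding)  # interpret 8-bit data as UTF-8
    raise TypeError("expected str or bytes, got %s" % type(obj).__name__)


already_unicode = "abc"
same = unicode_from_object(already_unicode)
decoded = unicode_from_object(b"caf\xc3\xa9")
```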

--Guido van Rossum (home page: http://www.python.org/~guido/)