[Python-Dev] buffer design (was: marshal (was:Buffer interface in abstract.c?))

Sat, 14 Aug 1999 04:34:17 -0700

M.-A. Lemburg wrote:
>...
> I meant PyUnicode_* style APIs for dealing with all the aspects
> of Unicode objects -- much like the PyString_* APIs available.

Sure, these could be added as necessary. For raw access to the bytes, I
would refer people to the abstract buffer functions, tho.

> > Your abstract.c functions make it quite simple.
> 
> BTW, do we need an extra set of those with buffer index or not ?
> Those would really be one-liners for the sake of hiding the
> type slots from applications.

It sounds like NumPy and PIL would need it, which makes the landscape
quite a bit different from the last time we discussed this (when we
didn't imagine anybody needing those).

>...
> > > Since fp.write() uses "s#" this would use the getreadbuffer
> > > slot in 1.5.2... I think what it *should* do is use the
> > > getcharbuffer slot instead (see my other post), since dumping
> > > the raw unicode data would lose too much information. Again,
> >
> > I very much disagree. To me, fp.write() is not about writing characters
> > to a stream. I think it makes much more sense as "writing bytes to a
> > stream" and the buffer interface fits that perfectly.
> 
> This is perfectly ok, but shouldn't the behaviour of fp.write()
> mimic that of previous Python versions ? How does JPython
> write the data ?

fp.write() had no semantics for writing Unicode objects since they
didn't exist. Therefore, we are not breaking or changing any behavior.

> Inlined different subject:
> I think the internal semantics of "s#" using the getreadbuffer slot
> and "t#" the getcharbuffer slot should be switched; see my other post.

1) Too late
2) The use of "t#" ("text") for the getcharbuffer slot was decided by
the Benevolent Dictator.
3) see (2)

> In previous Python versions "s#" had the semantics of string data
> with possibly embedded NULL bytes. Now it suddenly has the meaning
> of binary data and you can't simply change extensions to use the
> new "t#" because people are still using them with older Python
> versions.

Guido and I had a pretty long discussion on what the best approach here
was. I think we even pulled in Tim as a final arbiter, as I recall.

I believe "s#" remained getreadbuffer simply because it *also* meant
"give me the bytes of that object". If it changed to getcharbuffer, then
you could see exceptions in code that didn't raise exceptions
beforehand.
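
To make that concrete, here is a toy model of the two slots. It is only
a sketch: toy_object, parse_s_hash and parse_t_hash are made-up names,
not the real PyBufferProcs layout or the real argument parser, and the
"unicode" object is just an array of 16-bit units. The point is that
the "s#"-like path succeeds for anything exposing raw bytes, while the
"t#"-like path refuses an object with no character slot.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the buffer slots; NOT the real PyBufferProcs.
   A NULL slot means "this object does not implement it". */
typedef struct {
    /* Each slot stores the start of the data in *ptr and returns
       the length in bytes. */
    int (*getreadbuffer)(const void **ptr);
    int (*getcharbuffer)(const char **ptr);
} toy_object;

static const char str_data[] = "abc";
static const unsigned short uni_data[] = { 0x61, 0x62, 0x63 };

static int str_read(const void **ptr) { *ptr = str_data; return 3; }
static int str_char(const char **ptr) { *ptr = str_data; return 3; }
static int uni_read(const void **ptr)
{
    *ptr = uni_data;
    return (int)sizeof uni_data;
}

/* A plain string offers both views; the toy Unicode object offers
   raw bytes only. */
static const toy_object toy_string  = { str_read, str_char };
static const toy_object toy_unicode = { uni_read, NULL };

/* "s#"-like: accept anything that can expose raw bytes. */
static int parse_s_hash(const toy_object *o, const void **ptr)
{
    assert(o != NULL && ptr != NULL);
    if (o->getreadbuffer == NULL)
        return -1;
    return o->getreadbuffer(ptr);
}

/* "t#"-like: accept only objects that can expose 8-bit characters. */
static int parse_t_hash(const toy_object *o, const char **ptr)
{
    assert(o != NULL && ptr != NULL);
    if (o->getcharbuffer == NULL)
        return -1;  /* the real parser would raise an exception here */
    return o->getcharbuffer(ptr);
}
```

In this model, pointing "s#" at the character slot would turn every
object shaped like toy_unicode from "works" into "fails", which is the
compatibility break described above.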

(more below)

> > There is no loss of data. You could argue that the byte order is lost,
> > but I think that is incorrect. The application defines the semantics:
> > the file might be defined as using host-order, or the application may be
> > writing a BOM at the head of the file.
> 
> The problem here is that many applications were not written
> to handle these kinds of objects. Previously they could only
> handle strings, now they can suddenly handle any object
> having the buffer interface and then fail when the data
> gets read back in.

An application is a complete unit. How are you suddenly going to
manifest Unicode objects within that application? One way is for the
developer to go in and change things; let them deal with the issues and
fallout of their change. The other is an external change, such as an
upgrade to the interpreter or a module. Again (IMO), if you're
perturbing a system, then you are responsible for also correcting any
problems you introduce.

In any case, Guido's position was that things can easily switch over to
the "t#" interface to prevent the class of error where you pass a
Unicode string to a function that expects a standard string.

> > > such things should be handled by extra methods, e.g. fp.rawwrite().
> >
> > I believe this would be a needless complication of the interface.
> 
> It would clarify things and make the interface 100% backward
> compatible again.

No. "s#" used to pull bytes from any buffer-capable object. Your
suggestion for "s#" to use the getcharbuffer could introduce exceptions
into currently-working code.

(this was probably Guido's prime motivation for the current meaning of
"t#"... I can dig up the mail thread if people need an authoritative
commentary on the decision that was made)

> > > Hmm, I guess the philosophy behind the interface is not
> > > really clear.
> >
> > I didn't design or implement it initially, but (as you may have guessed)
> > I am a proponent of its existence.
> >
> > > Binary data is fetched via getreadbuffer and then
> > > interpreted as character data... I always thought that the
> > > getcharbuffer should be used for such an interpretation.
> >
> > The former is bad behavior. That is why getcharbuffer was added (by me,
> > for 1.5.2). It was a preventative measure for the introduction of
> > Unicode strings. Using getreadbuffer for characters would break badly
> > given a Unicode string. Therefore, "clients" that want (8-bit)
> > characters from an object supporting the buffer interface should use
> > getcharbuffer. The Unicode object doesn't implement it, implying that it
> > cannot provide 8-bit characters. You can get the raw bytes thru
> > getreadbuffer.
> 
> I agree 100%, but did you add the "t#" instead of having
> "s#" use the getcharbuffer interface ?

Yes. For reasons detailed above.

> E.g. my mxTextTools
> package uses "s#" on many APIs. Now someone could stick
> in a Unicode object and get pretty strange results without
> any notice about mxTextTools and Unicode being incompatible.

They could also stick in an array of integers. That supports the buffer
interface, meaning the "s#" in your code would extract the bytes from
it. In other words, people can already stick bogus stuff into your code.
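
A small illustration of what "extract the bytes" means for such an
array. This is plain C rather than the interpreter's own code, and
ints_as_bytes is a hypothetical helper name; it only shows that a
byte-oriented consumer sees sizeof(int) bytes per element, in host
order, and nothing that resembles text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the raw bytes behind an int array into
   out. Returns the number of bytes copied, or 0 if out is too small. */
static size_t ints_as_bytes(const int *values, size_t count,
                            unsigned char *out, size_t out_size)
{
    size_t nbytes = count * sizeof(int);

    assert(values != NULL && out != NULL);
    if (nbytes > out_size)
        return 0;
    memcpy(out, values, nbytes);
    return nbytes;
}
```

Three integers come back as 3 * sizeof(int) bytes whose layout depends
on the host, which is about as meaningful to a text-processing routine
as the raw bytes of a Unicode object would be.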

This seems to be a moot argument.

> You could argue that I change to "t#", but that doesn't
> work since many people out there still use Python versions
> <1.5.2 and those didn't have "t#", so mxTextTools would then
> fail completely for them.

If support for the older versions is needed, then use an #ifdef to set
up the appropriate macro in some header. Use that throughout your code.
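
A minimal sketch of such a header. MXT_TEXT_FORMAT and
text_format_is_strict are made-up names, and using
Py_TPFLAGS_HAVE_GETCHARBUFFER as the feature test is an assumption: it
relies on that symbol being defined only by headers that know about
getcharbuffer. Substitute whatever test suits your build.

```c
#include <string.h>

/* In a real extension this goes after #include "Python.h".
   Assumption: Py_TPFLAGS_HAVE_GETCHARBUFFER is defined only by
   headers that provide the getcharbuffer slot (and thus "t#"). */
#ifdef Py_TPFLAGS_HAVE_GETCHARBUFFER
#  define MXT_TEXT_FORMAT "t#"  /* 8-bit character data only */
#else
#  define MXT_TEXT_FORMAT "s#"  /* older interpreters: raw bytes */
#endif

/* Adjacent string literals concatenate, so a format string can be
   written as "O" MXT_TEXT_FORMAT wherever one is needed. */
static const char find_format[] = "O" MXT_TEXT_FORMAT;

/* Nonzero when the selected format rejects non-character buffers. */
static int text_format_is_strict(void)
{
    return strcmp(MXT_TEXT_FORMAT, "t#") == 0;
}
```

Every argument-parsing call in the package then uses the macro instead
of a literal "s#", so the same source builds against both old and new
interpreters.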

In any case: yes -- I would argue that you should absolutely be using
"t#".

Cheers,
-g

--
Greg Stein, http://www.lyra.org/