[Python-Dev] marshal (was:Buffer interface in abstract.c? )

Greg Stein gstein@lyra.org
Sat, 14 Aug 1999 01:53:04 -0700


M.-A. Lemburg wrote:
> 
> Greg Stein wrote:
> >
> > On Tue, 10 Aug 1999, Fredrik Lundh wrote:
> > > maybe the unicode class shouldn't implement the
> > > buffer interface at all?  sure looks like the best way
> >
> > It is needed for fp.write(unicodeobj) ...
> >
> > It is also very handy for C functions to deal with Unicode strings.
> 
> Wouldn't a special C API be (even) more convenient ?

Why? Accessing the Unicode values as a series of bytes matches exactly
to the semantics of the buffer interface. Why throw in Yet Another
Function?

Your abstract.c functions make it quite simple.

> > > to avoid trivial mistakes (the current behaviour of
> > > fp.write(unicodeobj) is even more serious than the
> > > marshal glitch...)
> >
> > What's wrong with fp.write(unicodeobj)? It should write the unicode value
> > to the file. Are you suggesting that it will need to be done differently?
> > Icky.
> 
> Would this also write some kind of Unicode encoding header ?
> [Sorry, this is my Unicode ignorance shining through... I only
>  remember lots of talk about these things on the string-sig.]

Absolutely not. Placing the Byte Order Mark (BOM) into an output stream
is an application-level task. It should never by done by any subsystem.

There are no other "encoding headers" that would go into the output
stream. The output would simply be UTF-16 (2-byte values in host byte
order).

> Since fp.write() uses "s#" this would use the getreadbuffer
> slot in 1.5.2... I think what it *should* do is use the
> getcharbuffer slot instead (see my other post), since dumping
> the raw unicode data would loose too much information. Again,

I very much disagree. To me, fp.write() is not about writing characters
to a stream. I think it makes much more sense as "writing bytes to a
stream" and the buffer interface fits that perfectly.

There is no loss of data. You could argue that the byte order is lost,
but I think that is incorrect. The application defines the semantics:
the file might be defined as using host-order, or the application may be
writing a BOM at the head of the file.

> such things should be handled by extra methods, e.g. fp.rawwrite().

I believe this would be a needless complication of the interface.

> Hmm, I guess the philosophy behind the interface is not
> really clear.

I didn't design or implement it initially, but (as you may have guessed)
I am a proponent of its existence.

> Binary data is fetched via getreadbuffer and then
> interpreted as character data... I always thought that the
> getcharbuffer should be used for such an interpretation.

The former is bad behavior. That is why getcharbuffer was added (by me,
for 1.5.2). It was a preventative measure for the introduction of
Unicode strings. Using getreadbuffer for characters would break badly
given a Unicode string. Therefore, "clients" that want (8-bit)
characters from an object supporting the buffer interface should use
getcharbuffer. The Unicode object doesn't implement it, implying that it
cannot provide 8-bit characters. You can get the raw bytes thru
getreadbuffer.

> Or maybe, we should dump the getcharbufer slot again and
> use the getreadbuffer information just as we would a
> void* pointer in C: with no explicit or implicit type information.

Nope. That path is frought with failure :-)

Cheers,
-g

--
Greg Stein, http://www.lyra.org/