[Python-Dev] just say no...

Greg Stein gstein@lyra.org
Tue, 16 Nov 1999 04:04:47 -0800 (PST)


On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> Guido van Rossum wrote:
>...
> > t# refers to byte-encoded data.  Multibyte encodings are explicitly
> > designed to be passed cleanly through processing steps that handle
> > single-byte character data, as long as they are 8-bit clean and don't
> > do too much processing.
> 
> Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> "8-bit clean" as you obviously did.

Hrm. That might be dangerous. Many of the functions that use "t#" assume
that each character is 8 bits long, i.e. that the returned length == the
number of characters.

I'm not sure what the implications would be if you interpret the semantics
of "t#" as multi-byte characters.

>...
> > For example, take an encryption engine.  While it is defined in terms
> > of byte streams, there's no requirement that the bytes represent
> > characters -- they could be the bytes of a GIF file, an MP3 file, or a
> > gzipped tar file.  If we pass Unicode to an encryption engine, we want
> > Unicode to come out at the other end, not UTF-8.  (If we had wanted to
> > encrypt UTF-8, we should have fed it UTF-8.)

Heck. I just want to quickly throw the data onto my disk. I'll write a
BOM, followed by the raw data. Done. It's even portable.
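
Something like this quick sketch would do (the file name is made up, and
I'm leaning on the 'utf-16' codec to emit the BOM for me):

data = u'\u00e9\u20ac'          # whatever Unicode I happen to have
f = open('data.bin', 'wb')
f.write(data.encode('utf-16'))  # BOM first, then the raw code units
f.close()

Portable, because the BOM tells the reader which byte order follows.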

>...
> > Aha, I think there's a confusion about what "8-bit" means.  For me, a
> > multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?

Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t"
format).

> > (As far as I know, C uses char* to represent multibyte characters.)
> > Maybe we should disambiguate it more explicitly?

We can disambiguate with a new format character, or we can clarify the
semantics of "t" to mean single- *or* multi-byte characters. Again, I
think there may be trouble if the semantics of "t" are defined to allow
multibyte characters.

> There should be some definition for the two markers and the
> ideas behind them in the API guide, I guess.

Certainly.

[ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]

> > > Hmm, I would strongly object to making "s#" return the internal
> > > format. file.write() would then default to writing UTF-16 data
> > > instead of UTF-8 data. This could result in strange errors
> > > due to the UTF-16 format being endian dependent.
> > 
> > But this was the whole design.  file.write() needs to be changed to
> > use s# when the file is open in binary mode and t# when the file is
> > open in text mode.

Interesting idea, but that presumes that "t" will be defined for the
Unicode object (i.e. it implements the getcharbuffer type slot). Because
of the multi-byte problem, I don't think it will.
[ not to mention that I don't think the Unicode object should implicitly
  do a UTF-8 conversion and hold a ref to the resulting string ]
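
For concreteness, here is roughly how I read the proposal, written out at
the Python level (the function is hypothetical, and I'm using UTF-16 and
UTF-8 as stand-ins for whatever "s#" and "t#" would actually hand back):

def file_write(f, u, binary_mode):
    if binary_mode:
        f.write(u.encode('utf-16'))  # "s#": the raw internal representation
    else:
        f.write(u.encode('utf-8'))   # "t#": character buffer data -- only
                                     # possible if the Unicode object grows
                                     # a getcharbuffer slot

...and, per the above, I don't think that else branch is going to be
available.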

>...
> I still don't feel very comfortable about the fact that all
> existing APIs using "s#" will suddenly receive UTF-16 data if
> being passed Unicode objects: this probably won't get us the
> "magical" Unicode integration we invision, since "t#" usage is not
> very wide spread and character handling code will probably not
> work well with UTF-16 encoded strings.

I'm not sure that we should definitely go for "magical." Perl has magic in
it, and that is one of its worst faults. Go for clean and predictable, and
leave as much logic to the Python level as possible. The interpreter
should provide a minimum of functionality, rather than second-guessing and
trying to be neat and sneaky with its operation.

>...
> > Because file.write() for a binary file, and other similar things
> > (e.g. the encryption engine example I mentioned above) must have
> > *some* way to get at the raw bits.
> 
> What for ?

How about: "because I'm the application developer, and I say that I want
the raw bytes in the file."

> Any lossless encoding should do the trick... UTF-8
> is just as good as UTF-16 for binary files; plus it's more compact
> for ASCII data. I don't really see a need to get explicitly
> at the internal data representation because both encodings are
> in fact "internal" w/r to Unicode objects.
> 
> The only argument I can come up with is that using UTF-16 for
> binary files could (possibly) eliminate the UTF-8 conversion step
> which is otherwise always needed.

The argument that I come up with is "don't tell me how to design my
storage format, and don't make Python force me into one."
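
(Not that the size point is wrong, mind you; for ASCII data it checks out,
e.g. with a throwaway string:

s = u'hello world'
print(len(s.encode('utf-8')))   # 11 bytes
print(len(s.encode('utf-16')))  # 24 bytes: a 2-byte BOM plus 2 bytes/char

It's just not the issue. The issue is who gets to choose the format.)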

If I want to write Unicode text to a file, the most natural thing to do
is:

open('file', 'w').write(u)

If you do a conversion on me, then I'm not writing Unicode any more. I've
got to go and do some nasty conversion of my own, which just monkeys up my
program.

If I have a Unicode object, but I *want* to write UTF-8 to the file, then
the cleanest thing is:

open('file', 'w').write(encode(u, 'utf-8'))

This makes it clear that I've got a Unicode object as input, but that I'm
writing UTF-8.

I have a second argument, too: See my first argument. :-)

Really... this is kind of what Fredrik was trying to say: don't get in the
way of the application programmer. Give them tools, but avoid policy and
gimmicks and other "magic".

Cheers,
-g

--
Greg Stein, http://www.lyra.org/