[Python-Dev] Dropping bytes "support" in json

"Martin v. Löwis" martin at v.loewis.de
Fri Apr 10 00:07:23 CEST 2009


> As far as Python 3 goes, I honestly have not yet familiarized myself
> with the changes to the IO infrastructure and what the new idioms are.
> At this time, I can't make any educated decisions with regard to how
> it should be done because I don't know exactly how bytes are supposed
> to work and what the common idioms are for other libraries in the
> stdlib that do similar things.

It's really very similar to 2.x: the "bytes" type is to be used in all
interfaces that operate on byte sequences that may or may not represent
characters; in particular, for interfaces where the operating system
deliberately uses bytes - i.e. low-level file IO and socket IO; also
for cases where the encoding is embedded in the stream that still
needs to be processed (e.g. XML parsing).

(Unicode) strings should be used where the data is truly text by
nature, i.e. where no encoding information is necessary to find out
what characters are intended. They are used on interfaces where the
encoding is known (e.g. text IO, where the encoding is specified
on opening; XML parser results, which use the declared encoding; and
GUI libraries, which naturally expect text).
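
As a rough illustration of the distinction in Python 3 (only a sketch:
the sample data and variable names are made up, and an in-memory buffer
stands in for a real file or socket):

```python
import io
import xml.etree.ElementTree as ET

# Bytes as they might arrive from low-level file or socket IO.
# Nothing in the data itself says which characters are meant.
raw = "Löwis".encode("utf-8")
assert isinstance(raw, bytes)

# Text IO: the encoding is specified when the stream is opened,
# so reading yields (unicode) strings.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
text = text_stream.read()
assert isinstance(text, str)

# XML: the encoding is embedded in the byte stream, so the parser
# takes bytes and hands back strings.
xml_bytes = b'<?xml version="1.0" encoding="iso-8859-1"?><name>L\xf6wis</name>'
parsed = ET.fromstring(xml_bytes).text
assert isinstance(parsed, str)
```
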

> Until I figure that out, someone else
> is better off making decisions about the Python 3 version.

Some of us can certainly explain to you how this is supposed to
work. However, we need you to check any assumption against the
known use cases - would the users of the module be happy if it
worked one way or the other?

> My guess is
> that it should work the same way as it does in Python 2.x: take bytes
> or unicode input in loads (which means encoding is still relevant). I
> also think the output of dumps should also be bytes, since it is a
> serialization, but I am not sure how other libraries do this in Python
> 3 because one could argue that it is also text.

This, indeed, has been an endless debate, and, in the end, the decision
was somewhat arbitrary. Here are some examples:

- base64.encodestring expects bytes (naturally, since it is supposed to
  encode arbitrary binary data), and produces bytes (debatably)
- binascii.b2a_hex likewise (expect and produce bytes)
- pickle.dumps produces bytes (uniformly, both for binary and text
  pickles)
- marshal.dumps likewise
- email.message.Message().as_string produces a (unicode) string
  (see Barry's recent thread on whether that's a good thing; the
  email package hasn't been fully ported to 3k, either)
- the XML libraries (continue to) parse bytes, and produce
  Unicode strings
- for the IO libraries, see above
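
The types in the list above can be checked interactively. A small
sketch, with made-up sample data; it assumes a Python 3 release where
base64.encodebytes exists (base64.encodestring was the older name for
it and is gone from recent 3.x releases):

```python
import base64
import binascii
import marshal
import pickle
import xml.etree.ElementTree as ET
from email.message import Message

data = b"\x00\xffspam"  # arbitrary binary sample data

b64 = base64.encodebytes(data)                   # bytes in, bytes out
hexed = binascii.b2a_hex(data)                   # bytes in, bytes out
pickled_text = pickle.dumps([1, 2], protocol=0)  # "text" pickle, still bytes
pickled_bin = pickle.dumps([1, 2], protocol=2)   # binary pickle, bytes
marshalled = marshal.dumps([1, 2])               # bytes

msg = Message()
msg["Subject"] = "hi"
msg.set_payload("body")
as_text = msg.as_string()                        # (unicode) string

elem_text = ET.fromstring(b"<a>text</a>").text   # parses bytes, gives str

for name, value in [("b64", b64), ("hexed", hexed),
                    ("pickled_text", pickled_text),
                    ("pickled_bin", pickled_bin),
                    ("marshalled", marshalled),
                    ("as_text", as_text), ("elem_text", elem_text)]:
    print(name, type(value).__name__)
```
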

> If other libraries
> that do text/text encodings (e.g. binascii, mimelib, ...) use str for
> input and output

See above - most of them don't; mimetools no longer exists (it was
replaced by the email package).

> instead of bytes then maybe Antoine's changes are the
> right solution and I just don't know better because I'm not up to
> speed with how people write Python 3 code.

There isn't too much fresh end-user code out there, so we can't really
tell, either. As for standard library users - users will do whatever
the library forces them to do.

This is why I'm so concerned about this issue: we should get it right,
or not do it at all. I still think you would be the best person to
determine what is right.

> I'll do my best to find some time to look into Python 3 more closely
> soon, but thus far I have not been very motivated to do so because
> Python 3 isn't useful for us at work and twiddling syntax isn't a very
> interesting problem for me to solve.

And I didn't expect you to - it seems people are quite willing to do
the actual work, as long as there is some guidance.

Regards,
Martin

