[Python-Dev] Dropping bytes "support" in json

Paul Moore p.f.moore at gmail.com
Fri Apr 10 13:53:47 CEST 2009


2009/4/10 Nick Coghlan <ncoghlan at gmail.com>:
> glyph at divmod.com wrote:
>> On 03:21 am, ncoghlan at gmail.com wrote:
>>> Given that json is a wire protocol, that sounds like the right approach
>>> for json as well. Once bytes-everywhere works, then a text API can be
>>> built on top of it, but it is difficult to build a bytes API on top of a
>>> text one.
>>
>> I wish I could agree, but JSON isn't really a wire protocol.  According
>> to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the
>> serialization of structured data".  There are some notes about encoding,
>> but it is very clearly described in terms of unicode code points.
>
> Ah, my apologies - if the RFC defines things such that the native format
> is Unicode, then yes, the appropriate Python 3.x data type for the base
> implementation would indeed be strings.

Indeed, the RFC seems to clearly imply that loads should take a
Unicode string, dumps should produce one, and load/dump should work in
terms of text files (not byte files).

On the other hand, further down in the document:

"""
3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.
"""

This is at best confused (in my utterly non-expert opinion :-)) as
Unicode isn't an encoding...

I would guess that what the RFC is trying to say is that JSON is text
(Unicode) and where a byte stream purporting to be JSON is encountered
without a defined encoding, this is how to guess one.

That implies that loads can/should also allow bytes as input, applying
the given algorithm to guess an encoding. And similarly load
can/should accept a byte stream, on the same basis. (There's no need
to allow the possibility of accepting bytes plus an encoding - in that
case the user should decode the bytes before passing Unicode to the
JSON module).
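
To make the guessing rule concrete, here is a rough sketch of the null-pattern check from section 3 of the RFC. The names guess_json_encoding and loads_bytes are made up for illustration and are not part of the json module. It assumes the input has no byte order mark and that the first two characters are ASCII, as the RFC states.

```python
import json


def guess_json_encoding(data: bytes) -> str:
    """Guess the encoding from the nulls in the first four octets.

    Patterns from RFC 4627, section 3 (xx = non-null octet):
        00 00 00 xx  UTF-32BE      xx 00 00 00  UTF-32LE
        00 xx 00 xx  UTF-16BE      xx 00 xx 00  UTF-16LE
        anything else              UTF-8 (the default)
    """
    head = data[:4]
    if len(head) == 4:
        nulls = tuple(octet == 0 for octet in head)
        if nulls == (True, True, True, False):
            return "utf-32-be"
        if nulls == (False, True, True, True):
            return "utf-32-le"
        if nulls == (True, False, True, False):
            return "utf-16-be"
        if nulls == (False, True, False, True):
            return "utf-16-le"
    # Fewer than four octets, or no recognised null pattern.
    return "utf-8"


def loads_bytes(data: bytes):
    """Hypothetical bytes front end: guess, decode, then parse the text."""
    return json.loads(data.decode(guess_json_encoding(data)))
```

With this, loads_bytes('{"a": 1}'.encode("utf-16-le")) parses the same as json.loads('{"a": 1}'). The text-based loads stays the base implementation and the bytes handling is a thin layer on top of it.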

An alternative might be for the JSON module to register a special
encoding ('JSON-guess'?) which captures the rules here. Then there's
no need for special bytes parameter handling.

Of course, this is all from a native English speaker, who therefore
has no idea of the real life issues involved in Unicode :-)

Paul.

