[Datetime-SIG] Are there any "correct" implementations of tzinfo?

Mon Sep 14 21:19:56 EDT 2015

On Mon, Sep 14, 2015, at 18:09, Tim Peters wrote:
> Sorry, I'm not arguing about this any more.  Pickle doesn't work at
> all at the level of "count of bytes followed by a string". 

The SHORT_BINBYTES opcode consists of the byte b'C', followed by *yes
indeed* "count of bytes followed by a string".

> If you
> want to make a pickle argument that makes sense, I'm afraid you'll
> need to become familiar with how pickle works first.  This is not the
> place for a pickle tutorial.
> 
> Start by learning what a datetime pickle actually is.
> pickletools.dis() will be very helpful.

    0: \x80 PROTO      3
    2: c    GLOBAL     'datetime datetime'
   21: q    BINPUT     0
   23: C    SHORT_BINBYTES b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00'
   35: q    BINPUT     1
   37: \x85 TUPLE1
   38: q    BINPUT     2
   40: R    REDUCE
   41: q    BINPUT     3
   43: .    STOP

The payload is ten bytes, and the byte immediately before it is in fact
0x0a. If I pickle any byte string under 256 bytes long by itself, the
byte immediately before the data is the length. This is how I initially
came to the conclusion that "count of bytes followed by a string" was
valid.
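A minimal sketch of that check, for anyone who wants to repeat it. The helper name short_binbytes_header is mine, purely for illustration, and the fixed offsets assume protocol 3 with a byte string pickled by itself (a two-byte PROTO header followed directly by the opcode and its count).

```python
import pickle

def short_binbytes_header(value):
    """Return (opcode, count, data) for a short bytes value pickled alone.

    Assumes the protocol 3 layout: 2-byte PROTO header, 1-byte opcode,
    1-byte count, then the raw data.
    """
    blob = pickle.dumps(value, protocol=3)
    return blob[2:3], blob[3], blob[4:4 + len(value)]

payload = b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00'
opcode, count, data = short_binbytes_header(payload)
print(opcode, hex(count), data == payload)

# The same holds for other lengths below 256.
for n in (1, 10, 255):
    op, cnt, raw = short_binbytes_header(b'x' * n)
    assert op == b'C' and cnt == n and raw == b'x' * n
```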

I did, before writing my earlier post, look into the high-level aspects
of how datetime pickle works - it uses __reduce__ to create up to two
arguments, one of which is a 10-byte string, and the other is the
tzinfo. Those arguments are passed into the datetime constructor and
detected by that constructor - for example, I can call it directly with
datetime(b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00') and get the same result
as unpickling.
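As a sketch of what I mean (this relies on the constructor accepting the pickle state directly, which is an implementation detail rather than documented API, so treat it as an illustration under that assumption):

```python
from datetime import datetime

payload = b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00'
dt = datetime(2015, 9, 14, 21, 6, 42)

# __reduce__ returns (callable, args); for a naive datetime the args
# tuple holds only the 10-byte state (a tzinfo would be a second item).
func, args = dt.__reduce__()[:2]
print(func, args)

# Passing the state bytes straight to the constructor rebuilds the value,
# which is what unpickling does via REDUCE.
rebuilt = datetime(payload)
print(rebuilt == dt)
```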

At the low level, the part that represents that first argument does
indeed appear to be "count of bytes followed by a string". I can add to
the count, add more bytes, and it will call the constructor with the
longer string. If I use pickletools.dis on my modified value the output
looks the same except for, as expected, the offsets and the value of the
argument to the SHORT_BINBYTES opcode.
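Roughly, the experiment looks like the sketch below. The two extra bytes are arbitrary values I picked for illustration, and I locate the payload by searching for it rather than relying on fixed offsets. The final load is wrapped in try/except because a constructor that insists on exactly 10 bytes will reject the longer argument; the point is only that the pickle layer itself passes it through.

```python
import pickle
import pickletools
from datetime import datetime

payload = b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00'
extra = b'\xff\xff'  # hypothetical trailing bytes, chosen arbitrarily

original = pickle.dumps(datetime(2015, 9, 14, 21, 6, 42), protocol=3)
start = original.index(payload)
assert original[start - 1] == len(payload)  # the count byte precedes the data

# Bump the count and append the extra bytes; nothing else changes.
patched = (original[:start - 1]
           + bytes([len(payload) + len(extra)])
           + payload + extra
           + original[start + len(payload):])

def ops(blob):
    """List (opcode name, argument) pairs without executing the pickle."""
    return [(op.name, arg) for op, arg, pos in pickletools.genops(blob)]

original_ops = ops(original)
patched_ops = ops(patched)

# Actually loading calls the constructor with the longer byte string.
try:
    result = pickle.loads(patched)
except Exception as exc:
    result = exc
print(repr(result))
```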

So, it appears that, as I was saying, "wasted space" would not have been
an obstacle to having the "payload" accepted by the constructor (and
produced by __reduce__, ultimately by _getstate) consist of "a byte
string of >= 10 bytes, the first 10 of which are used and the rest of
which are ignored by Python <= 3.5" instead of "a byte string of exactly
10 bytes". It would have accepted and produced exactly the same pickle
values, but been prepared to accept larger arguments pickled from
future versions.

For completeness: Protocol versions 1 and 2 use BINUNICODE on a
latin1-to-utf8 version of the byte string, with a similar "count of
bytes followed by a string" (though the count of bytes is of UTF-8
bytes). Protocol version 0 uses UNICODE, terminated by \n, and a literal
\n is represented by \\u000a. In all cases some extra data around the
value sets it up to call "codecs.encode(..., 'latin1')" upon unpickling.
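A sketch of the protocol 2 case, assuming a Python 3 interpreter. Note that the 10-byte payload becomes 11 UTF-8 bytes, because \xdf is the only byte above 0x7f and it takes two bytes in UTF-8; the count stored in the pickle is that 11, as a 4-byte little-endian integer.

```python
import pickle
import pickletools
from datetime import datetime

payload = b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00'
text = payload.decode('latin1')   # one character per original byte
utf8 = text.encode('utf-8')       # what BINUNICODE actually stores

dt = datetime(2015, 9, 14, 21, 6, 42)
blob = pickle.dumps(dt, protocol=2)

# Collect every BINUNICODE argument without executing the pickle.
binunicode_args = [arg for op, arg, pos in pickletools.genops(blob)
                   if op.name == 'BINUNICODE']

# The count sits in the 4 bytes immediately before the UTF-8 data.
start = blob.index(utf8)
count = int.from_bytes(blob[start - 4:start], 'little')
print(len(payload), len(utf8), count)
```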

So have I shown you that I know enough about the pickle format to know
that permitting a longer string (and ignoring the extra bytes) would
have had zero impact on the pickle representation of values that did not
contain a longer string? I'd already figured out half of this before
writing my earlier post; I just assumed *you* knew enough that I
wouldn't have to show my work.

Extra credit:
    0: \x80 PROTO      3
    2: c    GLOBAL     'datetime datetime'
   21: q    BINPUT     0
   23: (    MARK
   24: M        BININT2    2014
   27: K        BININT1    9
   29: K        BININT1    14
   31: K        BININT1    21
   33: K        BININT1    6
   35: K        BININT1    42
   37: t        TUPLE      (MARK at 23)
   38: q    BINPUT     1
   40: R    REDUCE
   41: q    BINPUT     2
   43: .    STOP
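
For anyone who wants to try it, here is a sketch that assembles the bytes for the disassembly above by hand and loads them. The byte string is my own transcription of that listing, so check it against pickletools.dis before trusting it.

```python
import pickle
from datetime import datetime

handmade = (
    b'\x80\x03'                 # PROTO 3
    b'cdatetime\ndatetime\n'    # GLOBAL 'datetime datetime'
    b'q\x00'                    # BINPUT 0
    b'('                        # MARK
    b'M\xde\x07'                # BININT2 2014 (little-endian 0x07de)
    b'K\x09'                    # BININT1 9
    b'K\x0e'                    # BININT1 14
    b'K\x15'                    # BININT1 21
    b'K\x06'                    # BININT1 6
    b'K\x2a'                    # BININT1 42
    b't'                        # TUPLE (back to MARK)
    b'q\x01'                    # BINPUT 1
    b'R'                        # REDUCE: call datetime(*tuple)
    b'q\x02'                    # BINPUT 2
    b'.'                        # STOP
)

value = pickle.loads(handmade)
print(value)
```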