[Tutor] Pickles and Shelves Concept

Martin A. Brown martin at linux-ip.net
Sat Jun 8 01:11:55 EDT 2019


Hello,

>I am not getting the concept of pickle and shelves in python, I 
>mean what's the use of both the concepts, when to use them in code 
>instead of using file read and write operations.
>
>Could anyone please explain me the concepts.

I see you already have two answers (from David Rock and Alan 
Gauld).  
I will add a slightly different answer and try also to explain some 
of the history (at a very high level).

* Programs need to take input from "somewhere".
* Programs need data structures in memory on which to operate.

There are many different ways to store data "somewhere" and in order 
to create the data structures in memory (on which your program will 
operate), you need to have code that knows how to read that data 
from "somewhere".

So, there's data that has been written to disk.  I often call this 
the serialized form of the data [0].  There are different uses 
for serialization, but the ones I'll talk about are the serialized 
formats that we typically write to disk to store data for a program 
to read.  Here are a few such serialized formats:

 * JSON or XML (homage: SGML)
 * pickles and shelves
 * GNU dbm, ndbm, cdb, rocksdb and probably 1 million others
 * custom binary formats
 * plain text files (be careful with your encoding...)

Digression:  You might ask... what about SQL?  Technically, the 
serialization is something that the SQL database software takes care 
of, so your application doesn't need to know about the serialized 
format.  This can be freeing, at the cost of some complexity.  But, 
back to your question.

Every one of the serialized formats comes with some advantages and 
some disadvantages.  Some are easy.  Some are flexible.  Other
formats are structured with bindings in many languages.  Some 
are tied closely to a single language or even specific language 
versions.  Some formats are even defined by a single application or 
program that somebody has written.

What about pickle and shelve?  Where do they fit?

Both pickle and shelve are well-maintained and older Python-specific 
formats that allow you to serialize Python objects and data 
structures to disk.  This is extremely convenient if you are 
unlikely to change Python versions or to change your data 
structures.  Need your program to "remember" something from a prior 
run?  When it starts up, it can read a ./state.pickle straight into 
memory, pick up where it left off and perform some operation, and 
then, when complete, save the data structure back to ./state (or 
more safely to a new file ./state.$timestamp) and exit.
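A minimal sketch of that load-on-start, save-on-exit cycle might 
look like this (the filename and the contents of the state 
dictionary are invented for illustration):

```python
import os
import pickle

STATE_FILE = "state.pickle"  # hypothetical state file for this sketch

# On start-up: read any saved state straight into memory,
# or start fresh if no prior run left a file behind.
if os.path.exists(STATE_FILE):
    with open(STATE_FILE, "rb") as f:
        state = pickle.load(f)
else:
    state = {"runs": 0}

# ... the program does its work, then records what happened ...
state["runs"] += 1

# On exit: write to a new file first, then atomically replace the
# old one, so a crash mid-write cannot clobber the previous state.
with open(STATE_FILE + ".new", "wb") as f:
    pickle.dump(state, f)
os.replace(STATE_FILE + ".new", STATE_FILE)
```

Run it twice and "runs" will be 2 on the second run -- the program 
"remembers".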

This is a convenient way to store Python objects and data 
structures.

Advantage:  Native Python.  Dead simple to use (you still have to be 
  careful about file-writing logic; overwriting old files can be 
  bad, but that part is up to you).  You can dump many Python data 
  structures and objects to disk.

Disadvantage: Files are only readable by Python (excluding 
  motivated implementers in other languages).

If you would like to use pickle or shelve, please ask again on this 
list for specific advice on these.  The shelve module is intended to 
make it easy to have a data structure in memory that is backed by a 
data file on disk.  This is very similar to what the dbm module also 
offers.
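The shelve usage itself is only a few lines -- a shelf opens like a 
file but behaves like a dict whose values are pickled to disk behind 
the scenes (the filename here is made up for the example):

```python
import shelve

# Writing: keys must be strings; values can be most Python objects.
with shelve.open("settings_db") as db:
    db["colour"] = "green"
    db["retries"] = 3

# Reading, later in the same program or in a completely separate
# run: the data is still there, backed by the file on disk.
with shelve.open("settings_db") as db:
    retries = db["retries"]
```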

The pickle module is more geared toward loading an entire data 
structure from the disk into memory.

There are other options, that have been used for decades (see below 
my sig for an incomplete and light-hearted history of serialization 
in the digital world).

The options for serialization formats, and for accessing them from 
Python, are many.  Pickle and shelve are very Python-specific, but 
they are very easy to use and will be more forgiving if you happen 
to try to store some code as well as "pure" data.

If you are going to need to exchange data with other programs, 
consider JSON.  Reading and writing to JSON format is as easy as 
reading and writing to a shelve (which is a Python pickle format 
under the hood).  Here's a two-liner that will take the environment 
of a running program and dump it into a human- and 
machine-readable JSON format.  Step A:

  import os, sys, json
  json.dump(dict(os.environ), sys.stdout, indent=4, sort_keys=True)

Now, let's say that you want to read that in another program (and 
I'll demonstrate just dumping the in-memory representation to your 
terminal).  Step B:

  import sys, json, pprint
  pprint.pprint(json.load(sys.stdin))

So, going back to your original question.

>I am not getting the concept of pickle and shelves in python, I 
>mean what's the use of both the concepts, when to use them in code 
>instead of using file read and write operations.

You can, of course, use file read / write operations whenever you 
need to load data into memory from the disk (or "somewhere") or to 
write data from memory into the disk (or "somewhere").
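If you do it by hand, you invent the on-disk format yourself and 
write both the serializing and the parsing logic -- for example, a 
toy "key=value" text format (entirely made up here) looks something 
like this:

```python
# Write a dict to disk in a home-grown "key=value" line format.
counts = {"apples": 3, "pears": 5}
with open("counts.txt", "w", encoding="utf-8") as f:
    for key, value in counts.items():
        f.write(f"{key}={value}\n")

# Read it back: every parsing decision (separator, types, blank
# lines, encoding) is now your responsibility, forever.
loaded = {}
with open("counts.txt", encoding="utf-8") as f:
    for line in f:
        key, value = line.strip().split("=")
        loaded[key] = int(value)
```

That hand-rolled parsing is exactly the work that libraries like 
pickle, shelve and json take off your hands.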

The idea behind tools and libraries like...

   pickle and shelve (which are Python-specific),
   JSON (flexible and used by many languages and web applications),
   XML (massively flexible, but offering an astonishing, 
       mind-boggling array of controls and tools on the data),
   text or binary files (ultimate flexibility and responsibility for 
       the application developer)

...is to make your job as a programmer easier, by offering you 
libraries and standards that support some of the difficult aspects 
of data validation, encoding and even logic.  Each format can impose 
some limitation -- and choosing the right tool can be tricky, 
because it depends on how much control you need, how many languages 
you need to support and what you're hoping to accomplish.

If you are unsure and you are just beginning to explore 
serialization options, I would suggest learning very well how to 
read and write plain text files for anything intended for humans.  

For machine to machine communication, JSON is an excellent choice 
today, as you will find a wide array of tools, great support from 
the Python standard library and a direct mapping from any JSON you 
find in the world to a data structure that you can load into memory.  

If a future application has very rigid rules, you can layer on logic 
using JSON-Schema or you can look at something like XML, which is 
used for a fair number of complex data interchange formats.

In summary, if you scratch very deeply under the hood, you'll see 
that pickle and shelve both call file.read() and file.write(), but 
they hide all of the logic and complexity of turning your data 
structure into a file on disk.  The json.load() and json.dump() 
calls do the same thing.  There are equivalents in the dbm, csv 
and a few of the XML modules.
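For instance, the csv module does the same serialize/deserialize 
job for tabular data (using an in-memory buffer here so the sketch 
needs no real file):

```python
import csv
import io

# csv.writer turns rows into serialized text...
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["module", "format"])
writer.writerow(["pickle", "binary, Python-only"])

# ...and csv.reader turns that text back into rows, handling
# quoting of the embedded comma for you.
buf.seek(0)
rows = list(csv.reader(buf))
```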

So, pickle and shelve are just well-tested, well-implemented 
libraries that make your job easier, as are many of these other 
data serialization libraries.

I hope the above was helpful in explaining some of the concepts.

-Martin

 [0] https://en.wikipedia.org/wiki/Serialization#Pickle
 [1] Please note, there are probably better places to read about JSON, XML and
     CSV, but the point I'm making here is that they are standardized
     serialization and interchange formats and are reasonably well-specified (with
     the historical exception of CSV).
     JSON: https://tools.ietf.org/html/rfc8259
      CSV: https://tools.ietf.org/html/rfc4180
      XML: https://tools.ietf.org/html/rfc3076

A skewed and degraded history of serialization formats could look something
like this:

 1950s:  You'll need to write a program (on punchcards) to read 
         octal punchouts from this deck of cards.  Or from this 
         super-fancy new magnetic tape drive.  Oh, you wanted it in 
         a searchable data structure?  Well...you'd better put that 
         logic into your application.
 1960s:  Now that we have less expensive disks, how about giving 
         standard input, standard output to each program and having 
         ASCII files. Wouldn't that be cool?
 1970s:  Yo, people -- could we come up with common structures for 
         storing files on disk so we don't have to write the same 
         logic in every application?  Ok...how about ... CSV?
 1980s:  Hello, ISO!  Could we come up with a way to define 
         serialization encodings and formats that are more flexible 
         and shared, since this network thing seems to be starting 
         to reach across national boundaries?  Oh, yes, there's 
         Standard Generalized Markup Language (SGML)
 1990s:  SGML seems a bit complicated and there's this effort to 
         standardize on encodings ("Unicode", eh?).  Can we make 
         SGML tighter?  Yes, it was a good start, but we can tighten 
         it up, let's call it XML.
 2000s:  Wouldn't it be nice to have a flexible and dynamic, 
         Unicode-encoded, multilingual serialization format that we 
         could shoot over the network, store on disks and move 
         between processes that we have running in browsers in many 
         places?  Yes!  Let's call it JSON, JavaScript Object 
         Notation.

-- 
Martin A. Brown
http://linux-ip.net/

