[Tutor] Pickles and Shelves Concept
Martin A. Brown
martin at linux-ip.net
Sat Jun 8 01:11:55 EDT 2019
Hello,
>I am not getting the concept of pickle and shelves in python, I
>mean what's the use of both the concepts, when to use them in code
>instead of using file read and write operations.
>
>Could anyone please explain me the concepts.
I see you already have two answers (from David Rock and Alan
Gauld).
I will add a slightly different answer and try also to explain some
of the history (at a very high level).
* Programs need to take input from "somewhere".
* Programs need data structures in memory on which to operate.
There are many different ways to store data "somewhere" and in order
to create the data structures in memory (on which your program will
operate), you need to have code that knows how to read that data
from "somewhere".
So, there's data that has been written to disk. I often call this
the serialized form of the data [0]. There are different uses
for serialization, but the ones I'll talk about are the serialized
formats that we typically write to disk to store data for a program
to read. Here are a few such serialized formats:
* JSON or XML (homage: SGML)
* pickles and shelves
* GNU dbm, ndbm, cdb, rocksdb and probably 1 million others
* custom binary formats
* plain text files (be careful with your encoding...)
Digression: You might ask... what about SQL? Technically, the
serialization is something that the SQL database software takes care
of and your application doesn't. So no need to know about the
serialized format. This can be freeing, at the cost of some
complexity. But,
back to your question.
Every one of the serialized formats comes with some advantages and
some disadvantages. Some are easy. Some are flexible. Other
formats are structured with bindings in many languages. Some
are tied closely to a single language or even specific language
versions. Some formats are even defined by a single application or
program that somebody has written.
What about pickle and shelve? Where do they fit?
Both pickle and shelve are well-maintained, long-standing
Python-specific formats that allow you to serialize Python objects and data
structures to disk. This is extremely convenient if you are
unlikely to change Python versions or to change your data
structures. Need your program to "remember" something from a prior
run? When it starts up, it can read a ./state.pickle straight into
memory, pick up where it left off and perform some operation, and
then, when complete, save the data structure back to ./state (or
more safely to a new file ./state.$timestamp) and exit.
This is a convenient way to store Python objects and data
structures.
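A minimal sketch of that save-and-restore pattern (the file name
state.pickle and the shape of the state dictionary are just
illustrative choices, not anything the pickle module requires):

```python
import os
import pickle

STATE_FILE = "state.pickle"

# Load prior state if a previous run saved one; otherwise start fresh.
if os.path.exists(STATE_FILE):
    with open(STATE_FILE, "rb") as f:
        state = pickle.load(f)
else:
    state = {"runs": 0, "items": []}

# ... do some work, mutating the in-memory structure ...
state["runs"] += 1
state["items"].append("processed-on-run-%d" % state["runs"])

# Save the structure back to disk for the next run.
with open(STATE_FILE, "wb") as f:
    pickle.dump(state, f)
```

Each time this runs, the counter picks up where the previous run
left off. (As noted above, a safer variant writes to a new
timestamped file rather than overwriting the old one.)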
Advantage: Native Python. Dead simple to use (you still have to be
careful about file-writing logic; overwriting old files can be
bad, but that part is up to you). You can dump many Python data
structures and objects to disk.
Disadvantages: Files are only readable by Python (excluding
motivated implementers in other languages).
If you would like to use pickle or shelve, please ask again on this
list for specific advice on these. The shelve module is intended to
make it easy to have a data structure in memory that is backed by a
data file on disk. This is very similar to what the dbm module also
offers.
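A small sketch of that dict-backed-by-a-file idea (the file name
"inventory" and the keys are invented for illustration):

```python
import shelve

# A shelf behaves like a dictionary whose values are pickled
# to a file on disk as you assign them.
with shelve.open("inventory") as db:
    db["apples"] = 12
    db["pears"] = {"count": 3, "ripe": True}

# A later run (or another program) can reopen the same file
# and read the values back.
with shelve.open("inventory") as db:
    apples = db["apples"]
    pears = db["pears"]
```

Note that the actual file created on disk (inventory.db,
inventory.dat, etc.) varies by platform, because shelve delegates
the on-disk storage to whichever dbm backend is available.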
The pickle module is more geared toward loading an entire data
structure from the disk into memory.
There are other options, that have been used for decades (see below
my sig for an incomplete and light-hearted history of serialization
in the digital world).
The options for serialization formats, and for accessing them from
Python, are many. Pickle and shelve are very Python-specific, but
they are very easy to use and will be more forgiving if you happen
to try to store some code as well as "pure" data.
If you are going to need to exchange data with other programs,
consider JSON. Reading and writing to JSON format is as easy as
reading and writing to a shelve (which is a Python pickle format
under the hood). Here's a two-liner that will take the environment
of a running program and dump it into a human- and
machine-readable JSON format. Step A:
import os, sys, json
json.dump(dict(os.environ), sys.stdout, indent=4, sort_keys=True)
Now, let's say that you want to read that in another program (and
I'll demonstrate just dumping the in-memory representation to your
terminal). Step B:
import sys, json, pprint
pprint.pprint(json.load(sys.stdin))
So, going back to your original question.
>I am not getting the concept of pickle and shelves in python, I
>mean what's the use of both the concepts, when to use them in code
>instead of using file read and write operations.
You can, of course, use file read / write operations whenever you
need to load data into memory from the disk (or "somewhere") or to
write data from memory into the disk (or "somewhere").
The idea behind tools and libraries like...
* pickle and shelve (which are Python-specific),
* JSON (flexible and used by many languages and web applications),
* XML (massively flexible, but offering a mind-boggling array of
controls and tools on the data),
* text or binary files (ultimate flexibility and responsibility for
the application developer)
...is to make your job as a programmer easier, by offering you
libraries and standards that support some of the difficult aspects
of data validation, encoding and even logic. Each format can impose
some limitation -- and choosing the right tool can be tricky,
because it depends on how much control you need, how many languages
you need to support and what you're hoping to accomplish.
If you are unsure and you are just beginning to explore
serialization options, I would suggest learning very well how to
read and write plain text files for anything intended for humans.
For machine to machine communication, JSON is an excellent choice
today, as you will find a wide array of tools, great support from
the Python standard library and a direct mapping from any JSON you
find in the world to a data structure that you can load into memory.
If a future application has very rigid rules, you can layer on logic
using JSON-Schema or you can look at something like XML, which is
used for a fair number of complex data interchange formats.
In summary, if you scratch deeply under the hood, you'll see
that pickle and shelve both call file.read() and file.write(), but
they hide all of the logic and complexity of turning your data
structure into a file on disk. The json.load() and json.dump()
calls are doing the same thing. There are equivalents in the dbm,
csv and a few of the XML modules.
So, pickle and shelve are just well-tested, well-implemented
libraries that make your job easier, as are many of these other
data serialization libraries.
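For instance, the csv module hides the same read/write plumbing
behind reader and writer objects (the file name fruit.csv and the
rows here are just illustrative):

```python
import csv

rows = [["name", "qty"], ["apples", "12"], ["pears", "3"]]

# The writer turns each list into a line of comma-separated text.
with open("fruit.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# The reader does the reverse, handling quoting and commas for you.
with open("fruit.csv", newline="") as f:
    loaded = list(csv.reader(f))
```

The round trip gives you back the same lists of strings you wrote,
without your application ever touching the quoting rules itself.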
I hope the above was helpful in explaining some of the concepts.
-Martin
[0] https://en.wikipedia.org/wiki/Serialization#Pickle
[1] Please note, there are probably better places to read about JSON, XML and
CSV, but the point I'm making here is that they are standardized
serialization and interchange formats and are reasonably well-specified (with
the historical exception of CSV).
JSON: https://tools.ietf.org/html/rfc8259
CSV: https://tools.ietf.org/html/rfc4180
XML: https://tools.ietf.org/html/rfc3076
A skewed and degraded history of serialization formats could look something
like this:
1950s: You'll need to write a program (on punchcards) to read
octal punchouts from this deck of cards. Or from this
super-fancy new magnetic tape drive. Oh, you wanted it in
a searchable data structure? Well...you'd better put that
logic into your application.
1960s: Now that we have less expensive disks, how about giving
standard input, standard output to each program and having
ASCII files. Wouldn't that be cool?
1970s: Yo, people -- could we come up with common structures for
storing files on disk so we don't have to write the same
logic in every application? Ok...how about ... CSV?
1980s: Hello, ISO! Could we come up with a way to define
serialization encodings and formats that are more flexible
and shared, since this network thing seems to be starting
to reach across national boundaries? Oh, yes, there's
Standard Generalized Markup Language (SGML)
1990s: SGML seems a bit complicated and there's this effort to
standardize on encodings ("Unicode", eh?). Can we make
SGML tighter? Yes, it was a good start, but we can tighten
it up, let's call it XML.
2000s: Wouldn't it be nice to have a flexible and dynamic,
Unicode-encoded, multilingual serialization format that we
could shoot over the network, store on disks and move
between processes that we have running in browsers in many
places? Yes! Let's call it JSON, JavaScript Object
Notation.
--
Martin A. Brown
http://linux-ip.net/