MemoryError and Pickle

Steve D'Aprano steve+python at pearwood.info
Mon Nov 21 20:40:19 EST 2016


On Tue, 22 Nov 2016 10:27 am, Fillmore wrote:

> 
> Hi there, Python newbie here.
> 
> I am working with large files. For this reason I figured that I would
> capture the large input into a list and serialize it with pickle for
> later (faster) usage.
> Everything has worked beautifully until today when the large data (1GB)
> file caused a MemoryError :(

At what point do you run out of memory? When building the list? If so, then
you need more memory, or smaller lists, or avoid creating a giant list in
the first place.

If you can successfully build the list, but then run out of memory when
trying to pickle it, then you may need another approach.
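
One such approach (a rough, untested sketch -- it assumes the rest of your
script can cope with getting the records back one at a time, and it re-uses
your filename variable) is to dump each record as its own little pickle
instead of pickling one giant list:

import sys
import pickle

with open(filename, "wb") as f:
    for line in sys.stdin:
        try:
            t, w, u = line.strip().split("\t")
        except ValueError:
            print("Problem with line:", line, file=sys.stderr)
            continue
        # each dump() writes one small, self-contained pickle
        pickle.dump({"ta": t, "wa": w, "ua": u}, f)

and later read them back with repeated load() calls:

import pickle

with open(filename, "rb") as f:
    while True:
        try:
            record = pickle.load(f)
        except EOFError:
            break
        # ... process record here ...

That way neither writing nor reading ever needs the whole data set in
memory at once.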

But as always, to really be sure what is going on, we need to see the full
traceback (not just the "MemoryError" part) and preferably a short, simple
example that replicates the error:

http://www.sscce.org/



> Question for experts: is there a way to refactor this so that data may
> be filled/written/released as the scripts go and avoid the problem?

I'm not sure what you are doing with this data. I guess you're not just:

- read the input, one line at a time
- create a giant data list
- pickle the list

and then never look at the pickle again.

I imagine that you want to process the list in some way, but how and where
and when is a mystery. But most likely you will later do:

- unpickle the list, creating a giant data list again
- process the data list

So I'm not sure what advantage the pickle gives you, except as make-work. Maybe
I've missed something, but if you're running out of memory processing the
giant list, perhaps a better approach is:

- read the input, one line at a time
- process that line


and avoid building the giant list or the pickle at all.
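
Something along these lines (just a sketch -- process_record is a stand-in
for whatever you actually do with each record):

import sys

def process_record(t, w, u):
    # stand-in for your real processing
    ...

for line in sys.stdin:
    try:
        t, w, u = line.strip().split("\t")
    except ValueError:
        print("Problem with line:", line, file=sys.stderr)
        continue
    process_record(t, w, u)

No list, no pickle, and memory use stays flat no matter how big the input
is.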


> code below.
> 
> Thanks
>
> data = list()
> for line in sys.stdin:
>      try:
>          parts = line.strip().split("\t")
>          t = parts[0]
>          w = parts[1]
>          u = parts[2]
>          #let's retain in-memory copy of data
>          data.append({"ta": t,
>                       "wa": w,
>                       "ua": u
>          })
>      except IndexError:
>          print("Problem with line :"+line, file=sys.stderr)
>          pass
> 
> #time to save data object into a pickle file
> 
> fileObject = open(filename,"wb")
> pickle.dump(data,fileObject)
> fileObject.close()

Let's re-write some of your code to make it better:

data = []
for line in sys.stdin:
    try:
        t, w, u = line.strip().split("\t")
    except ValueError:
        print("Problem with line:", line, file=sys.stderr)
        continue  # skip the malformed line; don't append stale values
    data.append({"ta": t, "wa": w, "ua": u})

with open(filename, "wb") as fileObject:
    pickle.dump(data, fileObject)


It's not obvious where you are running out of memory, but my guess is that it
is most likely while building the giant list. You have a LOT of small
dicts, each one with exactly the same set of keys. You can probably save a
lot of memory by using a tuple, or better, a namedtuple.

py> from collections import namedtuple
py> struct = namedtuple("struct", "ta wa ua")
py> x = struct("abc", "def", "ghi")
py> y = {"ta": "abc", "wa": "def", "ua": "ghi"}
py> sys.getsizeof(x)
36
py> sys.getsizeof(y)
144


So each of those little dicts {"ta": t, "wa": w, "ua": u} in your list
potentially uses as much as four times the memory a namedtuple would use.
So using namedtuple might very well save enough memory to avoid the
MemoryError altogether.


from collections import namedtuple
struct = namedtuple("struct", "ta wa ua")
data = []
for line in sys.stdin:
    try:
        t, w, u = line.strip().split("\t")
    except ValueError:
        print("Problem with line:", line, file=sys.stderr)
        continue  # skip the malformed line
    data.append(struct(t, w, u))

with open(filename, "wb") as fileObject:
    pickle.dump(data, fileObject)


And as a bonus, when you come to use the record, instead of having to write:

    line["ta"]

to access the first field, you can write:

    line.ta
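
One small caveat if you do keep the pickle: when you later load it back,
the loading script needs the same namedtuple definition available under
the same name, or pickle won't be able to rebuild the records. Roughly
(assuming filename is the same file you dumped to):

from collections import namedtuple
import pickle

struct = namedtuple("struct", "ta wa ua")   # same definition as when dumping

with open(filename, "rb") as fileObject:
    data = pickle.load(fileObject)

for record in data:
    print(record.ta, record.wa, record.ua)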



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



