Decoding a huge JSON file incrementally

Paul Moore p.f.moore at gmail.com
Thu Dec 20 12:26:46 EST 2018


On Thu, 20 Dec 2018 at 17:22, Chris Angelico <rosuav at gmail.com> wrote:
>
> On Fri, Dec 21, 2018 at 2:44 AM Paul Moore <p.f.moore at gmail.com> wrote:
> >
> > I'm looking for a way to incrementally decode a JSON file. I know this
> > has come up before, and in general the problem is not soluble (because
> > in theory the JSON file could be a single object). In my particular
> > situation, though, I have a 9GB file containing a top-level array
> > object, with many elements. So what I could (in theory) do is parse
> > one element at a time, yielding each one as I go.
> >
> > The problem is that the stdlib JSON library reads the whole file,
> > which defeats my purpose. What I'd like is for it to read one
> > complete element, then read just far enough ahead to find out that the
> > parse was done, and return the object it found (it should probably
> > also return the "next token", as it can't reliably push it back - I'd
> > check that it was a comma before proceeding with the next list
> > element).
>
> It IS possible to do an incremental parse, but for that to work, you
> would need to manually strip off the top-level array structure. What
> you'd need to use would be this:
>
> https://docs.python.org/3/library/json.html#json.JSONDecoder.raw_decode
>
> It'll parse stuff and then tell you about what's left. Since your data
> isn't coming from a ginormous string, but is coming from a file,
> you're probably going to need something like this:
>
> import json
>
> def get_stuff_from_file(f):
>     buffer = ""
>     dec = json.JSONDecoder()
>     while "not eof":
>         while "no object yet":
>             try: obj, pos = dec.raw_decode(buffer)
>             except json.JSONDecodeError:
>                 chunk = f.read(1024)
>                 if not chunk: return  # EOF: nothing decodable left
>                 # raw_decode won't skip leading whitespace, so strip it
>                 buffer = (buffer + chunk).lstrip()
>             else: break
>         yield obj
>         # keep the unconsumed tail, minus the separating comma
>         buffer = buffer[pos:].lstrip().lstrip(",").lstrip()
>
> Proper error handling is left as an exercise for the reader, both in
> terms of JSON errors and file errors. Also, the code is completely
> untested. Have fun :)
>
> The basic idea is that you keep on grabbing more data till you can
> decode an object, then you keep whatever didn't get used up ("pos"
> marks where the unconsumed part begins). Algorithmic complexity should
> be O(n) as long as your objects are relatively small, and you can
> optimize disk access by tuning your buffer size to be at least the
> average size of an object.
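
(To make the "pos" part concrete, here's a quick interactive-style check on a
toy buffer; if I have this right, raw_decode hands back the first decoded
value plus the index where the leftover text starts:

    >>> import json
    >>> json.JSONDecoder().raw_decode('{"a": 1}, {"b": 2}')
    ({'a': 1}, 8)

so buffer[8:] is the ', {"b": 2}' that still needs decoding.)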
>
> Hope that helps.
>
> ChrisA
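
Along the same lines, a rough and untested sketch of driving that generator
end to end. "huge.json" is a made-up filename, get_stuff_from_file is the
function from Chris's sketch above (assumed to be defined in the same module),
and the only extra step is consuming the opening "[" of the top-level array by
hand, as he suggests:

    # assumes get_stuff_from_file from the sketch above is defined here
    with open("huge.json", encoding="utf-8") as f:
        # "strip off the top-level array structure": consume everything up
        # to and including the opening '[' before handing over the file
        while True:
            ch = f.read(1)
            if ch in ("[", ""):
                break
        for element in get_stuff_from_file(f):
            # each element arrives as an ordinary Python object; process it
            # here one at a time instead of holding the whole 9GB in memory
            print(element)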


