Traversal help

R. Bryan Smith r.bryan.smith at gmail.com
Sun Apr 2 21:46:07 EDT 2017


Hello,

I am working with Python 3.6. I’ve been trying to figure out a solution to my question for about 40 hours, with no success and hundreds of failed attempts. Essentially, I have bitten off way more than I can chew with processing this file. Most of what follows is my attempt to explain the situation as best I can.

I have a JSONL (new line) file that I downloaded using requests and the following code: 
with open(fname, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=1024):
        fd.write(chunk)

The file came in gzip format (the API says the encoding is UTF-8) and was downloaded to a Windows 8 machine (it’s up to date, but still 8).
These files are rather large, maybe around 4GB.
I used the encoding declaration for ‘UTF-8’ at the top of my Python program: # -*- encoding: utf-8 -*-

After I save the file, I read it using this:
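import codecs
import json

def read_json(path):
    '''Turns a normal json LINES (cr) file into an array of objects'''
    temp_array = []
    # 'backslashreplace' turns undecodable bytes into \xNN escapes instead of raising
    with codecs.open(path, 'r', 'utf-8', 'backslashreplace') as f:
        for line in f:
            # strip() removes the trailing newline / carriage return
            record = json.loads(line.strip())
            temp_array.append(record)
    return temp_array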
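
For files of this size, a generator that yields one record at a time may be easier on memory than building the whole list. The following is only a sketch under some assumptions: the name iter_jsonl is made up for illustration, and it guesses that a path ending in '.gz' is still gzip-compressed. It uses the built-in open() and gzip.open() in text mode, which split lines only on \n, \r and \r\n, so a record containing some other Unicode line separator stays on one line.

```python
import gzip
import json

def iter_jsonl(path):
    """Yield one decoded object per non-blank line of a JSON Lines file."""
    # Assumption: a '.gz' suffix means the file is still gzip-compressed.
    opener = gzip.open if path.endswith('.gz') else open
    # 'rt' = text mode; undecodable bytes become \xNN escapes instead of raising.
    with opener(path, 'rt', encoding='utf-8', errors='backslashreplace') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines rather than failing in json.loads
                yield json.loads(line)
```

It could be used as "for record in iter_jsonl(fname): ...", handling each record before the next one is read.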

I am working on a Linux server to partition the list returned above. There are three linked levels of detail (A, B, C) that can exist in any collection within the JSON, and each collection can contain wildly varying and/or repeating fields. The data contained within is scraped from websites all over the world. I wanted to ‘traverse’ the file structure and found an algorithm that I think will work:

def traverser(obj, path=None, callback=None):
    if path is None:
        path = []
        
    if isinstance(obj, dict):
        value = {k: traverser(v, path+[k], callback)
                 for k, v in obj.items()}
    elif isinstance(obj, list):
        value = [traverser(elem, path+[[]], callback)
                 for elem in obj]
    else:
        value = obj
    
    if callback is None:
        return value
    else:
        return callback(path, value)
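
As a hedged illustration of how the callback argument might be used, the sketch below records every scalar value together with the path that leads to it. The names collect_leaves and path_to_string are hypothetical helpers, not part of the algorithm I found, and traverser is repeated only so the snippet runs on its own.

```python
def traverser(obj, path=None, callback=None):
    # Same function as above, repeated so this snippet is self-contained.
    if path is None:
        path = []
    if isinstance(obj, dict):
        value = {k: traverser(v, path + [k], callback) for k, v in obj.items()}
    elif isinstance(obj, list):
        value = [traverser(elem, path + [[]], callback) for elem in obj]
    else:
        value = obj
    if callback is None:
        return value
    return callback(path, value)

def collect_leaves(record):
    """Return [(path, value), ...] for every scalar found in one record."""
    leaves = []

    def callback(path, value):
        # Containers are passed through; only scalars are recorded.
        if not isinstance(value, (dict, list)):
            leaves.append((path, value))
        return value

    traverser(record, callback=callback)
    return leaves

def path_to_string(path):
    """Render a path such as ['c', []] as 'c.[]' ([] marks a list element)."""
    return '.'.join(p if isinstance(p, str) else '[]' for p in path)
```

Note that the traverser marks list elements with an empty list rather than an index, so two items of the same list end up with identical paths.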

The only problem, and my question, is this: I have yet to decode the file successfully. How do I ‘collect’ each of these objects into some sort of container while traversing the JSON Lines collection (handling encoding errors along the way), so that I can then write them to a CSV file that is ‘utf-8’ encoded and won’t error out when I import it into an IBM ‘utf-8’ encoded DB? After that, I would also like to learn how to grab a specific element, if present in each collection, whenever I need it, but that can wait.
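
To make the goal concrete, here is a minimal sketch of one possible shape for the ‘collect, then write CSV’ step. It assumes each record is flattened into one row whose column names are dotted paths with list indices; flatten and write_csv are hypothetical names, and the column layout is a guess rather than anything the target DB requires.

```python
import csv

def flatten(obj, prefix=''):
    """Return {dotted_path: scalar_value} for one decoded JSON record."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f'{prefix}.{key}' if prefix else str(key)))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f'{prefix}[{index}]'))
    else:
        flat[prefix] = obj
    return flat

def write_csv(records, out_file):
    """Write flattened records to an open text file; return the column names."""
    rows = [flatten(record) for record in records]
    # Records have varying fields, so the header is the union of all paths.
    fieldnames = sorted({name for row in rows for name in row})
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames
```

The output file would be opened with something like open('out.csv', 'w', newline='', encoding='utf-8') so that the csv module controls line endings and the text is written as UTF-8. This version keeps every row in memory, which may not suit a 4GB input; a two-pass variant (first collect the column names, then write rows) would avoid that.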

I’ve tried using the json module on the JSONL file, but the structure is really complicated and changing, with lots of different control and spacing characters, in addition to some odd (potentially non-Unicode) characters. Here’s the schema: http://json-schema.org/draft-04/schema#

I’m not a programmer, but I am learning through assimilation. Any help is greatly appreciated, even if it’s just pointing me to documentation that can help me learn what to consider and lead me to what to do.

Thank you,
R. Smith



