Which is the better way to parse this file?

Terry Reedy tjreedy at udel.edu
Tue Sep 2 15:42:34 EDT 2003


"Roberto A. F. De Almeida" <roberto at dealmeida.net> wrote in message
news:10c662fe.0309020909.57817c13 at posting.google.com...

> This is a Dataset Descriptor for the Data Access Protocol
> (http://www.unidata.ucar.edu/packages/dods/design/dap-rfc-html/), an
> API to access remote datasets. DAP servers describe their datasets
> using this grammar, and I'm developing a module to access DAP
> servers.

OK.  The grammar is externally fixed and, I presume, constant across
DAP servers.  Even so, if the three collection types are functionally
the same for *your* purpose, you can parse according to a simplified
grammar.  For recursive descent, you only need two parse functions: one
(recursive) for collections and one (terminal) for types.  And the
latter can be inlined in the former, since it is only called in one place.

Start with a generator function (or iterator class) 'worderator'
initialized with a filename or input string that returns words/tokens
one at a time.  The main problem is getting rid of the ;s.  You also need
to separate obrack or cbrack ({ or }) from adjacent words if they are not
always blank-separated as in your example.
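
If the brackets and semicolons are not reliably blank-separated, a regular
expression can do the splitting instead.  What follows is only a sketch
under assumptions that are mine, not the DAP grammar's: names never contain
whitespace, braces, or semicolons, and semicolons carry no information
once the tokens are separated.  The name worderator_re is hypothetical, and
the sketch is written for current Python 3 rather than 2.2/2.3.

```python
import re

# A token is either a single brace, or a run of characters that contains
# no whitespace, braces, or semicolons.  Semicolons and whitespace match
# neither alternative, so finditer() simply skips over them.
TOKEN = re.compile(r"[{}]|[^\s{};]+")

def worderator_re(inp):
    """Yield the tokens of a descriptor string one at a time."""
    for match in TOKEN.finditer(inp):
        yield match.group()

# Tightly packed input with no blanks around the punctuation:
print(list(worderator_re("structure{float64 latitude;float64 longitude;}location;")))
```

On blank-separated input like your sample it should yield the same tokens
as the simpler split()-based generator.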

Code that works on your sample data:

input= """dataset {
    int catalog_number;
    sequence {
       string experimenter;
       int32 time;
       structure {
          float64 latitude;
          float64 longitude;
       } location;
       sequence {
          float depth;
          float temperature;
       } xbt;
    } casts;
 } data;"""

def worderator(inp):
    # generator: needs 2.3, or 2.2 with "from __future__ import generators"
    for tok in inp.split():
        if tok.endswith(';'):
            tok = tok[:-1]  # drop the trailing declaration terminator
        yield tok

# check with: for i in worderator(input): print i,

obrack,cbrack = '{', '}'
toktype = {'string':str, 'int': int, 'int32': int, 'float': float,
'float64': float}

def collection(): # call after seeing collection keyword
    d = {}
    tok = word.next()
    if tok != obrack:
        raise ValueError("Expected %s, got %s" % (obrack,tok) )
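    while 1:
        tok = word.next()
        if tok == cbrack:
            break
        elif tok in toktype:
            # simple declaration: "<type> <name>;"
            d[word.next()] = toktype[tok]
        elif tok == 'sequence' or tok == 'structure':
            # nested collection: recurse, then store it under its own name
            nam, dic = collection()
            d[nam] = dic
        else:
            raise ValueError("Unexpected token: %s" % tok)
    return word.next(), d  # the collection's name follows the closing bracket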

word = worderator(input)
tok = word.next()
if tok == 'dataset':
    # assume we always want the top-level collection called 'data',
    # regardless of the name given in the input
    data = collection()[1]
else:
    raise ValueError("Started with %s instead of 'dataset'" % tok)

>>> import pprint
>>> pprint.pprint(data)
{'casts': {'experimenter': <type 'str'>,
           'location': {'latitude': <type 'float'>,
                        'longitude': <type 'float'>},
           'time': <type 'int'>,
           'xbt': {'depth': <type 'float'>, 'temperature': <type 'float'>}},
 'catalog_number': <type 'int'>}
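
For anyone on a newer interpreter, here is a rough sketch of the same
recursive-descent idea written for Python 3, where next(it) replaces
it.next().  It passes the token iterator as an argument instead of using a
module-level variable.  The names tokens, parse, SCALARS and NESTED are my
own for illustration, and it makes the same simplifying assumption as
above: sequences and structures are treated alike.

```python
SCALARS = {'string': str, 'int': int, 'int32': int,
           'float': float, 'float64': float}
NESTED = ('sequence', 'structure')

def tokens(text):
    # Blank-separated tokens with any trailing ';' dropped.
    for tok in text.split():
        yield tok[:-1] if tok.endswith(';') else tok

def collection(words):
    """Parse '{ ... } name' from the shared iterator; return (name, dict)."""
    tok = next(words)
    if tok != '{':
        raise ValueError("Expected {, got %s" % tok)
    d = {}
    for tok in words:
        if tok == '}':
            return next(words), d          # name follows the closing brace
        elif tok in SCALARS:
            d[next(words)] = SCALARS[tok]  # "<type> <name>;"
        elif tok in NESTED:
            name, inner = collection(words)
            d[name] = inner
        else:
            raise ValueError("Unexpected token: %s" % tok)
    raise ValueError("Missing closing }")

def parse(text):
    """Parse a whole descriptor and return the top-level dict."""
    words = tokens(text)
    tok = next(words)
    if tok != 'dataset':
        raise ValueError("Started with %s instead of 'dataset'" % tok)
    return collection(words)[1]

sample = ("dataset { int catalog_number; "
          "structure { float64 latitude; } location; } data;")
print(parse(sample))
```

Because every call shares one iterator, the recursive call simply picks up
where the caller left off, which is the same thing the global word object
does in the code above.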


Terry J. Reedy







