Extracting attributes from compiled python code or parse trees

Mon Jul 23 18:43:33 EDT 2007

On Mon, 23 Jul 2007 18:13:05 -0300, Matteo <mahall at ncsa.uiuc.edu> wrote:

> I am trying to get Python to extract attributes in full dotted form
> from compiled expression. For instance, if I have the following:
>
> param = compile('a.x + a.y','','single')
>
> then I would like to retrieve the list consisting of ['a.x','a.y'].
>
> The reason I am attempting this is to try and automatically determine
> data dependencies in a user-supplied formula (in order to build a
> dataflow network). I would prefer not to have to write my own parser
> just yet.

If it is an expression, I think you should use "eval" instead of "single"  
as the third argument to compile.
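
As a rough illustration (a sketch only, written for a current Python 3
interpreter rather than the 2.5 discussed here; the SimpleNamespace object
and the '<formula>' filename are made up for the example), compiling in
"eval" mode yields a code object that eval() can evaluate to a value. Its
co_names attribute lists the names used, but separately rather than in
dotted form:

```python
from types import SimpleNamespace

source = 'a.x + a.y'

# 'eval' mode compiles a single expression whose value eval() returns.
code = compile(source, '<formula>', 'eval')

# A stand-in object (hypothetical) so the expression can be evaluated.
a = SimpleNamespace(x=1, y=2)
result = eval(code, {'a': a})

# The code object records the names it uses as separate strings,
# so the dotted form 'a.x' cannot be read off directly from here.
names_used = code.co_names
```

So the compiled code alone is probably not enough to recover 'a.x' and
'a.y' without inspecting the bytecode.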

> Alternatively, I've looked at the parser module, but I am experiencing
> some difficulties in that the symbol list does not seem to match that
> listed in the python grammar reference (not surprising, since I am
> using python2.5, and the docs seem a bit dated)

Yes, the grammar.txt in the docs is a bit outdated (or perhaps it's a  
simplified one), see the Grammar/Grammar file in the Python source  
distribution.

> In particular:
>
>>>> import parser
>>>> import pprint
>>>> import symbol
>>>> tl=parser.expr("a.x").tolist()
>>>> pprint.pprint(tl)
>
> [258,
>  [326,
>   [303,
>    [304,
>     [305,
>      [306,
>       [307,
>        [309,
>         [310,
>          [311,
>           [312,
>            [313,
>             [314,
>              [315,
>               [316, [317, [1, 'a']], [321, [23, '.'], [1, 'x']]]]]]]]]]]]]]]],
>  [4, ''],
>  [0, '']]
>
>>>> print symbol.sym_name[316]
> power
>
> Thus, for some reason, 'a.x' seems to be interpreted as a power
> expression, and not an 'attributeref' as I would have anticipated (in
> fact, the symbol module does not seem to contain an 'attributeref'
> symbol)

Using this little helper function to translate symbols and tokens:

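import symbol
import token

# Map numeric grammar symbols and token types to their names.
names = symbol.sym_name.copy()
names.update(token.tok_name)

def human_readable(lst):
    # Replace the numeric node type in place, then recurse into child nodes.
    lst[0] = names[lst[0]]
    for item in lst[1:]:
        if isinstance(item, list):
            human_readable(item)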

the same tree becomes:

['eval_input',
  ['testlist',
   ['test',
    ['or_test',
     ['and_test',
      ['not_test',
       ['comparison',
        ['expr',
         ['xor_expr',
          ['and_expr',
           ['shift_expr',
            ['arith_expr',
             ['term',
              ['factor',
               ['power',
                ['atom', ['NAME', 'a']],
                ['trailer', ['DOT', '.'], ['NAME', 'x']]]]]]]]]]]]]]]],
  ['NEWLINE', ''],
  ['ENDMARKER', '']]

which is correct if you look at the symbols in the (right) Grammar file:
there, attribute access is not a separate symbol but a trailer ('.' NAME)
inside a power node.
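
Given a tree in that human-readable form, the dotted names could be
collected with something like the following. This is only a sketch: the
function name dotted_names is made up, it works on the plain nested lists
(so it does not need the parser module itself), and it assumes the simple
shape shown above, stopping at the first call, subscript, or '**'.

```python
def dotted_names(node):
    """Collect names such as 'a.x' from a human-readable parse tree."""
    found = []
    if node[0] == 'power' and len(node) > 1 and node[1][0] == 'atom':
        atom = node[1]
        # Only handle the simple case: an atom that is a single NAME token.
        if len(atom) == 2 and atom[1][0] == 'NAME':
            parts = [atom[1][1]]
            for trailer in node[2:]:
                # An attribute trailer looks like
                # ['trailer', ['DOT', '.'], ['NAME', attr]].
                if (trailer[0] == 'trailer' and len(trailer) == 3
                        and trailer[1][0] == 'DOT'):
                    parts.append(trailer[2][1])
                else:
                    break  # call, subscript, or '**': end of the chain
            found.append('.'.join(parts))
    # Recurse into child nodes; leaf tokens hold strings, not lists.
    for child in node[1:]:
        if isinstance(child, list):
            found.extend(dotted_names(child))
    return found


# A shortened version of the tree shown above (intermediate levels omitted).
tree = ['eval_input',
        ['testlist',
         ['power',
          ['atom', ['NAME', 'a']],
          ['trailer', ['DOT', '.'], ['NAME', 'x']]]],
        ['NEWLINE', ''],
        ['ENDMARKER', '']]
result = dotted_names(tree)
```

Note that a bare name such as 'a' also appears as a power node, so it
would be reported too, which seems reasonable for dependency tracking.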

But if you are only interested in things like a.x, it may be a lot
simpler to use the tokenize module, looking for the NAME and OP tokens as
they appear in the source expression.
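
A minimal sketch of that idea follows, written for a current Python 3
interpreter (in 2.5 the tokens are plain tuples and StringIO lives in a
different module). The function name is made up, and the approach is
deliberately naive: it only chains NAME '.' NAME sequences, so something
like f(a).x would need extra handling.

```python
import io
import keyword
import tokenize


def dotted_names_from_source(source):
    """Return dotted names (e.g. 'a.x') in order of first appearance."""
    names = []
    current = []       # parts of the dotted name being built
    after_dot = False  # True right after a '.' that follows a name

    def flush():
        dotted = '.'.join(current)
        if dotted and dotted not in names:
            names.append(dotted)

    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        is_name = (tok.type == tokenize.NAME
                   and not keyword.iskeyword(tok.string))
        if is_name and after_dot:
            # Continue the current chain: a.x -> ['a', 'x']
            current.append(tok.string)
            after_dot = False
        elif is_name:
            # A new chain starts; record the previous one first.
            flush()
            current = [tok.string]
        elif (tok.type == tokenize.OP and tok.string == '.'
              and current and not after_dot):
            after_dot = True
        else:
            # Any other token (operator, keyword, number, ...) ends the chain.
            flush()
            current = []
            after_dot = False
    flush()
    return names


result = dotted_names_from_source('a.x + a.y')
```

For the expression in the original question this gives the list
['a.x', 'a.y'].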

-- 
Gabriel Genellina