Lexer/Parser question: TPG

Johannes Bauer dfnsonfsduifb at gmx.de
Mon Nov 14 16:13:41 EST 2016


Hi group,

this is not really a Python question, but I use Python to lex/parse some
input. In particular, I use the amazing TPG (http://cdsoft.fr/tpg/).
However, I'm now stuck at a point and am sure I'm not doing something
correctly -- since there's a bunch of really smart people here, I hope
to get some insights. Here we go:

I've created a minimal example in which I'm trying to parse some tokens
(strings and ints in the minimal example). Strings are delimited by
parentheses, i.e. ( and ). Therefore

(Foo) -> "Foo"

Parentheses inside a string are taken literally when balanced. If they
are not balanced, it's a parsing error.

(Foo (Bar)) -> "Foo (Bar)"

Parentheses may be escaped with a backslash:

(Foo \)Bar) -> "Foo )Bar"
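
To make the intended semantics concrete, here is a minimal plain-Python
sketch of the three rules above. It is not TPG code, and the function name
parse_string is made up purely for illustration; it only shows the results
I expect for the examples given.

```python
def parse_string(text, pos=0):
    """Parse one parenthesised string starting at text[pos].

    Returns (value, next_pos); raises ValueError on malformed input.
    parse_string is a hypothetical helper, not part of TPG.
    """
    if pos >= len(text) or text[pos] != "(":
        raise ValueError("expected '(' at position %d" % pos)
    pos += 1
    parts = []
    while pos < len(text):
        ch = text[pos]
        if ch == "\\":
            # Escape: take the following character literally.
            if pos + 1 >= len(text):
                raise ValueError("dangling backslash")
            parts.append(text[pos + 1])
            pos += 2
        elif ch == "(":
            # Nested, balanced parentheses are kept literally.
            inner, pos = parse_string(text, pos)
            parts.append("(" + inner + ")")
        elif ch == ")":
            # Matching close for this level: done.
            return "".join(parts), pos + 1
        else:
            parts.append(ch)
            pos += 1
    raise ValueError("unbalanced parentheses")


if __name__ == "__main__":
    for example in ["(Foo)", "(Foo (Bar))", r"(Foo \)Bar)"]:
        print(example, "->", repr(parse_string(example)[0]))
```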

In my first (naive) attempt, I ignored the escaping and went with lexing
and then these rules:

token string_token '[^()]*';

[...]

String/s -> start_string                   $ s = ""
       (
           string_token/e                  $ s += e
           | String/e                      $ s += "(" + e + ")"
       )*
       end_string
       ;

This worked somewhat, admittedly with some erroneous parsing. In my
second attempt, I tried to do it properly. I omitted the tokenization
and instead used inline terminals (which have precedence in TPG):

String/s -> start_string               $ s = ""
       (
           '\\.'/e                     $ s += "ESCAPED[" + e + "]"
           | '[^\\()]+'/e              $ s += e
           | String/e                  $ s += "(" + e + ")"
       )*
       end_string
       ;

(the "ESCAPED" part is just for demonstration to get the idea).

While the latter parser parses all strings perfectly, it is no longer
able to parse anything else (including integer values!). Instead, it
appears to match the inline terminal '[^\\()]+' against my integer and
then dies (for example, when trying to parse "12345"):

[  1][ 3]START.Expression.Value: (1,1) _tok_2 12345 != integer
[  2][ 3]START.Expression.String: (1,1) _tok_2 12345 != start_string
Traceback (most recent call last):
  File "example.py", line 56, in <module>
    print(Parser()(example))
  File "example/tpg.py", line 942, in __call__
    return self.parse('START', input, *args, **kws)
  File "example/tpg.py", line 1125, in parse
    return Parser.parse(self, axiom, input, *args, **kws)
  File "example/tpg.py", line 959, in parse
    value = getattr(self, axiom)(*args, **kws)
  File "<string>", line 3, in START
  File "<string>", line 14, in Expression
UnboundLocalError: local variable 'e' referenced before assignment

"_tok_2" seems to correspond to one of the inline terminal symbols, the
only one that fits would be '[^\\()]+'. But why would that *ever* match?
I thought it'd only match once a "start_string" was encountered (which
it isn't).
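
The trace would at least be consistent with a lexer that tries every
terminal at every input position, regardless of which grammar rule is
currently active. The following plain-Python sketch illustrates that
assumption only; it is not TPG's actual implementation. The token table,
its ordering, the name _tok_1 and the integer pattern are all guesses on my
part; only the two inline patterns are taken from the grammar above.

```python
import re

# Hypothetical token table in priority order (inline terminals first).
# Only the patterns for _tok_1 and _tok_2 come from the grammar in this
# post; the names, the ordering and the other patterns are assumptions.
TOKENS = [
    ("_tok_1", re.compile(r"\\.")),
    ("_tok_2", re.compile(r"[^\\()]+")),
    ("integer", re.compile(r"\d+")),
    ("start_string", re.compile(r"\(")),
    ("end_string", re.compile(r"\)")),
]


def tokenize(text):
    """Split text into (name, value) pairs without any grammar context."""
    pos = 0
    result = []
    while pos < len(text):
        for name, pattern in TOKENS:
            match = pattern.match(text, pos)
            if match:
                # The first pattern that matches wins, whatever the parser
                # would have expected at this point.
                result.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError("no token matches at position %d" % pos)
    return result


if __name__ == "__main__":
    print(tokenize("12345"))
    print(tokenize("(Foo)"))
```

Under these assumptions "12345" comes out as a single _tok_2 token and
never as an integer, which is what the trace above seems to show.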

Since I'm a parsing noob, I don't think TPG (which is FREAKING
AMAZING, seriously!) is at fault but rather my understanding of TPG. Can
someone help me with this?

I've uploaded a complete working example to play around with here:

http://wikisend.com/download/642120/example.tar.gz
(If the link doesn't work, please tell me and I'll find somewhere else
to host it.)

Thank you so much for your help,
Best regards,
Johannes

-- 
>> Where EXACTLY did you predict the earthquake again?
> Not publicly, at least!
Ah, the latest and to this day most ingenious stunt of our great
cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$1 at speranza.aioe.org>


