pyparsing Combine without merging sub-expressions

Sat Jan 20 15:49:52 EST 2007

Within a larger pyparsing grammar, I have something that looks like::

     wsj/00/wsj_0003.mrg

When parsing this, I'd like to keep both the full string and its
AAA_NNNN substring (here, wsj_0003), so I'd like something like::

     >>> foo.parseString('wsj/00/wsj_0003.mrg')
     (['wsj/00/wsj_0003.mrg', 'wsj_0003'], {})

How do I go about this? I was using something like::

     >>> import pyparsing as pp
     >>> digits = pp.Word(pp.nums)
     >>> alphas = pp.Word(pp.alphas)
     >>> wsj_name = pp.Combine(alphas + '_' + digits)
     >>> wsj_path = pp.Combine(alphas + '/' + digits + '/' + wsj_name +
     ... '.mrg')

But of course then all I get back is the full path::

     >>> wsj_path.parseString('wsj/00/wsj_0003.mrg')
     (['wsj/00/wsj_0003.mrg'], {})
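
One idea, which I have not checked across pyparsing versions, is to keep the outer Combine and tag the inner expression with a results name. The sketch below assumes that results names defined inside a Combine survive the joining step. The label 'name' is an arbitrary choice of mine, and the sub-expression comes back as a named result rather than as a second list item, so it is not quite the output shown above:

```python
import pyparsing as pp

digits = pp.Word(pp.nums)
alphas = pp.Word(pp.alphas)

# Tag the inner expression with a results name ('name' is an arbitrary label).
wsj_name = pp.Combine(alphas + '_' + digits)('name')

# The outer Combine still joins everything into a single token and
# still refuses whitespace between the pieces.
wsj_path = pp.Combine(alphas + '/' + digits + '/' + wsj_name + '.mrg')

result = wsj_path.parseString('wsj/00/wsj_0003.mrg')
full_path = result[0]        # the joined path string
short_name = result['name']  # the tagged sub-expression
```

If the assumption holds, full_path is the whole path and short_name is the AAA_NNNN part.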

I could leave off the final Combine and add a parse action::

     >>> wsj_path = alphas + '/' + digits + '/' + wsj_name + '.mrg'
     >>> def parse_wsj_path(string, index, tokens):
     ...     wsj_name = tokens[4]
     ...     return ''.join(tokens), wsj_name
     ...
     >>> wsj_path.setParseAction(parse_wsj_path)
     >>> wsj_path.parseString('wsj/00/wsj_0003.mrg')
     ([('wsj/00/wsj_0003.mrg', 'wsj_0003')], {})

But that allows whitespace between the pieces of the path, which
should not be accepted::

     >>> wsj_path.parseString('wsj / 00 / wsj_0003.mrg')
     ([('wsj/00/wsj_0003.mrg', 'wsj_0003')], {})

How do I make sure no whitespace intervenes, and still have access to 
the sub-expression?
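
One possibility I can think of, though I don't know whether it is the intended way, is to keep the parse-action version and call leaveWhitespace() on the whole expression, so that none of its pieces skip leading whitespace. A minimal sketch, assuming the parse action returns a list (not a tuple) so that the two strings come back as separate tokens:

```python
import pyparsing as pp

digits = pp.Word(pp.nums)
alphas = pp.Word(pp.alphas)
wsj_name = pp.Combine(alphas + '_' + digits)

# No outer Combine here; leaveWhitespace() turns off whitespace
# skipping for this expression and its sub-expressions.
wsj_path = (alphas + '/' + digits + '/' + wsj_name + '.mrg').leaveWhitespace()

def parse_wsj_path(string, index, tokens):
    # tokens arrive as: 'wsj', '/', '00', '/', 'wsj_0003', '.mrg'
    # Return a list so each item becomes its own token.
    return [''.join(tokens), tokens[4]]

wsj_path.setParseAction(parse_wsj_path)

tokens = wsj_path.parseString('wsj/00/wsj_0003.mrg').asList()

# Check whether a path with embedded spaces is still accepted.
try:
    wsj_path.parseString('wsj / 00 / wsj_0003.mrg')
    whitespace_accepted = True
except pp.ParseException:
    whitespace_accepted = False
```

One side effect is that leaveWhitespace() also stops the expression from skipping whitespace in front of the path, which may or may not matter inside the larger grammar.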

Thanks,

STeVe