a regexp riddle: re.search(r'(?:(\w+), |and (\w+))+', 'whatever a, bbb, and c') =? ('a', 'bbb', 'c')

Phlip phlip2005 at gmail.com
Thu Nov 25 14:57:33 EST 2010


> Accepting input from a human is fraught with dangers and edge cases.

> Here's a non-regex solution

Thanks all for playing! And as usual I forgot a critical detail:

I'm writing a matcher for a Morelia /viridis/ Scenario step, so the
matcher must be a single regexp.

  http://c2.com/cgi/wiki?MoreliaViridis

I'm avoiding the current situation, where Morelia pulls out (.*), and
the step handler "manually" splits that up with:

  flags = re.split(r', (?:and )?', flags)

That means I already had a brute-force version. A regexp version is
better because it validates input, which matters especially in
Morelia: (.*) accepts anything, while (\w+) accepts only word
characters.
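
As a rough sketch of the difference (the malformed step text below is
made up for illustration, and the patterns are simplified to a single
key), the split works on well-formed input, but only the (\w+) version
rejects bad input at match time:

```python
import re

# Current approach: Morelia captures everything with (.*), then the
# step handler splits the list by hand.
flags = 'apple, barley, and flax'
print(re.split(r', (?:and )?', flags))   # ['apple', 'barley', 'flax']

# (.*) accepts anything, so malformed input reaches the handler;
# (\w+) rejects it when the step is matched.
bad = 'Alice has crypto keys ???'   # hypothetical malformed step
loose = re.match(r'(\w+) has crypto keys (.*)', bad)
strict = re.match(r'(\w+) has crypto keys (\w+)$', bad)
print(loose is not None, strict is None)  # True True
```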

So if the step says:

  Alice has crypto keys apple, barley, and flax

Then the step handler could say (if this worked):

  def step_user_has_crypto_keys_(self, user, *keys):
      r'(\w+) has crypto keys (?:(\w+), )+and (\w+)'

      # assert that user with those keys here

That does not work because "a capturing group only remembers the last
match". This would appear to be an irritating 'feature' in Regexp. The
total match is 'apple, barley, and flax', but the stored groups behave
as if each () were a slot, so (\w+)+ would not store "more than one
group". Unless there's a Regexp workaround to mean "arbitrary number
of slots for each ()", then I /might/ go with this:

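   import re

   # The pattern is split into two adjacent raw strings so the line
   # wrap does not break the string literal.
   got = re.findall(r'(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?'
                    r'(?:(\w+), and )?(\w+)$', 'whatever a, bbb, and c')
   print(got)  #  [('a', '', '', '', 'bbb', 'c')]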
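
For comparison, here is a small check of the repeated-group pattern
from the step handler, run outside Morelia with plain re.search. It is
only a sketch of the "last match only" behavior: the whole step
matches, but 'apple' never reaches groups().

```python
import re

step = 'Alice has crypto keys apple, barley, and flax'
pattern = r'(\w+) has crypto keys (?:(\w+), )+and (\w+)'

m = re.search(pattern, step)
print(m.group(0))   # Alice has crypto keys apple, barley, and flax

# The repeated group keeps only its final iteration, so 'apple' is lost.
print(m.groups())   # ('Alice', 'barley', 'flax')
```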

The trick is to simply paste in a high number of (?:(\w+), )?
segments, assuming that nobody should plug in too many. Behavior
Driven Development scenarios should be readable and not run-on.
(Morelia has a table feature for when you actually need lots of
arguments.)
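
The pasting can also be done programmatically, which makes the slot
limit explicit. This is only a sketch: padded_list_pattern and its
max_items parameter are hypothetical names, not part of Morelia, and a
Morelia step would still need the resulting pattern written out as a
literal in its docstring.

```python
import re

def padded_list_pattern(max_items=6):
    # Hypothetical helper, not part of Morelia: (max_items - 2) optional
    # "word, " slots, one optional "word, and " slot, then one required
    # final word anchored at the end of the string.
    slot = r'(?:(\w+), )?'
    return slot * (max_items - 2) + r'(?:(\w+), and )?(\w+)$'

pattern = padded_list_pattern(6)
got = re.findall(pattern, 'whatever a, bbb, and c')
print(got)    # [('a', '', '', '', 'bbb', 'c')]

# Unused slots come back as empty strings; drop them to get the list.
keys = [k for k in got[0] if k]
print(keys)   # ['a', 'bbb', 'c']
```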

Next question: Does re.search() return a match object that I can get
('a', '', '', '', 'bbb', 'c') out of? The calls to groups() and such
always return this crazy ('a', 2, 'bbb', 'c') thing that would disturb
my user-programmers.
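
One possible answer, sketched with the standard re module: a match
object's groups() method takes an optional default that is substituted
for every group that did not take part in the match. Without it those
groups come back as None; passing '' gives the same tuple findall()
produces.

```python
import re

pattern = (r'(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?'
           r'(?:(\w+), and )?(\w+)$')
m = re.search(pattern, 'whatever a, bbb, and c')

# Groups that did not participate are None by default...
print(m.groups())     # ('a', None, None, None, 'bbb', 'c')

# ...unless a default value is supplied.
print(m.groups(''))   # ('a', '', '', '', 'bbb', 'c')
```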

--
  Phlip


