some kind of detector, need advices...

Lonnie Princehouse finite.automaton at gmail.com
Thu Jul 15 17:49:52 EDT 2004


I'm not really sure I followed that either, but here's a restatement
of the problem according to my feeble understanding:

Joh wants to break a string into tokens (in this case, just separated
by whitespace) and then perform pattern matching based on the tokens. 
He then wants to find the span of matched patterns, in terms of token
numbers.

For example, given the input "this is a test string" and the pattern
"is .* test", the program will first tokenize the string:

  tokens = ['this', 'is', 'a', 'test', 'string']

...and then will match ['is', 'a', 'test'] and return the indices of
the matched interval [(1, 3)].   (I don't know if that interval is
supposed to be inclusive.)

Note that .* would match zero or more tokens, not characters. 
Additionally, it seems that xml-ish tags ("<atag>" in the example)
would be ignored.

Implementing this from scratch would require hacking together some
kind of LR parser.  Blah.  Fortunately, because tokenization is
trivial, it is possible to translate all of the "detectors" (aka
token-matching patterns) directly into regular expressions; then, all
you have to do is map the match object intervals (indexed by
character) to token intervals (indexed by token).

To translate a "detector" into a proper regular expression, just make
a few substitutions:

import re

def detector_to_re(detector):
    """ translate a token pattern into a regular expression """
    # could be more efficient, but this is good for readability
    # (backslashes in the replacement strings are doubled so that
    #  re.sub inserts a literal backslash)

    # "." => "(\S+)"    (match one token)
    detector = re.sub(r'\.', r'(\\S+)', detector)

    # whitespace block => "(\s+)"  (match a stretch of whitespace)
    detector = re.sub(r'\s+', r'(\\s+)', detector)

    return detector
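
For the example pattern, the translation comes out something like this
(shown as the interpreter would repr it):

  >>> detector_to_re("is .* test")
  'is(\\s+)(\\S+)*(\\s+)test'

Note that ".*" turns into "(\S+)*", which can't cross whitespace, so as
translated it really spans at most one token rather than "zero or more".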

def apply_detector(detector, data):

    # ignore tags; strip them first so that character offsets line up
    # with the text the detector is matched against
    data = re.sub(r'<.*?>', '', data)

    # compile mapping of character offset -> token index
    token_indices = {}
    for i, match in enumerate(re.finditer(r'\S+', data)):
        token_indices[match.start()] = i
        token_indices[match.end()] = i

    detector_re = re.compile(detector_to_re(detector))

    # this assumes a detector match starts and ends exactly on
    # token boundaries
    intervals = []
    for match in detector_re.finditer(data):
        intervals.append(
            (
                token_indices[match.start()],
                token_indices[match.end()]
            )
        )

    return intervals
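
As a quick sanity check against the example above (assuming the two
functions as given; token indices are computed over the tag-stripped
string):

  >>> apply_detector("is .* test", "this is a test string")
  [(1, 3)]
  >>> apply_detector("is .* test", "this is <atag> a test string")
  [(1, 3)]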


On second thought, it's probably best to throw out this whole scheme
and just use regular expressions, even if it's less convenient.  That
way you don't mix regexp and token-expression syntaxes (like "th.* *"
would do...)
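
For what it's worth, a rough sketch of that alternative: write an
ordinary regular expression against the raw string, then map the
character span of each match back to token indices afterwards.  The
helper name and the bisect-based lookup are just one way to do it:

import re
from bisect import bisect_right

def char_span_to_token_span(data, start, end):
    # inclusive token interval covering the character interval [start, end)
    token_starts = [m.start() for m in re.finditer(r'\S+', data)]
    return (bisect_right(token_starts, start) - 1,
            bisect_right(token_starts, end - 1) - 1)

data = "this is a test string"
m = re.search(r'is\s+\S+\s+test', data)   # plain regexp, no token syntax
if m:
    print(char_span_to_token_span(data, m.start(), m.end()))   # (1, 3)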


