some kind of detector, need advice...

Thu Jul 15 02:37:37 EDT 2004

Hello,

(sorry, this is long)

I think I have missed something in the code below. I would like to
design some kind of detector with Python, but I feel completely stuck
now and need some advice to move forward :(

data = "it is an <atag> example of the kind of </atag> data it must handle and another kind of data".split(" ")
(actually the data is split line by line from a file, and the lines
contain more than simple words; splitting on ' ' is just for posting here)

I would like to be able to write some kind of easy rule like:
detect1 = """th.* kind of data"""
or better:
### the second '*' acts as a wildcard, a bit like in re:
### some sort of "skip zero or more lines"
detect2 = """th.* * data"""
which would give me the spans where it matched, here:
[(6, 11), (15, 19)]

I have written the code below, which handles detect1, but I am still
unable to adapt it to detect2. I think I am missing a step back
(backtracking) when a match fails.
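
To make the "step back" idea concrete, here is a rough sketch of a backtracking matcher, kept separate from my Rule class further down. It is only one possible approach: the name find_spans is made up for illustration, and having '*' try the shortest skip first is an assumption on my part, not something I have settled on.

```python
import re

def find_spans(rule, lines, ignore=None, separator=" "):
    """Return (start, end) index spans where rule matches; end is exclusive.

    Hypothetical helper: a bare '*' token skips zero or more lines.
    """
    # Compile every token except the bare '*' wildcard marker.
    tokens = [t if t == "*" else re.compile(t) for t in rule.split(separator)]
    # Positions of the lines that are not ignored.
    kept = [i for i, line in enumerate(lines) if not (ignore and ignore(line))]

    def match_from(t, k):
        """Match tokens[t:] at kept[k:]; return the end position or None."""
        if t == len(tokens):
            return k
        if tokens[t] == "*":
            # Try skipping 0, 1, 2, ... lines (assumed: shortest first).
            # When the rest of the rule fails, step back and skip one more.
            for nxt in range(k, len(kept) + 1):
                end = match_from(t + 1, nxt)
                if end is not None:
                    return end
            return None
        if k < len(kept) and tokens[t].search(lines[kept[k]]):
            return match_from(t + 1, k + 1)
        return None

    spans, k = [], 0
    while k < len(kept):
        end = match_from(0, k)
        if end is not None and end > k:
            spans.append((kept[k], kept[end - 1] + 1))
            k = end  # resume after the detected span
        else:
            k += 1
    return spans
```

On the sample data, with the same ignore function, this should give the same spans for detect1 and detect2. The recursion depth is bounded by the number of tokens in the rule, not by the number of lines.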

>>> import re
>>> def ignore(s):
	###	skip markup lines such as <atag> and </atag>
	return s.startswith("<")

>>> class Rule:
	def __init__(self, rule, separator = " "):
		self.rule = tuple(rule.split(separator))
		self.length = len(self.rule)
		self.compiled = []
		self.filled = 0
		for i in range(self.length):
			current = self.rule[i]
			if current == '*':
				###	special case, one may advance...
				self.compiled.append('*')
			else:
				self.filled += 1
				self.compiled.append(re.compile(current))
		self.compiled = tuple(self.compiled)
	###
	def match(self, lines, ignore = None):
		spans = []
		i, current, memorized, matched = 0, 0, None, None
		while 1:
			if i == len(lines):
				break
			line = lines[i]
			i += 1
			print "%3d: %s (%s)" % (i, line, current),
			if ignore and ignore(line):
				print ' - ignored'
				continue
			regexp = self.compiled[current]
			if regexp == '*':
				### HERE I NEED SOME ADVICE...
				pass
			elif hasattr(regexp, 'search') and regexp.search(line):
				###	match current pattern
				print ' + matched',
				matched = True
			else:
				current, memorized, matched = 0, None, None
			if matched:				
				if memorized is None:
					memorized = i - 1
				if current == self.filled - 1:
					print " + detected!",
					spans.append((memorized, i))
					current, memorized = 0, None
				current += 1
			print
		return spans

>>> data = "it is an <atag> example of the kind of </atag> data it must handle and another kind of data".split(" ")
>>> detect = """th.* kind of data"""
>>> r = Rule(detect, ' ') ; r.match(data, ignore)
  1: it (0)
  2: is (0)
  3: an (0)
  4: <atag> (0)  - ignored
  5: example (0)
  6: of (0)
  7: the (0)  + matched
  8: kind (1)  + matched
  9: of (2)  + matched
 10: </atag> (3)  - ignored
 11: data (3)  + matched  + detected!
 12: it (1)
 13: must (0)
 14: handle (0)
 15: and (0)
 16: another (0)  + matched
 17: kind (1)  + matched
 18: of (2)  + matched
 19: data (3)  + matched  + detected!
[(6, 11), (15, 19)]
### these are actually list indexes (end exclusive); add 1 to the start
### to get the line number
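
As a small illustration of how the spans relate to the numbered trace above, here is a sketch that slices the list with a span and converts spans to 1-based line numbers. The helper name to_line_numbers is made up for this example.

```python
def to_line_numbers(spans):
    """Convert (start, end) list-index spans (end exclusive) into
    inclusive 1-based (first_line, last_line) pairs."""
    return [(start + 1, end) for start, end in spans]

data = ("it is an <atag> example of the kind of </atag> data it "
        "must handle and another kind of data").split(" ")
spans = [(6, 11), (15, 19)]

for start, end in spans:
    # A span can be used directly as a slice of the original list;
    # ignored lines inside the span (like </atag>) are still included.
    print(data[start:end])

print(to_line_numbers(spans))
```

So (6, 11) corresponds to lines 7 through 11 in the trace, from "the" to "data".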