[Tutor] regex sub/super patterns ( was:advice on regex matching for dates?)

spir denis.spir at free.fr
Fri Dec 12 17:40:39 CET 2008


Lie Ryan a écrit :

 >> I just found a simple, but nice, trick to make regexes less unlegible.
 >> Using substrings to represent sub-patterns. E.g. instead of:
 >>
 >> p =
 >> re.compile(r'(?P<month>January|February|March|April|May|June|July|
 > August|September|October|November|December)\s(?P<day>\d{1,2}),\s(?P<year>
 > \d{4})')
 >> write first:
 >>
 >> month =
 >> r'(?P<month>January|February|March|April|May|June|July|August|September|
 > October|November|December)'
 >> day = r'(?P<day>\d{1,2})'
 >> year = r'(?P<year>\d{4})'
 >>
 >> then:
 >> p = re.compile( r"%s\s%s,\s%s" % (month,day,year) ) or even:
 >> p = re.compile( r"%(month)s\s%(day)s,\s%(year)s" %
 >> {'month':month,'day':day,'year':year} )
 >>
 >
 > You might realize that that wouldn't help much and might even hide bugs
 > (or make it harder to see). regex IS hairy in the first place.
 >
 > I'd vote for namespace in regex to allow a pattern inside a special pair
 > tags to be completely ignorant of anything outside it. And also I'd vote
 > for a compiled-pattern to be used as a subpattern. But then, with all
 > those limbs, I might as well use a full-blown syntax parser instead.

Just tried to implement something like that for fun. The trick I chose is to 
reuse the {} in pattern strings, as (as far as I know) the curly braces are 
used only for repetition; hence there should be not conflict. A format would 
then look like that:

	format = r"...{subpatname}...{subpatname}..."

For instance:

	format = r"record: {num}{n} -- name:{id}"

(Look at the code below.)

Found 2 issues:
-1- Regex patterns haven't any 'string' attribute to recover the string they 
have beeen built from. As they aren't python objects, we cannot set anything on 
them. So that we need to play with strings directly. (Which means that, once 
the sub-pattern tested, we must copy its string.)
-2- There must be a scope to look for subpattern strings per name. the only 
practicle solution I could think at is to make super-patterns ,of full 
grammars, instances of a class that copes with the details. Subpattern strings 
have to be attrs of this object.

_______________________________________________
from re import compile as pattern
class Grammar(object):
	identifier = "[a-zA-Z_][a-zA-Z_0-9]*"
	sub_pattern = pattern("\{%s\}" % identifier)
	def subString(self,result):
		name = result.group()[1:-1]
		try:
			return self.__dict__[name]
		except KeyError:
			raise AttributeError("Cannot find sub-pattern string '%s'."	% name)
	def makePattern(self, format=None):
		format = self.format if format is None else format
		self.pattern_string = Grammar.sub_pattern.sub(self.subString,format)
		print "%s -->\n%s" % (self.format,self.pattern_string)
		self.pattern = pattern(self.pattern_string)
		return self.pattern

if __name__ == "__main__":
	record = Grammar()
	record.num	= r"n[°\.oO]\ ?"
	record.n 	= r"[0-9]+"
	record.id 	= r"[a-zA-Z_][a-zA-Z_0-9]*"
	record.format = r"record: {num}{n} -- name:{id}"
	record.makePattern()
	text = """
	record: no 123 -- name:blah
	record: n.456 -- name:foo
	record: n°789 -- name:foo_bar
	"""
	result = record.pattern.findall(text)
	print result,'\n'
	# with format attr and invalid format
	bloo_format = r"record: {num}{n} -- name:{id}{bloo_bar}"
	record.makePattern(bloo_format)
	result = record.pattern.findall(text)
	print result




More information about the Tutor mailing list