[Tutor] advice on regex matching for dates?

Fri Dec 12 12:43:46 CET 2008

On Thu, 11 Dec 2008 23:38:52 +0100, spir wrote:

> Serdar Tumgoren a écrit :
>> Hey everyone,
>> 
>> I was wondering if there is a way to use the datetime module to check
>> for variations on a month name when performing a regex match?
>> 
>> In the script below, I created a regex pattern that checks for dates in
>> the following pattern:  "August 31, 2007". If there's a match, I can
>> then print the capture date and the line from which it was extracted.
>> 
>> While it works in this isolated case, it struck me as not very
>> flexible. What happens when I inevitably get data that has dates
>> formatted in a different way? Do I have to create some type of library
>> that contains variations on each month name (i.e. - January, Jan., 01,
>> 1...) and use that to parse each line?
>> 
>> Or is there some way to use datetime to check for date patterns when
>> using regex? Is there a "best practice" in this area that I'm unaware
>> of in this area?
>> 
>> Apologies if this question has been answered elsewhere. I wasn't sure
>> how to research this topic (beyond standard datetime docs), but I'd be
>> happy to RTM if someone can point me to some resources.
>> 
>> Any suggestions are welcome (including optimizations of the code
>> below).
>> 
>> Regards,
>> Serdar
>> 
>> #!/usr/bin/env python
>> 
>> import re, sys
>> 
>> sourcefile = open(sys.argv[1],'r')
>> 
>> pattern =
>> re.compile(r'(?P<month>January|February|March|April|May|June|July|
August|September|October|November|December)\s(?P<day>\d{1,2}),\s(?P<year>
\d{4})')
>> 
>> pattern2 = re.compile(r'Return to List')
>> 
>> counter = 0
>> 
>> for line in sourcefile:
>>     x = pattern.search(line)
>>     break_point = pattern2.match(line)
>> 
>>     if x:
>>         counter +=1
>>         print "%s %d, %d <== %s" % ( x.group('month'),
>>         int(x.group('day')),
>> int(x.group('year')), line ),
>>     elif break_point:
>>         break
>> 
>> print counter
>> sourcefile.close()
> 
> I just found a simple, but nice, trick to make regexes less unlegible.
> Using substrings to represent sub-patterns. E.g. instead of:
> 
> p =
> re.compile(r'(?P<month>January|February|March|April|May|June|July|
August|September|October|November|December)\s(?P<day>\d{1,2}),\s(?P<year>
\d{4})')
> 
> write first:
> 
> month =
> r'(?P<month>January|February|March|April|May|June|July|August|September|
October|November|December)'
> day = r'(?P<day>\d{1,2})'
> year = r'(?P<year>\d{4})'
> 
> then:
> p = re.compile( r"%s\s%s,\s%s" % (month,day,year) ) or even:
> p = re.compile( r"%(month)s\s%(day)s,\s%(year)s" %
> {'month':month,'day':day,'year':year} )
> 

You might realize that that wouldn't help much and might even hide bugs 
(or make it harder to see). regex IS hairy in the first place. 

I'd vote for namespace in regex to allow a pattern inside a special pair 
tags to be completely ignorant of anything outside it. And also I'd vote 
for a compiled-pattern to be used as a subpattern. But then, with all 
those limbs, I might as well use a full-blown syntax parser instead.