pattern matching

Wed Feb 23 21:33:03 EST 2011

In article <kaidnYsdesvyI_jQnZ2dnUVZ_qGdnZ2d at insightbb.com>,
 monkeys paw <monkey at joemoney.net> wrote:

> if I have a string such as '<td>01/12/2011</td>' and i want
> to reformat it as '20110112', how do i pull out the components
> of the string and reformat them into a YYYYDDMM format?
> 
> I have:
> 
> import re
> 
> test = re.compile('\d\d\/')
> f = open('test.html')  # This file contains the html dates
> for line in f:
>      if test.search(line):
>          # I need to pull the date components here

My first thought is that any attempt to parse HTML by using regex is 
doomed to failure.  HTML is meant to be parsed by an HTML parser.  
Python gives you several to pick from; the best that I know of is the 
third-party lxml package (http://lxml.de/).

My second thought is that my first thought was correct.