[Tutor] Regular Expression question

Scott Chapman scott_list@mischko.com
Fri Apr 18 14:58:01 2003


On Friday 18 April 2003 11:46, Michael Janssen wrote:
> On Thu, 17 Apr 2003, Scott Chapman wrote:
> > Is it possible to make a regular expression that will match:
> > '<html blah>' or '<html>'
> > without having to make it into two complete expressions separated by a
> > pipe: r'<html[ \t].+?>|<html>'
> >
> > I want it to require a space or tab and at least one character before the
> > closing bracket, after 'html', or just the closing bracket.
>
> import re
>
> def test(expr):
>     for s in ('<html blah>','<html>', '<html:subtype>','<html >',
>               '<html tag1 tag2>'):
>         print "%-18s" % s,
>         mt = re.search(expr, s)
>         if mt:
>             print mt.group()
>         else: print
>
>
> test(r"<html([ \t][^ \t]+?)?>")
> <html blah>        <html blah>
> <html>             <html>
> <html:subtype>
> <html >
> <html tag1 tag2>
>
> r"<html([ \t][^ \t]+?)?>" has a group "([ \t][^ \t]+?)" that matches one
> space or tab followed by a single tag word. The "?" after the group makes
> it optional, so it can occur once or not at all.
>
> NB: re is powerful but not sufficient for real-life HTML, as Magnus has
> already stated today.
>
> Michael

Mike,
Thanks for the tip.  I'm moving forward with this as a first program in 
Python.  I'll have a good look at HTMLParser shortly because of Magnus' post.  
I'm just using this as an exercise and doubt it will ever see production.
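
For reference, here is a minimal sketch of the single-expression form
discussed above. It rests on two assumptions that are not part of the
suggestion quoted above: the group is made non-capturing with (?:...), and
it uses [^>]+ rather than .+? or [^ \t]+? so that a multi-word tag such as
'<html tag1 tag2>' also matches. The name first_match is only a hypothetical
helper for this illustration.

```python
import re

# Optional non-capturing group: either nothing follows 'html', or a space/tab
# plus at least one more non-'>' character before the closing bracket.
# NOTE: [^>]+ is an assumption made for this sketch, not the quoted pattern.
PATTERN = re.compile(r"<html(?:[ \t][^>]+)?>")

def first_match(s):
    """Return the matched text, or None if the pattern does not match."""
    m = PATTERN.search(s)
    return m.group() if m else None

for s in ("<html blah>", "<html>", "<html:subtype>", "<html >",
          "<html tag1 tag2>"):
    print("%-18s %s" % (s, first_match(s)))
```

With this pattern '<html:subtype>' and '<html >' are still rejected, because
the optional group needs a space or tab and at least one further character.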

Thanks!
Scott