Regular Expression help

Edward Elliott nobody at 127.0.0.1
Thu Apr 27 15:09:16 EDT 2006


RunLevelZero wrote:
> 10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]>The
> Price Is Right</a><em>
> 
> All I want is " Price Is Right "
> 
> Here is the re.
> 
> findshows =
> re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

1. A match already remembers everything it matched -- no need to wrap the
entire pattern in parens.  Just call group() (same as group(0)) on the
returned MatchObject.

2. If all you want is the link text, you don't need to do so much matching. 
If you don't need the time, don't match it in the first place.  If you're
using it as a marker, try matching each time with r'[\d:]{4,5}[ap]m'.  Not
as exact, but a bit simpler.  Or, looser still, just r'[\d:apm]{6,7}'.

3. To grab what's inside the link: r'<a[^>]*>(.*?)</a>'

4. If the link text itself contains html tags, you'll have to strip those
off separately.  Extracting the text from arbitrarily nested html tags in
one shot requires a parser, not a regex.

5. If you're just going to run this regex repeatedly on an html doc and make
a list of the results, it's easier to read the whole doc into a string and
then use re.findall.
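
To make points 1, 2, 3 and 5 concrete, here is a rough sketch.  The sample
document is made up for illustration: the href values are placeholders (the
real URL was snipped) and the second listing is invented.  re.DOTALL and the
whitespace cleanup are additions here, in case link text wraps across lines.
Treat it as a starting point, not the one right way.

```python
import re

# Hypothetical stand-in for the page.  The hrefs are placeholders,
# not the real (snipped) URLs, and the second listing is invented.
doc = (
    '10:00am - 11:00am:</b> <a href="/listing/1">The\n'
    'Price Is Right</a><em>\n'
    '11:00am - 12:00pm:</b> <a href="/listing/2">Guiding Light</a><em>\n'
)

# Point 1: no outer parens needed; group() returns the whole match.
slot_re = re.compile(r'\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D')
first_slot = slot_re.search(doc).group()

# Point 2: simpler, less exact patterns for a single time.
times = re.findall(r'[\d:]{4,5}[ap]m', doc)
loose_times = re.findall(r'[\d:apm]{6,7}', doc)

# Points 3 and 5: one findall over the whole document.  With a single
# group in the pattern, findall returns just that group for each match.
raw_titles = re.findall(r'<a[^>]*>(.*?)</a>', doc, re.DOTALL)

# Collapse any line breaks or runs of spaces inside the link text.
titles = [' '.join(t.split()) for t in raw_titles]
```

For a real page you would read the file into doc first, e.g. with
open(path).read(), where path is whatever your file is called.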
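
For point 4, one way to strip nested tags from the link text is to let a
parser collect only the text nodes.  This is a minimal sketch assuming
Python 3, where the module is html.parser (it was HTMLParser in Python 2).
TextExtractor and strip_tags are names invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text outside of tags.
        self.parts.append(data)


def strip_tags(fragment):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


plain = strip_tags('The <b>Price</b> Is <i>Right</i>')
```

You would run this on each string captured by the link regex, after the
regex has done the coarse extraction.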


> I have used a for loop to remove the extra data but then it ruins the
> list that I am building.  Basically I want the list to be something
> like this.
> 
> [[Government Access], [Price Is Right, Guiding Light, Another show]]
> 
> the for loop just comma deliminates all of them so I lose the list in a
> list that I need.  I hope I have explained this well enough.  Any help
> or ideas would be appreciated.

No one can help with that unless you show us how you're building your list.




