Regex single quotes in scraper script?

Christopher T King squirrel at WPI.EDU
Sat Jul 17 01:07:07 EDT 2004


On Fri, 16 Jul 2004, Rock wrote:

> Being a real newbie with this I think I found the area of code that parses
> the href.  It is in a file called parsefns.py
> the full excerpt is listed below but here is the regex line that I believe
> is not dealing with single quote.
> 
> m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)
> 
> I have tried many different variations but no luck and no luck getting hold
> of the author.  Any ideas?  Thx.

Good job tracking that down.  Methinks you'll want to change it to read 
thusly:

m = re.search(r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]', text, re.I)
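As a quick sanity check, here is a minimal sketch that runs that pattern over the three common ways of writing an href.  The helper name first_href and the sample tags are made up for illustration; they are not taken from parsefns.py.

```python
import re

# The relaxed pattern from the line above, compiled once.
HREF_RE = re.compile(r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]', re.I)

def first_href(text):
    """Return the first href value found in text, or None if there is none."""
    m = HREF_RE.search(text)
    return m.group(1) if m else None

samples = [
    '<a href="page.html">',              # double-quoted
    "<a href='page.html'>",              # single-quoted
    '<a href=page.html>',                # unquoted
    '<A HREF = "page.html" class="x">',  # mixed case, spaces around =
]
for s in samples:
    print(first_href(s))   # prints page.html for every sample
```

With the original pattern, the single-quoted sample comes back with the quotes still attached ('page.html' including the quote characters), which is presumably what was tripping up the scraper.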

The relaxed pattern may break some sites, though: any URL that itself
contains a single quote will be cut off at that quote (but such URLs are
broken anyway).  A proper fix would require a tad more work (i.e. either a
much, much messier regex or a change in the function), and it's really
late right now ;)
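For what it's worth, the "messier regex" route might look something like the sketch below.  It is only one possible reading of a proper fix, not code from the scraper: the name extract_href is hypothetical, and the pattern simply tries the double-quoted, single-quoted, and unquoted forms as separate alternatives, so a quote of the other kind (or a space) inside a quoted value is kept rather than ending the match.

```python
import re

# Three alternatives, one capture group each:
#   1. "..."  double-quoted value (may contain ' and spaces)
#   2. '...'  single-quoted value (may contain " and spaces)
#   3. bare   unquoted value, ends at whitespace, > or a quote
HREF_RE = re.compile(
    r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>"']+))''',
    re.I,
)

def extract_href(text):
    """Return the first href value in text, or None if no href is found."""
    m = HREF_RE.search(text)
    if m is None:
        return None
    # Exactly one of the three groups takes part in any match.
    return next(g for g in m.groups() if g is not None)

print(extract_href('<a href="it\'s.html">'))   # it's.html
print(extract_href("<a href='a b.html'>"))     # a b.html
print(extract_href('<a href=page.html>'))      # page.html
```

The one-line change above would return just "it" for the first sample, since it stops at the first quote of either kind.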



