Regex single quotes in scraper script?

Rock rock at py-nosp
Sat Jul 17 09:58:42 EDT 2004


"Christopher T King" <squirrel at WPI.EDU> wrote in message
news:Pine.LNX.4.44.0407170051220.27083-100000 at ccc2.wpi.edu...
> On Fri, 16 Jul 2004, Rock wrote:
>
> > Being a real newbie with this I think I found the area of code that parses
> > the href.  It is in a file called parsefns.py
> > the full excerpt is listed below but here is the regex line that I believe
> > is not dealing with single quote.
> >
> > m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)
> >
> > I have tried many different variations but no luck and no luck getting hold
> > of the author.  Any ideas?  Thx.
>
> Good job tracking that down.  Methinks you'll want to change it to read
> thusly:
>
> m = re.search(r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]', text, re.I)
>

woohoo! that fixed my problem with sites that use single quotes, and double
quotes still seem to work just fine.

Thanks man.
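
For anyone following along, here is a quick check of the two patterns side
by side. It is only a sketch: the helper name first_href and the example.com
URLs are made up for illustration and are not part of parsefns.py.

```python
import re

# Original pattern from parsefns.py: only knows about double quotes.
OLD = re.compile(r'href\s*=\s*"?([^>" ]+)["> ]', re.I)
# Pattern suggested in the reply: also accepts single quotes.
NEW = re.compile(r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]', re.I)


def first_href(pattern, text):
    """Return the first captured href value, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


# Hypothetical sample tags, one per quoting style.
double_quoted = '<a href="http://example.com/a.html">a</a>'
single_quoted = "<a href='http://example.com/b.html'>b</a>"

for label, pattern in (("old", OLD), ("new", NEW)):
    print(label, repr(first_href(pattern, double_quoted)))
    print(label, repr(first_href(pattern, single_quoted)))
```

With the old pattern the single-quoted link comes back with the quote
characters still attached to the URL, which would explain why those sites
failed; the new pattern strips them.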


> This will possibly break some sites, though (namely those that use single
> quotes in their URLs, but those are broken anyways).  A proper fix would
> require a tad more work (i.e. either a much, much, messier regex or a
> change in the function), and it's really late right now ;)
>
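
One possible shape for the "messier regex" mentioned above, as a sketch only
and not necessarily what was intended: match the double-quoted, single-quoted
and unquoted forms as separate alternatives, so that a quote of one kind may
appear inside a value delimited by the other kind. The names HREF and
extract_href are hypothetical and do not come from parsefns.py.

```python
import re

# Three alternatives, tried in order; exactly one group is set on a match.
HREF = re.compile(
    r'''href\s*=\s*
        (?: "([^"]*)"      # double-quoted value, may contain '
          | '([^']*)'      # single-quoted value, may contain "
          | ([^\s>"']+)    # unquoted value, ends at whitespace or >
        )''',
    re.I | re.X,
)


def extract_href(text):
    """Return the first href value found in text, or None."""
    m = HREF.search(text)
    if m is None:
        return None
    # Pick whichever alternative actually matched.
    return next(g for g in m.groups() if g is not None)


print(extract_href('<a href="it\'s.html">x</a>'))
print(extract_href("<a href='b.html'>x</a>"))
print(extract_href('<a href=c.html>x</a>'))
```

The first sample is the case the simpler one-line fix gets wrong: it stops at
the apostrophe and returns only "it". For anything beyond quick scraping, a
real HTML parser is likely the safer route.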
