Regex single quotes in scraper script?
Rock
rock at py-nosp
Sat Jul 17 09:58:42 EDT 2004
"Christopher T King" <squirrel at WPI.EDU> wrote in message
news:Pine.LNX.4.44.0407170051220.27083-100000 at ccc2.wpi.edu...
> On Fri, 16 Jul 2004, Rock wrote:
>
> > Being a real newbie with this I think I found the area of code that
> > parses the href. It is in a file called parsefns.py
> > the full excerpt is listed below but here is the regex line that I
> > believe is not dealing with single quote.
> >
> > m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)
> >
> > I have tried many different variations but no luck and no luck getting
> > hold of the author. Any ideas? Thx.
>
> Good job tracking that down. Methinks you'll want to change it to read
> thusly:
>
> m = re.search(r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]', text, re.I)
>
Woohoo! That fixed my problem with sites that use single quotes, and double
quotes still seem to work just fine.
Thanks man.
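
For anyone finding this later, here is a small sketch of the difference
between the two patterns. The sample HTML strings and the extract_href
helper are made up for illustration; they are not from parsefns.py.

```python
import re

# The original pattern from the post and the patched one suggested in the reply.
ORIGINAL = r'href\s*=\s*"?([^>" ]+)["> ]'
PATCHED = r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]'

def extract_href(pattern, text):
    """Hypothetical helper: return the first captured href value, or None."""
    m = re.search(pattern, text, re.I)
    return m.group(1) if m else None

# Made-up sample links, one per quoting style.
double_quoted = '<a href="http://example.com/a.html">link</a>'
single_quoted = "<a href='http://example.com/a.html'>link</a>"

# The original pattern treats the single quotes as part of the URL.
print(extract_href(ORIGINAL, single_quoted))  # 'http://example.com/a.html' (quotes included)

# The patched pattern strips either kind of quote.
print(extract_href(PATCHED, single_quoted))   # http://example.com/a.html
print(extract_href(PATCHED, double_quoted))   # http://example.com/a.html
```

So the original was matching single-quoted links all along, just with the
quote characters left stuck to the URL.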
> This will possibly break some sites, though (namely those that use single
> quotes in their URLs, but those are broken anyways). A proper fix would
> require a tad more work (i.e. either a much, much, messier regex or a
> change in the function), and it's really late right now ;)
>
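
One guess at what the "messier regex" could look like: handle the
double-quoted, single-quoted, and unquoted cases as separate alternatives, so
that a quote character only ends the value if it is the same kind that opened
it. This is only a sketch, not tested against the scraper itself; HREF_RE and
find_href are hypothetical names, and the sample strings are invented.

```python
import re

# Three alternatives: "value", 'value', or a bare unquoted value.
HREF_RE = re.compile(
    r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>"']+))''',
    re.I,
)

def find_href(text):
    """Hypothetical helper: return the first href value in text, or None."""
    m = HREF_RE.search(text)
    if m is None:
        return None
    # Exactly one of the three groups takes part in any match.
    return next(g for g in m.groups() if g is not None)

print(find_href('<a href="http://example.com/o\'brien.html">'))  # http://example.com/o'brien.html
print(find_href("<a href='page.html'>"))                         # page.html
print(find_href('<A HREF = page.html>'))                         # page.html
```

With the one-line fix above, the first example would be cut off at the
apostrophe and come back as http://example.com/o.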