[Tutor] RegEx query
Kent Johnson
kent37 at tds.net
Sat Dec 17 14:20:32 CET 2005
Liam Clarke wrote:
> Hi all,
>
> Using Beautiful Soup and regexes.. I've noticed that all the examples
> used regexes like so - anchors = parseTree.fetch("a",
> {"href":re.compile("pattern")} ) instead of precompiling the pattern.
>
> Myself, I have the following code -
>
> >>> z = []
> >>> x = q.findNext("a", {"href":re.compile(".*?thread/[0-9]*?/.*",
> ...     re.IGNORECASE)})
> >>> while x:
> ...     num = x.findNext("td", "tableColA")
> ...     h = (x.contents[0],x.attrMap["href"],num.contents[0])
> ...     z.append(h)
> ...     x = x.findNext("a",{"href":re.compile(".*?thread/[0-9]*?/.*",
> ...         re.IGNORECASE)})
> ...
>
> This gives me a correct set of results. However, using the following -
>
>
> >>> z = []
> >>> pattern = re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE)
> >>> x = q.findNext("a", {"href":pattern)})
> >>> while x:
> ...     num = x.findNext("td", "tableColA")
> ...     h = (x.contents[0],x.attrMap["href"],num.contents[0])
> ...     z.append(h)
> ...     x = x.findNext("a",{"href":pattern} )
>
> will only return the first found tag.
>
> Is the regex only evaluated once or similar?
I don't know why there should be any difference unless BS modifies the compiled regex
object and for some reason needs a fresh one each time. That would be odd and I don't see
it in the source code.
The code above has a syntax error (extra paren in the first findNext() call) - can you
post the exact non-working code?
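
As a quick sanity check that a compiled pattern can be reused, here is a minimal sketch
using only the re module. It does not involve Beautiful Soup at all, and the hrefs are
made up for illustration; they are not taken from the page Liam is parsing.

```python
import re

# Hypothetical hrefs, invented for this sketch.
hrefs = ["/forum/thread/28606/", "/forum/THREAD/12/page", "/forum/index.html"]

# Compile once and reuse the same object for every href.
pattern = re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE)
reused = [bool(pattern.search(h)) for h in hrefs]

# Call re.compile() again for every href, as in the loop that works.
fresh = [bool(re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE).search(h))
         for h in hrefs]

print(reused)  # [True, True, False]
print(fresh)   # [True, True, False]
```

Both lists come out the same, so on the re side nothing is "used up" by the first match.
If the two loops really behave differently, the cause is more likely somewhere else in
the code that was actually run.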
>
> (Also any pointers on how to get negative lookahead matching working
> would be great.
> the regex (/thread/[0-9]*)(?!\/) still matches "/thread/28606/" and
> I'd assumed it wouldn't.
Putting these expressions into Regex Demo is enlightening - the regex matches against
"/thread/2860". [0-9]* first takes all of "28606", the lookahead fails because the next
character is "/", so the regex gives back one digit and the "not /" then succeeds
against the 6.
You don't give an example of what you do want to match so it's hard to know what a better
solution is. Some possibilities:
- match anything except a digit or a slash - [^0-9/]
- match the end of the string - $
- both of the above - ([^0-9/]|$)
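
A short sketch of the difference, assuming the goal is to reject a thread number that is
followed by a slash (that goal is a guess, since no wanted match was given). The test
strings other than "/thread/28606/" are made up for illustration.

```python
import re

s = "/thread/28606/"

# The original pattern: the lookahead only forbids "/" right after the match,
# so the regex backtracks one digit and still succeeds.
m = re.search(r"(/thread/[0-9]*)(?!\/)", s)
print(m.group(1))  # /thread/2860

# Third possibility from the list: the digits must be followed by something
# that is neither a digit nor a slash, or by the end of the string.
fixed = re.compile(r"(/thread/[0-9]*)([^0-9/]|$)")

print(fixed.search(s))                                # None
print(fixed.search("/thread/28606").group(1))         # /thread/28606
print(fixed.search("/thread/28606?page=2").group(1))  # /thread/28606
```

Excluding digits as well as the slash is what stops the backtracking trick: giving back
a digit no longer helps, because the character that follows is then a digit.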
Kent
>
> Regards,
>
> Liam Clarke