[Tutor] RegEx query

Kent Johnson kent37 at tds.net
Sat Dec 17 14:20:32 CET 2005


Liam Clarke wrote:
> Hi all,
> 
> Using Beautiful Soup and regexes.. I've noticed that all the examples
> used regexes like so - anchors = parseTree.fetch("a",
> {"href":re.compile("pattern")} )  instead of precompiling the pattern.
> 
> Myself, I have the following code -
> 
> >>> z = []
> >>> x = q.findNext("a", {"href": re.compile(".*?thread/[0-9]*?/.*",
> ...                                          re.IGNORECASE)})
> >>> while x:
> ...     num = x.findNext("td", "tableColA")
> ...     h = (x.contents[0], x.attrMap["href"], num.contents[0])
> ...     z.append(h)
> ...     x = x.findNext("a", {"href": re.compile(".*?thread/[0-9]*?/.*",
> ...                                             re.IGNORECASE)})
> ...
> 
> This gives me a correct set of results. However, using the following -
> 
> 
> >>> z = []
> >>> pattern = re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE)
> >>> x = q.findNext("a", {"href":pattern)})
> >>> while x:
> ...     num = x.findNext("td", "tableColA")
> ...     h = (x.contents[0], x.attrMap["href"], num.contents[0])
> ...     z.append(h)
> ...     x = x.findNext("a", {"href": pattern})
> 
> will only return the first found tag.
> 
> Is the regex only evaluated once or similar?

I don't know why there should be any difference unless BS modifies the compiled regex 
object and for some reason needs a fresh one each time. That would be odd and I don't see 
it in the source code.

The code above has a syntax error (extra paren in the first findNext() call) - can you 
post the exact non-working code?
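
As a sanity check outside Beautiful Soup, here is a minimal sketch showing that a 
compiled pattern keeps no state between searches, so reusing one should give the same 
results as compiling a fresh one each time. The hrefs list is made up for illustration; 
it is not taken from the page being parsed.

```python
import re

# Hypothetical hrefs, invented for illustration only.
hrefs = ["/forum/thread/28606/", "/forum/Thread/17/page2", "/forum/index.html"]

# Precompiled once and reused for every search.
pattern = re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE)
reused = [h for h in hrefs if pattern.search(h)]

# Compiled inline for every search, as in the version that works.
fresh = [h for h in hrefs
         if re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE).search(h)]

print(reused == fresh)
print(reused)
```

If the two lists agree here, the difference is more likely in how the pattern is passed 
to findNext() than in the regex itself.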
> 
> (Also, any pointers on how to get negative lookahead matching working
> would be great.
> The regex (/thread/[0-9]*)(?!\/) still matches "/thread/28606/" and
> I'd assumed it wouldn't.)

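Putting these expressions into Regex Demo is enlightening - the regex matches 
"/thread/2860". [0-9]* first grabs all of "28606", the lookahead then fails on the 
following "/", so the engine backs up one digit - in other words the "not /" is 
matching against the 6.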

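The same thing can be seen at the interpreter. This is only a small sketch using the 
regex and the string from the question; the names text and m are mine.

```python
import re

text = "/thread/28606/"
m = re.search(r"(/thread/[0-9]*)(?!\/)", text)

# What the group actually captured after backtracking.
print(m.group(1))

# The character right after the match is the one the lookahead tested.
print(text[m.end()])
```

The first print shows the shortened match and the second shows the character that 
satisfied the "not /" test.
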
You don't give an example of what you do want to match so it's hard to know what a better 
solution is. Some possibilities for what should follow the digits:
- match anything except a digit or a slash - [^0-9/]
- match the end of the string - $
- both of the above - ([^0-9/]|$)
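
A rough sketch of the third possibility, assuming the goal is to accept a thread URL 
only when the number is not followed by a slash. The test strings are invented, and the 
lookahead variant is my own rewording of the same idea, not something from the question.

```python
import re

# Digits must be followed by a non-digit, non-slash character or the end.
consuming = re.compile(r"(/thread/[0-9]*)([^0-9/]|$)")

# Same condition as a negative lookahead, which does not consume the
# following character.
lookahead = re.compile(r"(/thread/[0-9]*)(?![0-9/])")

# Hypothetical hrefs for illustration.
samples = ["/thread/28606/", "/thread/28606", "/thread/28606?page=2"]

results = {}
for href in samples:
    m1 = consuming.search(href)
    m2 = lookahead.search(href)
    results[href] = (m1.group(1) if m1 else None,
                     m2.group(1) if m2 else None)
    print(href, "->", results[href])
```

Excluding digits as well as the slash is what stops the engine from backing up into 
the number the way the original regex did.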

Kent

> 
> Regards,
> 
> Liam Clarke



